Closed JMencius closed 4 months ago
Hi @JMencius, That is much slower than expected. If you could share the stderr output of the command that would be useful. As only one core is being used it's probable the pipeline has got stuck processing one large component job. This can sometimes happen at ALT chromosomes, so you could try excluding those chromosomes and running again.
If this issue still persists, would you mind running this script to see where it is getting stuck:
# Run dysgu on one chromosome at a time to find where it gets stuck
while read -r chrom length _
do
# ref.fa.fai columns: name, length, offset, ...
echo "${chrom} 1 ${length}" > "${chrom}.search.bed"
echo "${chrom}"
dysgu run -p 12 --mode nanopore --max-cov 500 -x \
--search "${chrom}.search.bed" \
ref.fa "temp_${chrom}" bam \
> "${chrom}.dysgu.vcf" 2> "${chrom}.stderr"
done < ref.fa.fai
Also you might want to set the --max-cov parameter to auto or some other suitable number. The default is only 200 for long reads, which will result in some regions being excluded with your 174X coverage.
Hi @kcleal Thank you for the quick response. The standard output of the command is shown below.
The specific directories are masked for privacy.
2024-02-27 09:12:51,436 [INFO ] [dysgu-call] Version: 1.6.2
2024-02-27 09:12:51,436 [INFO ] Input file is: {INTPUT_DIR}/dorado0.4.1_hac_hg38_unsorted.bam
2024-02-27 09:12:51,436 [INFO ] call -p 12 --mode nanopore {REF_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta {TEMP_DIR}/temp {BAM_DIR}/dorado0.4.1_hac_hg38_unsorted.bam
2024-02-27 09:12:51,523 [WARNING] Warning: no @RG, using input file name as sample name for output: dorado0.4.1_hac_hg38_unsorted
2024-02-27 09:12:51,524 [INFO ] Sample name: dorado0.4.1_hac_hg38_unsorted
2024-02-27 09:12:51,524 [INFO ] Writing vcf to stdout
2024-02-27 09:12:51,524 [INFO ] Running pipeline
2024-02-27 09:12:51,524 [INFO ] Sequence divergence upper bound 0.02
2024-02-27 09:12:51,524 [INFO ] Building graph with clustering 500000 bp
I will try excluding those ALT chromosomes first.
Additionally, the files in the temp directory are listed below:
chr10.dysgu_chrom.bin chr1_KI270708v1_random.dysgu_chrom.bin chr5.dysgu_chrom.bin chrUn_KI270330v1.dysgu_chrom.bin chrUn_KI270584v1.dysgu_chrom.bin
chr11.dysgu_chrom.bin chr1_KI270709v1_random.dysgu_chrom.bin chr5_GL000208v1_random.dysgu_chrom.bin chrUn_KI270333v1.dysgu_chrom.bin chrUn_KI270588v1.dysgu_chrom.bin
chr11_KI270721v1_random.dysgu_chrom.bin chr1_KI270710v1_random.dysgu_chrom.bin chr6.dysgu_chrom.bin chrUn_KI270337v1.dysgu_chrom.bin chrUn_KI270589v1.dysgu_chrom.bin
chr12.dysgu_chrom.bin chr1_KI270711v1_random.dysgu_chrom.bin chr7.dysgu_chrom.bin chrUn_KI270366v1.dysgu_chrom.bin chrUn_KI270590v1.dysgu_chrom.bin
chr13.dysgu_chrom.bin chr1_KI270712v1_random.dysgu_chrom.bin chr8.dysgu_chrom.bin chrUn_KI270395v1.dysgu_chrom.bin chrUn_KI270591v1.dysgu_chrom.bin
chr14.dysgu_chrom.bin chr1_KI270713v1_random.dysgu_chrom.bin chr9.dysgu_chrom.bin chrUn_KI270420v1.dysgu_chrom.bin chrUn_KI270593v1.dysgu_chrom.bin
chr14_GL000009v2_random.dysgu_chrom.bin chr1_KI270714v1_random.dysgu_chrom.bin chr9_KI270717v1_random.dysgu_chrom.bin chrUn_KI270422v1.dysgu_chrom.bin chrUn_KI270741v1.dysgu_chrom.bin
chr14_GL000194v1_random.dysgu_chrom.bin chr20.dysgu_chrom.bin chr9_KI270718v1_random.dysgu_chrom.bin chrUn_KI270435v1.dysgu_chrom.bin chrUn_KI270742v1.dysgu_chrom.bin
chr14_GL000225v1_random.dysgu_chrom.bin chr21.dysgu_chrom.bin chr9_KI270719v1_random.dysgu_chrom.bin chrUn_KI270438v1.dysgu_chrom.bin chrUn_KI270743v1.dysgu_chrom.bin
chr14_KI270722v1_random.dysgu_chrom.bin chr22.dysgu_chrom.bin chr9_KI270720v1_random.dysgu_chrom.bin chrUn_KI270442v1.dysgu_chrom.bin chrUn_KI270744v1.dysgu_chrom.bin
chr14_KI270723v1_random.dysgu_chrom.bin chr22_KI270731v1_random.dysgu_chrom.bin chrEBV.dysgu_chrom.bin chrUn_KI270448v1.dysgu_chrom.bin chrUn_KI270745v1.dysgu_chrom.bin
chr14_KI270724v1_random.dysgu_chrom.bin chr22_KI270732v1_random.dysgu_chrom.bin chrM.dysgu_chrom.bin chrUn_KI270465v1.dysgu_chrom.bin chrUn_KI270746v1.dysgu_chrom.bin
chr14_KI270725v1_random.dysgu_chrom.bin chr22_KI270733v1_random.dysgu_chrom.bin chrUn_GL000195v1.dysgu_chrom.bin chrUn_KI270466v1.dysgu_chrom.bin chrUn_KI270747v1.dysgu_chrom.bin
chr15.dysgu_chrom.bin chr22_KI270734v1_random.dysgu_chrom.bin chrUn_GL000213v1.dysgu_chrom.bin chrUn_KI270467v1.dysgu_chrom.bin chrUn_KI270748v1.dysgu_chrom.bin
chr15_KI270727v1_random.dysgu_chrom.bin chr22_KI270735v1_random.dysgu_chrom.bin chrUn_GL000214v1.dysgu_chrom.bin chrUn_KI270468v1.dysgu_chrom.bin chrUn_KI270749v1.dysgu_chrom.bin
chr16.dysgu_chrom.bin chr22_KI270736v1_random.dysgu_chrom.bin chrUn_GL000216v2.dysgu_chrom.bin chrUn_KI270507v1.dysgu_chrom.bin chrUn_KI270750v1.dysgu_chrom.bin
chr16_KI270728v1_random.dysgu_chrom.bin chr22_KI270737v1_random.dysgu_chrom.bin chrUn_GL000218v1.dysgu_chrom.bin chrUn_KI270509v1.dysgu_chrom.bin chrUn_KI270751v1.dysgu_chrom.bin
chr17.dysgu_chrom.bin chr22_KI270738v1_random.dysgu_chrom.bin chrUn_GL000219v1.dysgu_chrom.bin chrUn_KI270511v1.dysgu_chrom.bin chrUn_KI270753v1.dysgu_chrom.bin
chr17_GL000205v2_random.dysgu_chrom.bin chr22_KI270739v1_random.dysgu_chrom.bin chrUn_GL000220v1.dysgu_chrom.bin chrUn_KI270512v1.dysgu_chrom.bin chrUn_KI270754v1.dysgu_chrom.bin
chr17_KI270729v1_random.dysgu_chrom.bin chr2.dysgu_chrom.bin chrUn_GL000224v1.dysgu_chrom.bin chrUn_KI270515v1.dysgu_chrom.bin chrUn_KI270755v1.dysgu_chrom.bin
chr17_KI270730v1_random.dysgu_chrom.bin chr2_KI270715v1_random.dysgu_chrom.bin chrUn_GL000226v1.dysgu_chrom.bin chrUn_KI270516v1.dysgu_chrom.bin chrUn_KI270756v1.dysgu_chrom.bin
chr18.dysgu_chrom.bin chr2_KI270716v1_random.dysgu_chrom.bin chrUn_KI270303v1.dysgu_chrom.bin chrUn_KI270519v1.dysgu_chrom.bin chrUn_KI270757v1.dysgu_chrom.bin
chr19.dysgu_chrom.bin chr3.dysgu_chrom.bin chrUn_KI270305v1.dysgu_chrom.bin chrUn_KI270521v1.dysgu_chrom.bin chrX.dysgu_chrom.bin
chr1.dysgu_chrom.bin chr3_GL000221v1_random.dysgu_chrom.bin chrUn_KI270311v1.dysgu_chrom.bin chrUn_KI270538v1.dysgu_chrom.bin chrY.dysgu_chrom.bin
chr1_KI270706v1_random.dysgu_chrom.bin chr4.dysgu_chrom.bin chrUn_KI270320v1.dysgu_chrom.bin chrUn_KI270579v1.dysgu_chrom.bin chrY_KI270740v1_random.dysgu_chrom.bin
chr1_KI270707v1_random.dysgu_chrom.bin chr4_GL000008v2_random.dysgu_chrom.bin chrUn_KI270322v1.dysgu_chrom.bin chrUn_KI270582v1.dysgu_chrom.bin
The total size of the temp directory fluctuates around 500 MB.
Thanks @JMencius, The pipeline is getting stuck at the graph building phase, suggesting there are a lot of candidate SVs (although I can't rule out a bug either). The minimum SV length that dysgu tries to detect is 30 bp, so given the high coverage, and the fact you are using nanopore reads, I'm wondering if there are lots of reads with gaps (indels) that exceed the minimum size threshold? If these occur relatively uniformly across the genome, it could lead to runtime problems, as every gap is considered a candidate. It might be worth checking if this is the case, just by inspecting the data using IGV or GW.
If there are lots of small gaps in your reads, you can try increasing the --min-size parameter, perhaps to 50, which is common for other tools.
The .bin files just record coverage information, so are not so interesting.
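As a rough alternative to eyeballing the reads in a browser, the gap check described above could be approximated from the CIGAR strings. This is only a sketch: `count_large_gaps` is a hypothetical helper, not part of dysgu or samtools, and it simply counts insertion and deletion operations at or above a length threshold in SAM records read from stdin. How dysgu itself selects candidates is not specified here.

```shell
# Hypothetical helper: count I/D CIGAR operations >= a minimum length.
# Reads SAM records on stdin; $1 = minimum gap length (defaults to 30).
count_large_gaps() {
    awk -v min="${1:-30}" -F'\t' '
        /^@/ { next }                      # skip header lines
        {
            cigar = $6                     # CIGAR is the 6th SAM column
            while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
                len = substr(cigar, 1, RLENGTH - 1) + 0
                op  = substr(cigar, RLENGTH, 1)
                if ((op == "I" || op == "D") && len >= min) n++
                cigar = substr(cigar, RLENGTH + 1)
            }
        }
        END { print n + 0 }'
}
```

For example, `samtools view filtered.bam | head -n 100000 | count_large_gaps 30` would give a count for the first 100000 records. Comparing the result for thresholds of 30 and 50 may hint at whether raising --min-size is likely to help.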
Thanks @kcleal I will try adding --min-size 50 and --max-cov auto to my command. I will let you know the results ASAP.
Hi @JMencius,
Further to my last message, you should also probably increase the --min-support parameter, on account of your high coverage. Setting --min-support 10 would be sensible.
Thanks @kcleal After adding --min-size 50 and --max-cov auto to my command, the dysgu call run finished in less than 30 minutes (12 cores). I will further add --min-support 10 to my command.
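For reference, the options suggested in this thread could be combined roughly as follows. This is a sketch, not an official recipe: `run_dysgu_tuned` and the `DYSGU` override variable are hypothetical names introduced here so the command can be printed as a dry run, while the flags and values are the ones discussed above.

```shell
# Hypothetical wrapper around the tuned command from this thread.
# $1 = reference FASTA, $2 = temp directory, $3 = input BAM
# Set DYSGU=echo to print the arguments instead of running dysgu.
run_dysgu_tuned() {
    "${DYSGU:-dysgu}" call -p 12 --mode nanopore \
        --min-size 50 --max-cov auto --min-support 10 \
        "$1" "$2" "$3"
}
```

A real run would redirect the output, for example `run_dysgu_tuned ref.fa temp filtered.bam > filtered.dysgu.vcf 2> filtered.dysgu.stderr`, since dysgu writes the VCF to stdout and its log to stderr.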
Hi @kcleal I figured out the reason for the prolonged run time. If I run dysgu on an unsorted and unindexed BAM file, it just gets stuck forever. After sorting and indexing the BAM file, even the default parameters work well. Maybe samtools indexing should be highlighted in the tutorial?
Ah ok, that makes sense. Glad it works as expected.
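For anyone hitting the same problem, a quick pre-flight check for an index file might look like the sketch below. `has_bam_index` is a hypothetical helper, and it assumes the index sits next to the BAM under one of the usual samtools names (`.bam.bai`, `.bai`, or `.bam.csi`).

```shell
# Hypothetical helper: succeed if an index file exists next to the BAM.
# $1 = path to the BAM file
has_bam_index() {
    [ -e "$1.bai" ] || [ -e "${1%.bam}.bai" ] || [ -e "$1.csi" ]
}
```

If the check fails, the file can be prepared with `samtools sort -@ 12 -o filtered.sorted.bam filtered.bam` followed by `samtools index filtered.sorted.bam` (the output file name here is only an example).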
@kcleal Maybe I can contribute to dysgu to fix the problem that no error is raised when no BAM index is provided. Do you and your team accept pull requests?
Hi @JMencius, Yes, happy to accept a PR, thanks
Also worth noting that dysgu can work without an index, by streaming reads straight to dysgu, so this is a supported use case. However, I think in some situations, an index is required.
I think an error should always be raised if the bam file is not in position sorted order, so this could be straightforward to add.
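Until such a check exists in dysgu, the sort order can be checked from the BAM header before calling. A minimal sketch, assuming the header carries a standard `@HD` line with an `SO:` tag; `is_coordinate_sorted` is a hypothetical helper and says nothing about how dysgu would implement the check internally.

```shell
# Hypothetical helper: succeed if the SAM header on stdin declares
# coordinate (position) sort order via the @HD SO tag.
is_coordinate_sorted() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i == "SO:coordinate") found = 1
        }
        END { exit (found ? 0 : 1) }'
}
```

Usage could be `samtools view -H filtered.bam | is_coordinate_sorted || echo "BAM is not position sorted" >&2`. Note that this trusts the header: a file with a missing or incorrect `SO` tag would need its records inspected instead.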
Maybe I can do something about it.
Distinguished DYSGU developers,
I am new to the DYSGU software, and I humbly ask for your help in understanding the prolonged run time of DYSGU.
First, I used Minimap2 to align my raw nanopore FASTQ sequences (174X coverage) to the reference genome and output the aligned SAM file. I filtered the SAM file with -F 4069 using samtools and converted it to BAM at the same time. I use this BAM file (filtered.bam) as the DYSGU input.
My command for running DYSGU is:
dysgu call -p 12 --mode nanopore GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta temp filtered.bam
The run has taken more than 52 hours (and is still running) on a dual Intel Xeon machine (with an HDD array), and DYSGU mostly uses only a single core. I think this run time is too long, given that you mention "Using a single core and depending on hard-drive speed, dysgu usually takes ~1h to analyse a 30X coverage genome". Also, the program reports no errors and prints nothing after "Building graph with clustering 500000 bp".
Do you have any ideas?