kcleal / dysgu

Toolkit for calling structural variants using short or long reads
MIT License

Long run time #84

Closed JMencius closed 4 months ago

JMencius commented 4 months ago

Distinguished DYSGU developers,

I am new to DYSGU, and I would appreciate your help understanding why my run is taking so long.

First, I used Minimap2 to align my raw nanopore FASTQ reads (174X coverage) to the reference genome and output the alignments as a SAM file. I filtered the SAM file with samtools using -F 4069 and converted it to BAM in the same step. I use this BAM file (filtered.bam) as the DYSGU input.

My command for running DYSGU is:

dysgu call -p 12 --mode nanopore GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta temp filtered.bam

The run has taken more than 52 hours (and is still running) on a dual Intel Xeon machine with an HDD array, and DYSGU mostly uses only a single core. This seems too long, given that you mention "Using a single core and depending on hard-drive speed, dysgu usually takes ~1h to analyse a 30X coverage genome". The program also reports no errors and prints nothing after "Building graph with clustering 500000 bp".

Do you have any ideas?

kcleal commented 4 months ago

Hi @JMencius, That is much slower than expected. If you could share the stderr output of the command that would be useful. As only one core is being used it's probable the pipeline has got stuck processing one large component job. This can sometimes happen at ALT chromosomes, so you could try excluding those chromosomes and running again.

If this issue still persists, would you mind running this script to see where it is getting stuck:

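# Run dysgu on one chromosome at a time to find where it gets stuck.
# Replace ref.fa and bam with the paths to your reference and BAM file.
while read -r chrom length _
do

    # Write a tab-separated BED interval spanning the whole chromosome
    printf '%s\t1\t%s\n' "${chrom}" "${length}" > "${chrom}.search.bed"
    echo "${chrom}"

    dysgu run -p 12 --mode nanopore --max-cov 500 -x \
    --search "${chrom}.search.bed" \
    ref.fa "temp_${chrom}" bam \
    > "${chrom}.dysgu.vcf" 2> "${chrom}.stderr"

done < ref.fa.fai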

Also, you might want to set the --max-cov parameter to auto or some other suitable number. The default is only 200 for long reads, which will result in some regions being excluded with your 174X coverage.
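For the suggestion above of excluding ALT or unplaced contigs, one possible approach is to build a BED file that lists only the primary chromosomes and pass it to --search, as the per-chromosome script does. This is a minimal sketch, not part of dysgu: the function name primary_bed is hypothetical, and it assumes UCSC-style contig names (chr1 to chr22, chrX, chrY) in the .fai index.

```shell
# Print one tab-separated BED line (0-based start) for each primary chromosome
# found in a samtools .fai index read from stdin.
# Assumption: contigs are named chr1..chr22, chrX, chrY. Anything else
# (chrM, chrEBV, *_random, chrUn_*) is dropped. Adjust the pattern as needed.
primary_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         $1 ~ /^chr([0-9]+|X|Y)$/ { print $1, 0, $2 }'
}

# Example usage (file names are placeholders):
#   primary_bed < ref.fa.fai > primary.search.bed
```

The resulting file can then be supplied with --search in place of the per-chromosome BED files.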

JMencius commented 4 months ago

Hi @kcleal Thank you for the quick response. The output of the command is:

The specific directories are masked for privacy.

2024-02-27 09:12:51,436 [INFO   ]  [dysgu-call] Version: 1.6.2
2024-02-27 09:12:51,436 [INFO   ]  Input file is:  {INTPUT_DIR}/dorado0.4.1_hac_hg38_unsorted.bam
2024-02-27 09:12:51,436 [INFO   ]  call -p 12 --mode nanopore {REF_DIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta {TEMP_DIR}/temp {BAM_DIR}/dorado0.4.1_hac_hg38_unsorted.bam
2024-02-27 09:12:51,523 [WARNING]  Warning: no @RG, using input file name as sample name for output: dorado0.4.1_hac_hg38_unsorted
2024-02-27 09:12:51,524 [INFO   ]  Sample name: dorado0.4.1_hac_hg38_unsorted
2024-02-27 09:12:51,524 [INFO   ]  Writing vcf to stdout
2024-02-27 09:12:51,524 [INFO   ]  Running pipeline
2024-02-27 09:12:51,524 [INFO   ]  Sequence divergence upper bound 0.02
2024-02-27 09:12:51,524 [INFO   ]  Building graph with clustering 500000 bp

I will try excluding those ALT chromosomes first.

JMencius commented 4 months ago

Additionally, the files in the temp directory are listed below:

chr10.dysgu_chrom.bin                    chr1_KI270708v1_random.dysgu_chrom.bin   chr5.dysgu_chrom.bin                    chrUn_KI270330v1.dysgu_chrom.bin  chrUn_KI270584v1.dysgu_chrom.bin
chr11.dysgu_chrom.bin                    chr1_KI270709v1_random.dysgu_chrom.bin   chr5_GL000208v1_random.dysgu_chrom.bin  chrUn_KI270333v1.dysgu_chrom.bin  chrUn_KI270588v1.dysgu_chrom.bin
chr11_KI270721v1_random.dysgu_chrom.bin  chr1_KI270710v1_random.dysgu_chrom.bin   chr6.dysgu_chrom.bin                    chrUn_KI270337v1.dysgu_chrom.bin  chrUn_KI270589v1.dysgu_chrom.bin
chr12.dysgu_chrom.bin                    chr1_KI270711v1_random.dysgu_chrom.bin   chr7.dysgu_chrom.bin                    chrUn_KI270366v1.dysgu_chrom.bin  chrUn_KI270590v1.dysgu_chrom.bin
chr13.dysgu_chrom.bin                    chr1_KI270712v1_random.dysgu_chrom.bin   chr8.dysgu_chrom.bin                    chrUn_KI270395v1.dysgu_chrom.bin  chrUn_KI270591v1.dysgu_chrom.bin
chr14.dysgu_chrom.bin                    chr1_KI270713v1_random.dysgu_chrom.bin   chr9.dysgu_chrom.bin                    chrUn_KI270420v1.dysgu_chrom.bin  chrUn_KI270593v1.dysgu_chrom.bin
chr14_GL000009v2_random.dysgu_chrom.bin  chr1_KI270714v1_random.dysgu_chrom.bin   chr9_KI270717v1_random.dysgu_chrom.bin  chrUn_KI270422v1.dysgu_chrom.bin  chrUn_KI270741v1.dysgu_chrom.bin
chr14_GL000194v1_random.dysgu_chrom.bin  chr20.dysgu_chrom.bin                    chr9_KI270718v1_random.dysgu_chrom.bin  chrUn_KI270435v1.dysgu_chrom.bin  chrUn_KI270742v1.dysgu_chrom.bin
chr14_GL000225v1_random.dysgu_chrom.bin  chr21.dysgu_chrom.bin                    chr9_KI270719v1_random.dysgu_chrom.bin  chrUn_KI270438v1.dysgu_chrom.bin  chrUn_KI270743v1.dysgu_chrom.bin
chr14_KI270722v1_random.dysgu_chrom.bin  chr22.dysgu_chrom.bin                    chr9_KI270720v1_random.dysgu_chrom.bin  chrUn_KI270442v1.dysgu_chrom.bin  chrUn_KI270744v1.dysgu_chrom.bin
chr14_KI270723v1_random.dysgu_chrom.bin  chr22_KI270731v1_random.dysgu_chrom.bin  chrEBV.dysgu_chrom.bin                  chrUn_KI270448v1.dysgu_chrom.bin  chrUn_KI270745v1.dysgu_chrom.bin
chr14_KI270724v1_random.dysgu_chrom.bin  chr22_KI270732v1_random.dysgu_chrom.bin  chrM.dysgu_chrom.bin                    chrUn_KI270465v1.dysgu_chrom.bin  chrUn_KI270746v1.dysgu_chrom.bin
chr14_KI270725v1_random.dysgu_chrom.bin  chr22_KI270733v1_random.dysgu_chrom.bin  chrUn_GL000195v1.dysgu_chrom.bin        chrUn_KI270466v1.dysgu_chrom.bin  chrUn_KI270747v1.dysgu_chrom.bin
chr15.dysgu_chrom.bin                    chr22_KI270734v1_random.dysgu_chrom.bin  chrUn_GL000213v1.dysgu_chrom.bin        chrUn_KI270467v1.dysgu_chrom.bin  chrUn_KI270748v1.dysgu_chrom.bin
chr15_KI270727v1_random.dysgu_chrom.bin  chr22_KI270735v1_random.dysgu_chrom.bin  chrUn_GL000214v1.dysgu_chrom.bin        chrUn_KI270468v1.dysgu_chrom.bin  chrUn_KI270749v1.dysgu_chrom.bin
chr16.dysgu_chrom.bin                    chr22_KI270736v1_random.dysgu_chrom.bin  chrUn_GL000216v2.dysgu_chrom.bin        chrUn_KI270507v1.dysgu_chrom.bin  chrUn_KI270750v1.dysgu_chrom.bin
chr16_KI270728v1_random.dysgu_chrom.bin  chr22_KI270737v1_random.dysgu_chrom.bin  chrUn_GL000218v1.dysgu_chrom.bin        chrUn_KI270509v1.dysgu_chrom.bin  chrUn_KI270751v1.dysgu_chrom.bin
chr17.dysgu_chrom.bin                    chr22_KI270738v1_random.dysgu_chrom.bin  chrUn_GL000219v1.dysgu_chrom.bin        chrUn_KI270511v1.dysgu_chrom.bin  chrUn_KI270753v1.dysgu_chrom.bin
chr17_GL000205v2_random.dysgu_chrom.bin  chr22_KI270739v1_random.dysgu_chrom.bin  chrUn_GL000220v1.dysgu_chrom.bin        chrUn_KI270512v1.dysgu_chrom.bin  chrUn_KI270754v1.dysgu_chrom.bin
chr17_KI270729v1_random.dysgu_chrom.bin  chr2.dysgu_chrom.bin                     chrUn_GL000224v1.dysgu_chrom.bin        chrUn_KI270515v1.dysgu_chrom.bin  chrUn_KI270755v1.dysgu_chrom.bin
chr17_KI270730v1_random.dysgu_chrom.bin  chr2_KI270715v1_random.dysgu_chrom.bin   chrUn_GL000226v1.dysgu_chrom.bin        chrUn_KI270516v1.dysgu_chrom.bin  chrUn_KI270756v1.dysgu_chrom.bin
chr18.dysgu_chrom.bin                    chr2_KI270716v1_random.dysgu_chrom.bin   chrUn_KI270303v1.dysgu_chrom.bin        chrUn_KI270519v1.dysgu_chrom.bin  chrUn_KI270757v1.dysgu_chrom.bin
chr19.dysgu_chrom.bin                    chr3.dysgu_chrom.bin                     chrUn_KI270305v1.dysgu_chrom.bin        chrUn_KI270521v1.dysgu_chrom.bin  chrX.dysgu_chrom.bin
chr1.dysgu_chrom.bin                     chr3_GL000221v1_random.dysgu_chrom.bin   chrUn_KI270311v1.dysgu_chrom.bin        chrUn_KI270538v1.dysgu_chrom.bin  chrY.dysgu_chrom.bin
chr1_KI270706v1_random.dysgu_chrom.bin   chr4.dysgu_chrom.bin                     chrUn_KI270320v1.dysgu_chrom.bin        chrUn_KI270579v1.dysgu_chrom.bin  chrY_KI270740v1_random.dysgu_chrom.bin
chr1_KI270707v1_random.dysgu_chrom.bin   chr4_GL000008v2_random.dysgu_chrom.bin   chrUn_KI270322v1.dysgu_chrom.bin        chrUn_KI270582v1.dysgu_chrom.bin

The total size of the temp directory fluctuates around 500 MB.

kcleal commented 4 months ago

Thanks @JMencius, The pipeline is getting stuck at the graph building phase, suggesting there are a lot of candidate SVs (although I can't rule out a bug either). The minimum SV length that dysgu tries to detect is 30bp, so given the high coverage, and the fact you are using nanopore reads, I'm wondering if there are lots of reads with gaps (indels) that exceed the minimum size threshold? If these occur relatively uniformly across the genome, it could lead to runtime problems, as every gap is considered a candidate. It might be worth checking if this is the case, just by inspecting the data using IGV or GW.

If there are lots of small gaps in your reads, you can try increasing the --min-size parameter, perhaps to 50 which is common for other tools.

The .bin files just record coverage information, so they are not very interesting.
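As a rough command-line complement to inspecting the reads in IGV or GW, the number of alignment gaps at or above the size threshold can be counted directly from CIGAR strings. The sketch below is an assumption-laden illustration rather than anything dysgu does internally: count_large_gaps is a hypothetical helper that reads SAM records on stdin and counts only I and D operations, and the region in the usage comment is arbitrary.

```shell
# Count CIGAR insertions/deletions of at least min_size bp (default 30)
# in SAM records read from stdin.
# Prints: <reads seen> <reads with at least one such gap> <total gaps>
count_large_gaps() {
    awk -v min_size="${1:-30}" '
        BEGIN { FS = "\t" }
        /^@/ { next }                       # skip SAM header lines
        {
            reads++
            cigar = $6
            found = 0
            # Walk the CIGAR string one <length><op> element at a time.
            while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
                len = substr(cigar, 1, RLENGTH - 1) + 0
                op  = substr(cigar, RLENGTH, 1)
                if ((op == "I" || op == "D") && len >= min_size) {
                    gaps++
                    found = 1
                }
                cigar = substr(cigar, RLENGTH + 1)
            }
            if (found) reads_with_gap++
        }
        END { printf "%d %d %d\n", reads, reads_with_gap, gaps }
    '
}

# Example usage on a small region of an indexed BAM (placeholder names):
#   samtools view filtered.bam chr1:1000000-2000000 | count_large_gaps 30
```

Comparing the counts at 30 and 50 gives a quick feel for how many candidates a larger --min-size would discard.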

JMencius commented 4 months ago

Thanks @kcleal I will try adding --min-size 50 and --max-cov auto to my command. I will let you know the results ASAP.

kcleal commented 4 months ago

Hi @JMencius, Further to my last message, you should also probably increase the --min-support parameter, on account of your high coverage. Setting to --min-support 10 would be sensible.

JMencius commented 4 months ago

Thanks @kcleal After adding --min-size 50 and --max-cov auto to my command, the DYSGU call run finished in less than 30 minutes (12 cores). I will further add --min-support 10 to my command.

JMencius commented 4 months ago

Hi @kcleal I figured out the cause of the prolonged run time. If I run DYSGU on an unsorted and unindexed BAM file, it just gets stuck forever. After indexing the BAM file, even the default parameters work well. Maybe samtools indexing should be highlighted in the tutorial?
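For anyone hitting the same problem, a minimal sketch of the preparation step is shown here, assuming samtools is on the PATH. The function names has_bam_index and sort_and_index are hypothetical helpers, and the file names in the usage comment are placeholders.

```shell
# Return 0 if an index file sits next to the given BAM, 1 otherwise.
# Checks the conventional index names: x.bam.bai, x.bai and x.bam.csi.
has_bam_index() {
    bam="$1"
    [ -e "${bam}.bai" ] || [ -e "${bam%.bam}.bai" ] || [ -e "${bam}.csi" ]
}

# Coordinate-sort a BAM and then index the sorted output.
# Arguments: input BAM, output BAM, number of extra threads (default 4).
sort_and_index() {
    in_bam="$1"
    out_bam="$2"
    threads="${3:-4}"
    samtools sort -@ "$threads" -o "$out_bam" "$in_bam" &&
    samtools index -@ "$threads" "$out_bam"
}

# Example usage (placeholder file names):
#   has_bam_index filtered.bam || sort_and_index filtered.bam filtered.sorted.bam 12
```

The sort comes first because samtools index requires coordinate-sorted input.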

kcleal commented 4 months ago

Ah ok, that makes sense. Glad it works as expected.

JMencius commented 2 months ago

@kcleal Maybe I can contribute to dysgu to fix the problem that no error is raised when no BAM index is provided. Do you and your team accept pull requests?

kcleal commented 2 months ago

Hi @JMencius, Yes, happy to accept a PR, thanks

kcleal commented 2 months ago

Also worth noting that dysgu can work without an index, by streaming reads straight to dysgu, so this is a supported use case. However, I think in some situations, an index is required.

I think an error should always be raised if the bam file is not in position sorted order, so this could be straightforward to add
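Until such a check exists in dysgu itself, a pre-flight test outside the tool is one option. The sketch below is not dysgu's implementation: header_sort_order and require_coordinate_sorted are hypothetical shell helpers that only read the SO tag declared on the @HD header line, so they trust the header rather than verifying read positions.

```shell
# Read SAM header text on stdin and print the SO: value from the @HD line.
# Prints "unknown" if there is no @HD line or it carries no SO tag.
header_sort_order() {
    awk 'BEGIN { FS = "\t"; so = "unknown" }
         $1 == "@HD" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^SO:/) so = substr($i, 4)
         }
         END { print so }'
}

# Fail with a message on stderr unless the header declares coordinate order.
require_coordinate_sorted() {
    so=$(header_sort_order)
    if [ "$so" != "coordinate" ]; then
        echo "error: BAM sort order is '$so', expected 'coordinate'" >&2
        return 1
    fi
}

# Example usage (placeholder file name):
#   samtools view -H filtered.bam | require_coordinate_sorted && echo "ok to run dysgu"
```

A header check is cheap but can be wrong if the file was modified without updating @HD, so comparing the positions of consecutive reads would be the more robust test.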

JMencius commented 2 months ago

I think an error should always be raised if the bam file is not in position sorted order, so this could be straight forward to add

Maybe I can do something about it.