google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

DeepVariant has been running for 17 days #578

Closed olechnwin closed 1 year ago

olechnwin commented 1 year ago

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.4/docs/FAQ.md: yes

Describe the issue: I am running DeepVariant on a custom genome assembly using a hybrid of PacBio HiFi and Illumina short reads, and it has been running for 17 days. I wonder whether something is wrong, and whether there is a way to speed things up. I am already using 30 shards.

Setup

Steps to reproduce:

Does the quick start test work on your system? Please test with https://github.com/google/deepvariant/blob/r0.10/docs/deepvariant-quick-start.md. Is there any way to reproduce the issue by using the quick start? I was able to run the quick start. I was also able to run DeepVariant on the same Singularity system with PacBio HiFi reads only, using the human reference genome hg19.

Any additional context: r_deepvariant_hybrid_2_662510.txt

kishwarshafin commented 1 year ago

@olechnwin ,

Looking at the log, it is generating a lot of examples. Is the error rate of scaffolds_FINAL.fasta too high? Can you give a little more context on what type of genome you are running this on?

olechnwin commented 1 year ago

@kishwarshafin,

Thank you so much for your reply. How do I get the error rate of the genome assembly? This is a human cancer cell line. I ran QUAST on the assembly, comparing it with hg19: the number of misassemblies is ~2000 and the number of mismatches per 100 kb is 124.3. Why does it generate a lot of examples?
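For a rough sense of scale, the QUAST figure quoted above can be converted to a per-base rate. The sketch below is only arithmetic on the 124.3 mismatches per 100 kb reported in this thread. The 3.1 Gb genome size is an assumed round number for a human genome, and because the comparison is against hg19, the resulting rate mixes real variants of the cell line with assembly errors, so it is at best an upper bound on the assembly error rate.

```python
# Convert QUAST's "mismatches per 100 kb" into a per-base mismatch rate.
MISMATCHES_PER_100KB = 124.3   # value reported in this thread
ASSUMED_GENOME_SIZE = 3.1e9    # assumption: approximate human genome size in bp


def per_base_rate(mismatches_per_100kb: float) -> float:
    """Return the mismatch rate per aligned base."""
    return mismatches_per_100kb / 100_000


rate = per_base_rate(MISMATCHES_PER_100KB)
# Extrapolate to the whole genome under the assumed size.
expected_mismatches = rate * ASSUMED_GENOME_SIZE

print(f"per-base mismatch rate: {rate:.6f} ({rate * 100:.4f}%)")
print(f"approx. mismatching bases genome-wide: {expected_mismatches:,.0f}")
```

Under these assumptions the assembly differs from hg19 at a few million positions, each of which is a site where aligned reads may disagree with the reference used for calling.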

kishwarshafin commented 1 year ago

@olechnwin ,

The reference you are using, "scaffolds_FINAL.fasta", may be a low-quality assembly. In that case, when you align the reads to it, the reads will show a lot of mismatches against the assembly. One thing you can do is manually inspect your hybrid_hifi_Kapa_combined.bam in IGV to see whether the alignments look generally decent. If too many mismatches are observed in the reads, DeepVariant will generate a lot of examples.

DeepVariant runtime is measured against higher-quality human genomes (e.g. GRCh38/T2T-CHM13). If you are using a low-quality assembly, your runtime will increase.

olechnwin commented 1 year ago

@kishwarshafin ,

I am actually trying to polish the assembly with DeepVariant to increase its quality. Do you have any suggestions on how to speed it up? Is there an additional step I should do to increase the quality before running DeepVariant? Does increasing the number of shards help?

Thanks!

kishwarshafin commented 1 year ago

@olechnwin ,

In that case, you need to increase the SNP and indel candidate thresholds with the --make_examples_extra_args="vsc_min_fraction_snps=0.2,vsc_min_fraction_indels=0.2" parameter. Here 0.2 means an allele must be supported by at least 20% of the reads at a site to become a candidate. You can set this to a threshold of your choice.
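To make the effect of the threshold concrete, here is a minimal sketch of an allele-fraction filter. It is not DeepVariant's implementation: `alt_fraction`, `is_candidate`, and the read counts are hypothetical and only illustrate the idea that a site becomes a candidate when the fraction of reads supporting the alternate allele reaches the minimum fraction.

```python
def alt_fraction(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a site that support the alternate allele."""
    if total_reads <= 0:
        return 0.0
    return alt_reads / total_reads


def is_candidate(alt_reads: int, total_reads: int, min_fraction: float = 0.2) -> bool:
    """Hypothetical filter: keep the site only if the alt fraction reaches min_fraction."""
    return alt_fraction(alt_reads, total_reads) >= min_fraction


# Made-up (alt_reads, total_reads) pairs, all at 30x depth.
sites = [(3, 30), (9, 30), (15, 30)]

for alt, depth in sites:
    kept = is_candidate(alt, depth, min_fraction=0.2)
    print(f"alt={alt:2d} depth={depth} fraction={alt_fraction(alt, depth):.2f} candidate={kept}")
```

With a minimum fraction of 0.2, the site supported by 3 of 30 reads is dropped while the other two are kept. A higher threshold drops more low-support sites, which is why it reduces the number of examples generated.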

olechnwin commented 1 year ago

@kishwarshafin,

Thank you! I'll try changing the threshold. Can you please elaborate on how to choose it? Is there more detailed documentation on how far to increase the SNP threshold when polishing a genome? Thanks again!

pichuan commented 1 year ago

Hi @olechnwin ,

The thresholds are usually chosen empirically. Depending on the task we're trying to achieve, we choose them to find the best tradeoff between sensitivity and the amount of noise we bring in. This is more of a research problem, especially since you're trying to adapt the DeepVariant code to a different problem, so we won't be able to easily share a recipe here for you.
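The kind of empirical sweep described above might look roughly like the sketch below. It is not a recipe and not part of DeepVariant. The sites in `TOY_SITES` are invented purely for illustration, and in practice the "real error" labels would have to come from some trusted truth set for the assembly being polished.

```python
from typing import List, Tuple

# Hypothetical toy sites: (alt allele fraction, True if the site is a real assembly error).
TOY_SITES: List[Tuple[float, bool]] = [
    (0.05, False), (0.08, False), (0.12, False), (0.15, True),
    (0.25, False), (0.35, True), (0.60, True), (0.95, True),
]


def sweep(sites: List[Tuple[float, bool]],
          thresholds: List[float]) -> List[Tuple[float, float, int]]:
    """For each threshold, return (threshold, sensitivity, noisy candidates kept)."""
    total_true = sum(1 for _, real in sites if real)
    rows = []
    for t in thresholds:
        kept = [(frac, real) for frac, real in sites if frac >= t]
        true_kept = sum(1 for _, real in kept if real)
        noise = len(kept) - true_kept
        sensitivity = true_kept / total_true if total_true else 0.0
        rows.append((t, sensitivity, noise))
    return rows


results = sweep(TOY_SITES, [0.1, 0.2, 0.3])
for threshold, sensitivity, noise in results:
    print(f"threshold={threshold:.1f} sensitivity={sensitivity:.2f} noisy_candidates={noise}")
```

On this toy data, raising the threshold removes noisy candidates but also loses the one real error with low read support, which is the tradeoff the threshold has to balance.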