google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Prolonged DeepVariant Script Execution Time #681

Closed Taghrid-M closed 1 year ago

Taghrid-M commented 1 year ago

Hi

I have attempted to execute a script using DeepVariant, and it has been running for approximately six days now without completion. I've only received intermediate outputs so far, without the expected final results.

Here are the details of the run:

singularity run -B ${SOFTWARE_DIR}:/software --bind ${INPUT_DIR}:/input --bind ${OUTPUT_DIR}:/output \
docker://google/deepvariant:"${BIN_VERSION}" \
 /opt/deepvariant/bin/run_deepvariant \
 --model_type=ONT_R104 \
 --ref=/input/references/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
 --reads=/input/HG004-hg38.ont.mm2.bam \
 --output_vcf=/output/HG004_hg38.vcf.gz \
 --output_gvcf=/output/HG004_hg38.g.vcf.gz \
 --num_shards=12 \
 --intermediate_results_dir=/output/ \
 --dry_run=false

Can you please provide any insight on this issue?

Thank you very much for your time and assistance.

pgrosu commented 1 year ago

Hi Taghrid,

I'm sorry to hear you are experiencing this. I just have a few questions:

1) Have you first gone through the DeepVariant Quick Start to check that a smaller DeepVariant run completes successfully on your system?
2) How much free memory do you have?
3) How much free disk space do you have?
4) How many CPU cores do you have, and how occupied are they?
5) Do you have NVIDIA GPUs available on your system?

I am assuming you are running this on a cluster as DeepVariant can be resource-intensive.
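For reference, a quick way to gather these numbers on a Linux cluster node; this is a sketch (not part of DeepVariant), and `nvidia-smi` only exists if the NVIDIA driver is installed:

```shell
# Quick resource check on a Linux node; adjust for your cluster.
cores=$(nproc 2>/dev/null || echo 1)   # CPU core count
echo "CPU cores: ${cores}"
uptime                                 # load averages vs. core count
if [ -r /proc/meminfo ]; then          # free memory, in GiB
  awk '/MemAvailable/ {printf "Free memory: %.1f GiB\n", $2/1048576}' /proc/meminfo
fi
df -h . | tail -n 1                    # free disk space on the current volume
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -L                        # lists available NVIDIA GPUs
else
  echo "No NVIDIA GPU detected"
fi
```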

Thank you, Paul

kishwarshafin commented 1 year ago

Thanks @pgrosu. Knowing these would be very helpful.

Besides compute, this can also be an issue with the input data.

@Taghrid-M ,

Can you please tell us a little more about the data in HG004-hg38.ont.mm2.bam:

1) What chemistry is this data, R9 or R10?
2) What basecaller version did you use for basecalling this data?
3) What is the average read length of the reads?

Please note that DeepVariant currently supports R10.4 simplex and duplex variant calling for nanopore. If your data is from an earlier chemistry or basecaller version, please use PEPPER to call variants.
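The basecaller and read-length questions can often be answered from the BAM itself; a sketch, assuming samtools is installed (the @PG header lines typically record the basecaller and aligner versions, and `samtools stats` reports the average read length):

```shell
# Inspect basecaller provenance and read-length stats in the BAM; a sketch
# assuming samtools is on PATH and the BAM is in the current directory.
BAM=HG004-hg38.ont.mm2.bam
if command -v samtools >/dev/null 2>&1 && [ -f "$BAM" ]; then
  samtools view -H "$BAM" | grep '^@PG'                 # basecaller/aligner program lines
  samtools stats "$BAM" | grep -E '^SN.*average length' # mean read length
else
  echo "samtools or $BAM not available on this machine"
fi
```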

pgrosu commented 1 year ago

That's a good point @kishwarshafin! I think @Taghrid-M is probably using GIAB data based on the following -- as that's the only Nanopore I see for the HG004 sample -- and then probably using minimap2 to align:

https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG004_NA24143_mother/UCSC_Ultralong_OxfordNanopore_Promethion/

@Taghrid-M is probably using the following documentation (as it seems to match his run):

https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-ont-r104-duplex-case-study.md

In any case, it would still be a huge BAM file requiring significant resources, but I'll let @Taghrid-M fill in the gaps.

Thanks, ~p

Taghrid-M commented 1 year ago

Thanks @pgrosu @kishwarshafin, I appreciate your swift reply!

Yes, I am using a cluster, and the data have been obtained from precisionFDA https://data.nist.gov/od/id/mds2-2336

What chemistry is this data? Is it R9 or R10? This data was generated using R9.4 flow cells.

What is the basecaller version you used for basecalling this data? The basecalling process was performed using Guppy Version 3.6.

What is the average read length of the reads? 85X.

Have you tried first going through the DeepVariant Quick Start guide to check if a smaller DeepVariant run completes successfully on your system? Yes, I have successfully run it.

How much free memory do you have? 1.3T

How much free disk space do you have? I have approximately 14T of free disk space.

How many CPU cores do you have, and what is their occupancy level? 16 CPU cores

Do you have any NVIDIA GPUs available on your system? No.

pgrosu commented 1 year ago

Hi @Taghrid-M,

This is good! One small thing, I think the average read length is 48,060 based on this publication.

The thing is that Guppy 3.6.0 is a bit old and will have a higher error rate when processing the FAST5 signals from the R9 nanopore through the bidirectional RNN to generate the FASTQ file, as shown in the following post.

So that the proper PEPPER SNP model gets selected internally, you can use the --ont_r9_guppy4_hac argument with run_pepper_margin_deepvariant call_variants, though I'm not sure version r0.8 has the Guppy 4 model. Otherwise you can use version r0.4 of the Docker container.

Ideally, you could get the FAST5 files from the following Amazon S3 page and reprocess them with Guppy 5 (as that's the latest version that the PEPPER model seems to be trained against), so that you can then utilize the --ont_r9_guppy5_sup parameter with the r0.8 container, or version r0.5 of the Docker container.

For troubleshooting, you could re-run with --dry_run=true to get the individual commands, then run each one separately to determine where the bottleneck is stemming from.
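As a sketch, that would be the original invocation from the report with only the last flag flipped; the directory variables are placeholders for your cluster paths, and BIN_VERSION is assumed to be a 1.5.x release per the case-study link above:

```shell
# Build the dry-run variant of the reported command; run_deepvariant will then
# print its make_examples / call_variants / postprocess_variants commands
# instead of executing them, so each stage can be timed separately.
SOFTWARE_DIR=${SOFTWARE_DIR:-/path/to/software}   # placeholder
INPUT_DIR=${INPUT_DIR:-/path/to/input}            # placeholder
OUTPUT_DIR=${OUTPUT_DIR:-/path/to/output}         # placeholder
BIN_VERSION=${BIN_VERSION:-1.5.0}                 # assumed version

DV_CMD="singularity run -B ${SOFTWARE_DIR}:/software --bind ${INPUT_DIR}:/input --bind ${OUTPUT_DIR}:/output \
  docker://google/deepvariant:${BIN_VERSION} \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=ONT_R104 \
  --ref=/input/references/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  --reads=/input/HG004-hg38.ont.mm2.bam \
  --output_vcf=/output/HG004_hg38.vcf.gz \
  --output_gvcf=/output/HG004_hg38.g.vcf.gz \
  --num_shards=12 \
  --intermediate_results_dir=/output/ \
  --dry_run=true"

echo "Command to run on the cluster:"
echo "$DV_CMD"
```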

I'll wait for @kishwarshafin to confirm what would be the most effective approach.

Thank you, Paul

kishwarshafin commented 1 year ago

Hi @pgrosu thank you for finding the detailed sources! Yes you are exactly right.

@Taghrid-M as @pgrosu said, Guppy 3.6 HAC mode is very old, and the only caller supporting it would be PEPPER r0.4. There are several sources of newer data for HG002; one of those is the human-pangenome project. For example, you can find Guppy 6 SUP data from here. Hope this helps.

pgrosu commented 1 year ago

Hi @kishwarshafin,

Very cool -- absolutely happy to help out and many thanks!

~p

Taghrid-M commented 1 year ago

@pgrosu @kishwarshafin

I'm deeply grateful for your thorough explanation and assistance. I'll attempt to utilize HG002 from the human pangenome, following your advice. Your help is greatly appreciated.