Closed: Taghrid-M closed this issue 1 year ago.
Hi Taghrid,
I'm sorry to hear you are experiencing this. I just have a few questions:
1) Have you tried first going through the DeepVariant Quick Start, to check that a smaller DeepVariant run completes successfully on your system?
2) How much free memory do you have?
3) How much free disk space do you have?
4) How many CPU cores do you have, and how occupied are they?
5) Do you have NVIDIA GPUs available on your system?
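For what it's worth, questions 2)–5) can be answered quickly with standard Linux tools (whether `nvidia-smi` is present depends on the NVIDIA driver being installed):

```shell
free -h                                  # free memory
df -h .                                  # free disk space on this filesystem
nproc                                    # number of CPU cores
uptime                                   # load averages: how busy the cores are
nvidia-smi -L 2>/dev/null || echo "no NVIDIA GPU/driver found"
```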
I am assuming you are running this on a cluster, as DeepVariant can be resource-intensive.
Thank you, Paul
Thanks @pgrosu. Knowing these details would be very helpful.
Besides compute, this can also be an issue with the input data.
@Taghrid-M,

Can you please tell us a little more about the data in `HG004-hg38.ont.mm2.bam`?

1) What chemistry is this data, R9 or R10?
2) What basecaller version did you use for basecalling this data?
3) What is the average read length of the reads?
Please note that DeepVariant currently supports R10.4 simplex and duplex variant calling for nanopore. If your data is from an earlier chemistry or basecaller version, please use PEPPER to call variants.
That's a good point @kishwarshafin! I think @Taghrid-M is probably using GIAB data, as that's the only Nanopore data I see for the HG004 sample, and then aligning with minimap2. The run seems to match the following documentation:
https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-ont-r104-duplex-case-study.md
In any case, it would still be a huge BAM file requiring significant resources, but I'll let @Taghrid-M fill in the gaps.
Thanks, ~p
Thanks @pgrosu @kishwarshafin, I appreciate your swift reply!
Yes, I am using a cluster, and the data have been obtained from precisionFDA https://data.nist.gov/od/id/mds2-2336
> What chemistry is this data, R9 or R10?

This data was generated using R9.4 flow cells.

> What basecaller version did you use for basecalling this data?

The basecalling was performed using Guppy version 3.6.

> What is the average read length of the reads?

85X.

> Have you tried first going through the DeepVariant Quick Start to check that a smaller DeepVariant run completes successfully on your system?

Yes, I have run it successfully.

> How much free memory do you have?

1.3T.

> How much free disk space do you have?

I have approximately 14T of free disk space.

> How many CPU cores do you have, and how occupied are they?

16 CPU cores.

> Do you have any NVIDIA GPUs available on your system?

No.
Hi @Taghrid-M,
This is good! One small thing: 85X refers to coverage; I think the average read length is 48,060 based on this publication.
The thing is that Guppy 3.6.0 is a bit old, and will have a higher error rate when processing the FAST5 signals from the R9 nanopore through the bidirectional RNN to generate the FASTQ file, as shown in the following post.
So that the proper PEPPER SNP model gets selected internally, you can use the `--ont_r9_guppy4_hac` argument with `run_pepper_margin_deepvariant call_variant`, though I'm not sure version r0.8 still has the Guppy 4 model. Otherwise, you can use version r0.4 of the Docker container.
Ideally, maybe you can get the FAST5 files from the following Amazon S3 page and reprocess them with Guppy 5 (as that's the latest version the PEPPER model seems to be trained against), so that you can then utilize the `--ont_r9_guppy5_sup` parameter with the r0.8 container, or version r0.5 of the Docker container.
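If it helps, here is a minimal sketch of such a run. The Docker Hub image name (`kishwars/pepper_deepvariant`) and the `call_variant` subcommand are taken from the PEPPER docs as I remember them, and the BAM/reference paths and thread count are placeholders, so please double-check everything against the r0.8 README:

```shell
# Sketch: PEPPER-Margin-DeepVariant on R9 reads basecalled with Guppy 5 "sup".
# All paths below are placeholders for your cluster's layout.
docker run -v "${PWD}:/data" kishwars/pepper_deepvariant:r0.8 \
  run_pepper_margin_deepvariant call_variant \
  -b /data/HG004-hg38.ont.mm2.bam \
  -f /data/GRCh38.fa \
  -o /data/pepper_output \
  -t 16 \
  --ont_r9_guppy5_sup
```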
Regarding troubleshooting, maybe you can run it with `--dry` to get the individual commands, so you can run each one individually and determine where the bottleneck is stemming from.
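To make that concrete, a sketch assuming the `kishwars/pepper_deepvariant` image and `call_variant` subcommand from the PEPPER docs (paths, output directory, and thread count are placeholders):

```shell
# --dry prints the pipeline's individual commands instead of executing them,
# so each stage can then be launched and timed on its own to find the
# bottleneck. All paths are placeholders.
docker run -v "${PWD}:/data" kishwars/pepper_deepvariant:r0.8 \
  run_pepper_margin_deepvariant call_variant \
  -b /data/HG004-hg38.ont.mm2.bam \
  -f /data/GRCh38.fa \
  -o /data/pepper_dry_run \
  -t 16 \
  --ont_r9_guppy5_sup \
  --dry
```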
I'll wait for @kishwarshafin to confirm what would be the most effective approach.
Thank you, Paul
Hi @pgrosu, thank you for finding the detailed sources! Yes, you are exactly right.
@Taghrid-M as @pgrosu said, Guppy 3.6 HAC mode is very old, and the only caller supporting it would be PEPPER r0.4. There are several sources of newer data for HG002; one of those is the Human Pangenome project. For example, you can find Guppy 6 SUP data here. Hope this helps.
Hi @kishwarshafin,
Very cool -- absolutely happy to help out and many thanks!
~p
@pgrosu @kishwarshafin
I'm deeply grateful for your thorough explanation and assistance. I'll attempt to utilize HG002 from the human pangenome, following your advice. Your help is greatly appreciated.
Hi
I have attempted to execute a script using DeepVariant, and it has been running for approximately six days now without completion. I've only received intermediate outputs so far, without the expected final results.
Here are the details of the run:
Can you please provide any insight into this issue?
Thank you very much for your time and assistance.