No output files (VCFs) for all genome

vera-gomes commented 3 months ago

Describe the issue: I am not getting any output files, even though the log file shows no major issues (see attached). I first ran DeepVariant on the same data for chr20 only, and everything went fine. Now, running on the whole genome, I don't get the VCFs. deepvariant_run.log

Setup

Operating system: Windows, WSL2 (5.15.146.1-microsoft-standard-WSL2)
DeepVariant version: 1.4.0
Installation method: Docker
Type of data: NA12878, bam file

Steps to reproduce:

    sudo docker run \
      -v "/mnt/c/Users/pinto/OneDrive - Universidade de Lisboa/Revisao bibliografica/Scoping Review/alg_testing:/input" \
      -v "/mnt/c/Users/pinto/OneDrive - Universidade de Lisboa/Revisao bibliografica/Scoping Review/alg_testing/output:/output" \
      google/deepvariant:1.4.0 \
      /opt/deepvariant/bin/run_deepvariant \
      --model_type=WGS \
      --ref=/input/genome.fa \
      --reads=/input/sorted.bam \
      --output_vcf=/output/outputdeepvar.vcf \
      --output_gvcf=/output/outputdeepvar.g.vcf \
      --num_shards=4 \
      > "/mnt/c/Users/pinto/OneDrive - Universidade de Lisboa/Revisao bibliografica/Scoping Review/alg_testing/output/deepvariant_run.log" 2>&1

pichuan commented 3 months ago

From your log, I suspect that the postprocess_variants step failed. Your log only shows this:

***** Running the command:*****
time /opt/deepvariant/bin/postprocess_variants --ref "/input/genome.fa" --infile "/tmp/tmpih5xsned/call_variants_output.tfrecord.gz" --outfile "/output/outputdeepvar.vcf" --nonvariant_site_tfrecord_path "/tmp/tmpih5xsned/gvcf.tfrecord@10.gz" --gvcf_outfile "/output/outputdeepvar.g.vcf"

I0813 04:00:29.233929 140206500009792 postprocess_variants.py:971] Using sample name from call_variants output. Sample name: NA12878
2024-08-13 04:00:29.242227: I deepvariant/postprocess_variants.cc:88] Read from: /tmp/tmpih5xsned/call_variants_output.tfrecord.gz

Which is surprising, because I'd expect more errors if anything is wrong.

Did you observe any issues with RAM or disk space running out in the last step?

By the way, next time you run it, you can set the --intermediate_results_dir flag, as in: https://github.com/google/deepvariant/blob/r1.6.1/docs/deepvariant-quick-start.md#run-deepvariant-with-one-command

That way, the output from the make_examples and call_variants steps will be saved, and you can just rerun postprocess_variants step if needed.
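
As a rough illustration of that suggestion, the sketch below adds `--intermediate_results_dir` to the command from the original report. The host directories `INPUT_DIR` and `OUTPUT_DIR`, the `run` dry-run helper, and the directory name `/output/intermediate_results_dir` are assumptions made for this example; by default the script only prints the command rather than starting Docker.

```shell
#!/bin/sh
# Hypothetical host directories; point these at your own data.
INPUT_DIR="${INPUT_DIR:-$PWD/alg_testing}"
OUTPUT_DIR="${OUTPUT_DIR:-$PWD/alg_testing/output}"

# Print the command when DRY_RUN=1 (the default in this sketch); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Same invocation as in the report, plus a directory (under the mounted
# /output volume) where make_examples and call_variants results are kept.
# Prefix "docker" with "sudo" if your Docker setup needs it.
run_deepvariant_keep_intermediates() {
    run docker run \
        -v "${INPUT_DIR}:/input" \
        -v "${OUTPUT_DIR}:/output" \
        google/deepvariant:1.4.0 \
        /opt/deepvariant/bin/run_deepvariant \
        --model_type=WGS \
        --ref=/input/genome.fa \
        --reads=/input/sorted.bam \
        --output_vcf=/output/outputdeepvar.vcf \
        --output_gvcf=/output/outputdeepvar.g.vcf \
        --intermediate_results_dir=/output/intermediate_results_dir \
        --num_shards=4
}

run_deepvariant_keep_intermediates
```

Set `DRY_RUN=0` to actually launch the container. If the last step fails again, the postprocess_variants command printed in the log can be rerun on its own with the `/tmp/...` paths replaced by the files saved in that directory; the exact file names depend on the DeepVariant version, so check the directory contents first.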

melop commented 3 months ago

I can confirm this problem on some of my samples. The program never produces the final outputs because the postprocessing step ends prematurely, with no error message reported.

Normal log:
`
I0814 02:06:33.719291 140247404730176 postprocess_variants.py:1211] Using sample name from call_variants output. Sample name: default
2024-08-14 02:06:33.734191: I deepvariant/postprocess_variants.cc:94] Read from: /public4/courses/ec3121/shareddata/Camellia_Sect_Chrysantha/bwa_hapbetter/wgs/deepvariant/tmp/tmp1gvo5vri/call_variants_output-00000-of-00001.tfrecord.gz
2024-08-14 02:13:18.938389: I deepvariant/postprocess_variants.cc:109] Total #entries in single_site_calls = 71894602
I0814 02:35:15.649105 140247404730176 postprocess_variants.py:1313] CVO sorting took 28.698674070835114 minutes
I0814 02:35:15.649988 140247404730176 postprocess_variants.py:1316] Transforming call_variants_output to variants.
I0814 02:40:15.761767 140247404730176 postprocess_variants.py:1211] Using sample name from call_variants output. Sample name: default
I0814 05:02:43.606994 140247404730176 postprocess_variants.py:1386] Processing variants (and writing to temporary file) took 142.46444999376934 minutes
I0814 06:01:36.673851 140247404730176 postprocess_variants.py:1407] Finished writing VCF and gVCF in 58.884093316396076 minutes.

real 235m30.029s
user 220m0.378s
sys 13m54.784s
`

Samples with problems:
`
I0814 16:35:25.856544 140492383778624 postprocess_variants.py:1211] Using sample name from call_variants output. Sample name: default
2024-08-14 16:35:25.879399: I deepvariant/postprocess_variants.cc:94] Read from: /public4/courses/ec3121/shareddata/Camellia_Sect_Chrysantha/bwa_hapbetter/wgs/deepvariant/tmp/tmpfrusl15j/call_variants_output-00000-of-00001.tfrecord.gz
2024-08-14 16:44:06.712469: I deepvariant/postprocess_variants.cc:109] Total #entries in single_site_calls = 92795573
I0814 17:09:30.584156 140492383778624 postprocess_variants.py:1313] CVO sorting took 34.07839868863424 minutes
I0814 17:09:30.621869 140492383778624 postprocess_variants.py:1316] Transforming call_variants_output to variants.
I0814 17:15:23.469285 140492383778624 postprocess_variants.py:1211] Using sample name from call_variants output. Sample name: default

real 744m1.767s
user 58m24.804s
sys 64m10.806s
`

I set --postprocess_variants_extra_args="cpus=0" following previous suggestions. This allowed more samples to finish, but others still did not.

melop commented 3 months ago

This was probably due to running out of RAM, since running on a node with 2TB of RAM produced an output. Perhaps there needs to be a way to limit RAM usage in subsequent versions.
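
One way to sanity-check the memory and disk questions raised in this thread before rerunning is to look at what the Linux environment actually has available. The following is a minimal sketch, assuming a Linux host with `/proc/meminfo`; the helper names `mem_available_gib` and `disk_free_gib` are made up for this example and are not part of DeepVariant.

```shell
#!/bin/sh
# Print available memory in GiB, read from a /proc/meminfo-style file.
# The path is a parameter so the function can also be tried on a saved copy.
mem_available_gib() {
    meminfo_file="${1:-/proc/meminfo}"
    # MemAvailable is reported in kB; 1 GiB = 1048576 kB.
    awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1048576 }' "$meminfo_file"
}

# Print free space in GiB for the filesystem that holds a directory.
disk_free_gib() {
    # -P forces one line per filesystem; -k reports 1024-byte blocks.
    df -Pk "${1:-.}" | awk 'NR == 2 { printf "%.1f\n", $4 / 1048576 }'
}

echo "Available RAM (GiB): $(mem_available_gib)"
echo "Free space in current directory (GiB): $(disk_free_gib .)"

# The kernel log usually records processes killed for lack of memory;
# reading it may require sudo:
# dmesg | grep -i -E "out of memory|killed process"
```

On WSL2 the Linux VM typically sees only part of the host's RAM (the limit can be changed in `.wslconfig`), so the reported number can be well below the installed memory. Note also that when DeepVariant runs in Docker without `--intermediate_results_dir`, the temporary files shown in the logs live inside the container's filesystem, not in the host directory checked here.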

vera-gomes commented 3 months ago

Hi @pichuan, I re-ran it with more disk space and had the same issue. Hi @melop, thank you for your input and for showing what a normal log should look like. Based on your findings, it is most likely a RAM issue, as @pichuan had already guessed.

Thank you both. I'll see what I can do on my end to solve the RAM issue, and I'll follow up.

pichuan commented 3 months ago

Hi @vera-gomes, one possible way to get around the RAM issue is to split the run into two or more runs using the --regions flag. For example, one run can cover the first 8 chromosomes and a second run can cover the rest, or something like that. At the end, you can combine the VCFs.
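
A minimal sketch of such a split, assuming the reference uses `chr`-prefixed contig names and that `--regions` is given a space-separated list of contigs. The helper names `chrom_range` and `print_run`, the output names `part1`/`part2`, and the `INPUT_DIR`/`OUTPUT_DIR` variables are hypothetical. The script only prints the two commands so they can be reviewed before running.

```shell
#!/bin/sh
# Build a space-separated list such as "chr1 chr2 chr3" for --regions.
# The "chr" prefix is an assumption; match the contig names in your reference.
chrom_range() {
    first="$1"
    last="$2"
    prefix="${3-chr}"
    list=""
    i="$first"
    while [ "$i" -le "$last" ]; do
        list="${list:+$list }${prefix}${i}"
        i=$((i + 1))
    done
    echo "$list"
}

# Two parts: chromosomes 1-8, then 9-22 plus the sex chromosomes.
# Add other contigs (mitochondrial, unplaced) here if you need them called.
PART1="$(chrom_range 1 8)"
PART2="$(chrom_range 9 22) chrX chrY"

# Print one run_deepvariant invocation restricted to the given regions.
print_run() {
    name="$1"
    regions="$2"
    cat <<EOF
docker run -v "\${INPUT_DIR}:/input" -v "\${OUTPUT_DIR}:/output" \\
  google/deepvariant:1.4.0 /opt/deepvariant/bin/run_deepvariant \\
  --model_type=WGS --ref=/input/genome.fa --reads=/input/sorted.bam \\
  --regions "${regions}" \\
  --output_vcf=/output/${name}.vcf.gz \\
  --output_gvcf=/output/${name}.g.vcf.gz \\
  --num_shards=4
EOF
}

print_run part1 "$PART1"
print_run part2 "$PART2"
```

Each printed command writes its own VCF and gVCF, so the runs can be executed one after another on the same machine and merged afterwards.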

melop commented 3 months ago

> Hi @vera-gomes , One possible way to get around the RAM issue is to split the run into two or more, using the --regions flag (For example, one run can run the first 8 chromosomes, and the second run can run the rest, or something like that). And then at the end you can combine the VCFs.

Perhaps a user-friendly way to implement this is to let deepvariant split the job by chromosomes/regions automatically? If this mainly affects the post processing step, perhaps just make this step automatically process the output by chromosome/regions of a fixed window size?

lucasbrambrink commented 3 months ago

Hi @melop,

Great suggestion! As luck would have it, this will be a feature in our next release (1.7.0). We have parallelized/sharded postprocess_variants across multiple CPUs, which helps to reduce its maximum RAM footprint. It also takes in a --regions flag directly so you can easily split up the process further if that's necessary (although it shouldn't be).

Until 1.7.0 is released, your only option is to follow @pichuan's suggestion. You can use bcftools concat to join the region-specific VCFs back together. I hope that helps!
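
The joining step could look roughly like the sketch below. The file names `part1.vcf.gz`, `part2.vcf.gz`, and `merged.vcf.gz` and the `run` dry-run helper are assumptions for illustration; by default the script only prints the bcftools commands.

```shell
#!/bin/sh
# Print each command when DRY_RUN=1 (the default in this sketch); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Join per-region VCFs into one bgzip-compressed file and index it.
# Usage: concat_vcfs OUTPUT.vcf.gz PART1.vcf.gz PART2.vcf.gz ...
concat_vcfs() {
    merged="$1"
    shift
    # Inputs must share the same sample columns and be listed in
    # reference order (the chr1-chr8 file before the file with the rest).
    run bcftools concat -O z -o "$merged" "$@" || return 1
    # Build a tabix (.tbi) index for the merged file.
    run bcftools index -t "$merged"
}

# Hypothetical file names for the two region-specific runs.
concat_vcfs merged.vcf.gz part1.vcf.gz part2.vcf.gz
```

Set `DRY_RUN=0` to run the commands for real. The region-specific gVCFs can be joined the same way by passing the `.g.vcf.gz` files instead.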

google / deepvariant

No output files (VCFs) for all genome #868