google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

Postprocess_variants.py ValueError: ptrue must be between zero and one: nan #849

Closed karoliinas closed 1 month ago

karoliinas commented 1 month ago

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.6.1/docs/FAQ.md: Yes

Describe the issue: I have processed around 30 samples, albeit with some issues with the GPU, possibly due to the nvidia driver / CUDA version. However, postprocess_variants has recently started failing with the error below. Any help troubleshooting this would be greatly appreciated!

Setup

Steps to reproduce:

CUDA Version 11.3.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

2024-07-10 12:07:21.275077: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
I0710 12:07:24.889796 139944337696576 postprocess_variants.py:1211] Using sample name from call_variants output. Sample name: sample1
I0710 12:09:25.874185 139944337696576 postprocess_variants.py:1313] CVO sorting took 2.0161957065264384 minutes
I0710 12:09:25.874843 139944337696576 postprocess_variants.py:1316] Transforming call_variants_output to variants.
I0710 12:09:25.874915 139944337696576 postprocess_variants.py:1318] Using 19 CPUs for parallelization of variant transformation.
I0710 12:09:45.096508 139944337696576 postprocess_variants.py:1211] Using sample name from call_variants output. Sample name: sample1
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1125, in _mappable_transform_call_variant_group_to_output_variant
    return _transform_call_variant_group_to_output_variant(**kwargs)
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1036, in _transform_call_variant_group_to_output_variant
    return add_call_to_variant(
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 434, in add_call_to_variant
    gq, variant.quality = compute_quals(predictions, index)
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 469, in compute_quals
    genomics_math.ptrue_to_bounded_phred(predictions[prediction_index])
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/com_google_deepvariant/third_party/nucleus/util/genomics_math.py", line 143, in ptrue_to_bounded_phred
    raise ValueError('ptrue must be between zero and one: {}'.format(ptrue))
ValueError: ptrue must be between zero and one: nan
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1419, in <module>
    app.run(main)
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/absl_py/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/absl_py/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1385, in main
    tmp_variant_file = dump_variants_to_temp_file(variant_generator)
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1067, in dump_variants_to_temp_file
    tfrecord.write_tfrecords(variant_protos, temp.name)
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/com_google_deepvariant/third_party/nucleus/io/tfrecord.py", line 190, in write_tfrecords
    for proto in protos:
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/com_google_deepvariant/deepvariant/haplotypes.py", line 91, in maybe_resolve_conflicting_variants
    for overlapping_candidates in _group_overlapping_variants(sorted_variants):
  File "/tmp/Bazel.runfiles_i47tupw0/runfiles/com_google_deepvariant/deepvariant/haplotypes.py", line 111, in _group_overlapping_variants
    for variant in sorted_variants:
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 420, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
ValueError: ptrue must be between zero and one: nan



**Does the quick start test work on your system?** Yes
Please test with https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-quick-start.md.
Is there any way to reproduce the issue by using the quick start?

**Any additional context:**
pichuan commented 1 month ago

Hi @karoliinas, from your log it seems the DeepVariant model has made a prediction with an unexpected numerical value.

From your log, I'm unable to tell why this has occurred.
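
For context on why this crashes where it does: postprocess_variants converts the model's probability for each genotype into a Phred-scaled quality, and that conversion rejects anything outside [0, 1], so a NaN probability trips the check in the traceback. A minimal sketch of the idea (not the exact nucleus implementation, just an illustration):

import math

def ptrue_to_phred_sketch(ptrue: float, max_qual: float = 999.0) -> float:
  """Rough sketch: convert P(call is correct) into a bounded Phred-scaled quality."""
  # NaN compares False against everything, so this range check also rejects nan.
  if not 0.0 <= ptrue <= 1.0:
    raise ValueError('ptrue must be between zero and one: {}'.format(ptrue))
  if ptrue == 1.0:
    return max_qual
  # Phred scale: quality = -10 * log10(P(error)) = -10 * log10(1 - ptrue).
  return min(-10.0 * math.log10(1.0 - ptrue), max_qual)

ptrue_to_phred_sketch(float('nan'))  # raises ValueError, as in your log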

In this command:

podman run -it --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ \
  --gpus 1 -v /data:/data --device nvidia.com/gpu=all \
  google/deepvariant:1.6.1-gpu \
  /opt/deepvariant/bin/postprocess_variants \
  --ref "/data/references/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz" \
  --infile "/data/variants/sample1.intermediate/call_variants_output.tfrecord.gz" \
  --outfile "/data/variants/sample1.vcf.gz" \
  --cpus "19" \
  --gvcf_outfile "/data/variants/sample1.g.vcf.gz" \
  --nonvariant_site_tfrecord_path "/data/variants/sample1.intermediate/gvcf.tfrecord@19.gz"

If you can share the sample1.intermediate/call_variants_output.tfrecord.gz file (and optionally the sample1.intermediate/gvcf.tfrecord@19.gz files) with me, I can look into the records and see which example has this issue. (Or, if you can narrow this down to a small BAM file, and if you can share that BAM file, that works too.) Please email pichuan@google.com if you can share.

If you can't share the files, we can think about what we can do here to help identify which example caused the issue.

karoliinas commented 1 month ago

Hi @pichuan, thank you for getting back so quickly! I'm working on patient data, so unfortunately it's not something I can share.

About the files you requested: I now see that there is no file called sample1.intermediate/call_variants_output.tfrecord.gz. Instead there's a number of files, call_variants_output-[00000-00015]-of-00016.tfrecord.gz. Also, instead of gvcf.tfrecord@19.gz there are gvcf.tfrecord-[00000-00018]-of-00019.gz.

That's probably why the above command for postprocess_variants doesn't work, right? I am using 19 threads, and I copied the command from --dry_run=true. So my question is: how do I pass multiple arguments to --infile and --nonvariant_site_tfrecord_path?

Thank you so much, I'd be very happy if it turns out I merely had a faulty command!

karoliinas commented 1 month ago

Oh, and since we're here, I mentioned problems with the GPU. It seems that DeepVariant is using CUDA version 11.3.1, but the nvidia driver (version 555.42.02) on the server reports CUDA version 12.5.

nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           Off |   00000000:00:07.0 Off |                    0 |
| N/A   28C    P0             35W /  250W |       1MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Should I update the driver, CUDA, or both? Also, the CUDA toolkit installed on the server is version 12.3, so I'm a bit confused as to where the 12.5 comes from.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

It's a right mess! Sometimes call_variants uses the GPU and at other times it stalls. That's why I'm now running the commands separately, and I have a lot of outputs from make_examples. Our IT support says they're happy the GPU works part of the time :) It's adding a lot of extra work, and I'm trying to come up with a solution.

Since it's completely unrelated to the topic, perhaps I should create a new issue instead?

pichuan commented 1 month ago

Hi @karoliinas ,

The format gvcf.tfrecord@19.gz refers to the files gvcf.tfrecord-[00000-00018]-of-00019.gz, so I don't think your commands are wrong. I think your call_variants_output-[00000-00015]-of-00016.tfrecord.gz files likely contain unexpected prediction values.
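
In case the naming is confusing: the @N spec is just shorthand for the numbered shard files. Roughly (a small illustration, not DeepVariant's actual parsing code):

def expand_sharded_spec(spec):
  """Expand e.g. 'gvcf.tfrecord@19.gz' into the 19 shard filenames it stands for."""
  base, rest = spec.split('@', 1)
  count, _, suffix = rest.partition('.')
  n = int(count)
  suffix = '.' + suffix if suffix else ''
  return ['{}-{:05d}-of-{:05d}{}'.format(base, i, n, suffix) for i in range(n)]

expand_sharded_spec('gvcf.tfrecord@19.gz')[0]   # 'gvcf.tfrecord-00000-of-00019.gz'
expand_sharded_spec('gvcf.tfrecord@19.gz')[-1]  # 'gvcf.tfrecord-00018-of-00019.gz'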

I'm out of office now. I'll give you some examples to debug the call_variants_output next week!
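
If you want to poke at the records yourself in the meantime, something along these lines might work from inside the DeepVariant container (a rough, untested sketch; it assumes the deepvariant protos are importable there and that the shards contain gzipped CallVariantsOutput records):

import glob
import math

import tensorflow as tf
from deepvariant.protos import deepvariant_pb2

shards = sorted(glob.glob(
    '/data/variants/sample1.intermediate/call_variants_output-*-of-*.tfrecord.gz'))

for shard in shards:
  for raw in tf.data.TFRecordDataset(shard, compression_type='GZIP'):
    cvo = deepvariant_pb2.CallVariantsOutput.FromString(raw.numpy())
    # Flag any record whose genotype probabilities are NaN or outside [0, 1].
    if any(math.isnan(p) or not 0.0 <= p <= 1.0
           for p in cvo.genotype_probabilities):
      print(shard, cvo.variant.reference_name, cvo.variant.start,
            list(cvo.genotype_probabilities))

Anything that prints here would be a candidate for the record that makes postprocess_variants raise the ptrue error.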

karoliinas commented 1 month ago

Hi @pichuan, thanks for clearing this up! When you get the chance, please let me know what to look for in the call_variants output. Also, I'm not sure I understand the format: using zcat I get many very short lines. I see AD, DP and VAF, but I'm not sure how to read the variant positions / probabilities.

Many thanks! -Karoliina

akolesnikov commented 1 month ago

One thing I noticed is that you have multiple outputs: one generated with --cpus=16 and one with --cpus=19. If you set cpus to 19, then the expected input to postprocess should be call_variants_output-[00000-00018]-of-00019.tfrecord.gz, but in your comment you have call_variants_output-[00000-00015]-of-00016.tfrecord.gz. So it looks like you had multiple runs with different cpus settings, which could be the reason postprocess fails. I suggest cleaning the call_variants output, or rerunning it pointing to a different directory, and then running postprocess with the same number of CPUs.
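
A quick way to sanity-check this before rerunning is to count the shard files on disk and compare against the "-of-NNNNN" suffix they declare (paths taken from your earlier command; adjust as needed):

import glob
import re

shards = glob.glob(
    '/data/variants/sample1.intermediate/call_variants_output-*-of-*.tfrecord.gz')
declared = {int(re.search(r'-of-(\d+)', s).group(1)) for s in shards}
print(len(shards), 'shard files found; declared shard counts:', declared)
# For a clean run with --cpus=19 you would expect 19 files, all "-of-00019".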

karoliinas commented 1 month ago

Nice catch! Thank you! There do seem to be different numbers of outputs, though to my knowledge I've only used the commands from the full run with 19 CPUs and --dry_run=true. I've cleared the output directories and am rerunning the full command from the beginning.

Which nvidia driver / CUDA combination do you run DeepVariant with? I'm looking into the GPU problem, and I'm thinking I need to install an older nvidia driver and CUDA. Currently the driver is 555.42.02, but looking at https://docs.nvidia.com/deploy/cuda-compatibility/#id1 , it's not compatible with CUDA 11.3.1, which deepvariant:1.6.1-gpu is using.

karoliinas commented 1 month ago

Hello @pichuan, I ran the full deepvariant pipeline after deleting all output directories from the previous run. It seems call_variants outputs only 16 files to the intermediate dir, whereas make_examples outputs 19 (with --num_shards 19). Here's the full command:

podman run -it --rm \
  -e LD_LIBRARY_PATH=/usr/bin:/usr/lib/nvidia:/usr/local/nvidia/:/usr/local/cuda-12.3/lib64:/usr/local/cuda-12.3/bin:/usr/local/lib/python3.8/dist-packages/tensorrt_libs/ \
  --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ \
  --gpus 1 -v /data:/data --device nvidia.com/gpu=all \
  google/deepvariant:1.6.1-gpu \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --regions 'chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM' \
  --num_shards 19 \
  --ref=/data/references/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz \
  --reads=/data/bamfiles/sample1.E250013.L1.hg38.rg.bam \
  --output_vcf=/data/variants/sample1.vcf.gz \
  --output_gvcf=/data/variants/sample1.g.vcf.gz \
  --intermediate_results_dir=/data/variants/sample1.intermediate \
  --logging_dir=/data/variants/sample1.logs

Adding the LD_LIBRARY_PATH argument gets rid of the error messages about libnvinfer; however, I still get the CUDA error:

2024-07-16 14:14:08.323907: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1278] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error

call_variants did use the GPU, though, and ran in about half an hour. Then postprocess_variants halts with: ValueError: ptrue must be between zero and one: nan

(Full error log in the first message) I'll try to play around with --num_shards next.

pichuan commented 1 month ago

Hi @karoliinas,

Given that you're getting weird numerical prediction values in the call_variants output, and that you mentioned your GPU driver / CUDA version is newer than what we used in DeepVariant 1.6, I strongly suspect your GPU + DeepVariant setup is producing unexpected output.

Would it be possible for you to:

  1. Use a compatible GPU driver version? (I understand this is annoying. We've already made the CUDA update internally, and it'll be out in the next version. But if it's possible to test with a compatible driver, that might be easier for you.)
  2. Just to confirm whether it's a hardware issue: can you run with the CPU version and see if it still crashes with the same error? That will help us identify whether it's the hardware or actually something unexpected in your input file.

karoliinas commented 1 month ago

Hi @pichuan, thanks for looking into this! And you're right: though changing --num_shards to 16 did result in the same number of files from make_examples and call_variants, the error remains.

Which driver / CUDA versions are supported? The server I have is RHEL9, and the oldest available driver begins with 515, with CUDA 11.7. Would they do? I checked the available drivers from the NVIDIA repo: https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/

Do you have any estimate as to when the new DeepVariant version will be out? It might be easier to wait than to set up a new server (I'm not sure we have RHEL8 available).

I'll try to run it with the CPU version for now and will let you know how it goes. The funny thing is, I already processed 19 samples with this setup.

karoliinas commented 1 month ago

Hi again, indeed the CPU version works. Boy am I glad there's no problem with the data! We'll have to wait for the DeepVariant update to get the GPU going. Will it be using the same model, so that samples processed with the new version will be compatible with samples processed with the current one?

I'm getting new samples in every week, and we will have ~150 in a couple of months, so I'd like to avoid running them twice. The other option would be to upgrade the VM to one without a GPU and with more CPUs, and continue with the current CPU version.

pichuan commented 1 month ago

Hi @karoliinas, for stability and reproducibility, using the CPU version is likely the better way to go.

In terms of GPU updates: in our development branch (https://github.com/google/deepvariant/tree/dev) we have actually updated the CUDA version (and Ubuntu version); for example, you can see https://github.com/google/deepvariant/blob/dev/Dockerfile. But we won't be building a new Docker image until the next release.

I'll close this now. Please let us know if you have more questions.