google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.25k stars 728 forks source link

No output vcf obtained from custom model #869

Closed bioinformaticaomicalabs closed 2 months ago

bioinformaticaomicalabs commented 3 months ago

ISSUE First of all, I found DeepVariant to be a very good and innovative tool. I'm considering including it in my exome analysis pipeline. I followed the tutorial (DeepVariant worked correctly with the Complete Genomics model), and I created my own model using Genome in a Bottle samples. To do this, I sequenced the same reference sample three times to use each BAM file for training, validation, and testing. I didn't encounter any errors during the model creation process, but when I tried to test it, the process got stuck at the call_variants step.

Setup

Steps to reproduce:

real 34m15.728s user 624m43.553s sys 2m24.932s

Running the command: time /opt/deepvariant/bin/call_variants --outfile "/output/intermediate_results_dir/call_variants_output.tfrecord.gz" --examples "/output/intermediate_results_dir/make_examples.tfrecord@20.gz" --checkpoint "/output/checkpoints/ckpt-679"

/usr/local/lib/python3.8/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning:

TensorFlow Addons (TFA) has ended development and introduction of new features. TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024. Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP).

For more information see: https://github.com/tensorflow/addons/issues/2807

warnings.warn( I0822 07:52:10.812179 127086447671104 call_variants.py:563] Total 1 writing processes started. I0822 07:52:10.813103 127086447671104 dv_utils.py:370] From /output/intermediate_results_dir/make_examples.tfrecord-00000-of-00020.gz.example_info.json: Shape of input examples: [100, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19]. I0822 07:52:10.813141 127086447671104 call_variants.py:588] Shape of input examples: [100, 221, 7] I0822 07:52:10.813338 127086447671104 call_variants.py:592] Use saved model: True Traceback (most recent call last): File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 789, in app.run(main) File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/absl_py/absl/app.py", line 312, in run _run_main(main, args) File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/absl_py/absl/app.py", line 258, in _run_main sys.exit(main(argv)) File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 768, in main call_variants( File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 598, in call_variants model_example_shape = dv_utils.get_shape_and_channels_from_json( File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/dv_utils.py", line 367, in get_shape_and_channels_from_json example_info = json.load(f) File "/usr/lib/python3.8/json/init.py", line 293, in load return loads(fp.read(), File "/usr/lib/python3.8/json/init.py", line 357, in loads return _default_decoder.decode(s) File "/usr/lib/python3.8/json/decoder.py", line 337, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Process ForkProcess-1: Traceback (most recent call last): File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/tmp/Bazel.runfiles_5v5s5_vp/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 454, in post_processing item = output_queue.get(timeout=180) File "/usr/lib/python3.8/multiprocessing/queues.py", line 108, in get raise Empty _queue.Empty

real 3m2.335s user 0m7.450s sys 0m4.274s`

Does the quick start test work on your system? Yes

lucasbrambrink commented 3 months ago

Hi!

Thanks for the kind words. The issue seems to be that call_variants expects a JSON file with the checkpoint.

model_example_info_json = f'{checkpoint_path}/example_info.json'
model_example_shape = dv_utils.get_shape_and_channels_from_json(
    model_example_info_json
)

Can you confirm that this file exists under "/output/checkpoints/ckpt-679"? If it exists, could you paste the content of it here?

bioinformaticaomicalabs commented 3 months ago

Hi!

Thanks for the kind words. The issue seems to be that call_variants expects a JSON file with the checkpoint.

model_example_info_json = f'{checkpoint_path}/example_info.json'
model_example_shape = dv_utils.get_shape_and_channels_from_json(
    model_example_info_json
)

Can you confirm that this file exists under "/output/checkpoints/ckpt-679"? If it exists, could you paste the content of it here?

Hello, first, thank you very much for your help. Yes, the example_info.json exist, but do not have any content. Should it contain some information? Captura desde 2024-08-27 08-57-05

kishwarshafin commented 3 months ago

@bioinformaticaomicalabs , yes it should contain some information like this:

gsutil cat gs://deepvariant/models/DeepVariant/1.6.1/checkpoints/wgs/example_info.json
{"version": "1.6.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}⏎
kishwarshafin commented 3 months ago

Can you check if you have example_info.json output in your training data and validation data generation folders and if they are the same? If same then you can copy it to the directory and use it. The training loop is supposed to copy the example_info.json from training folder to the checkpoint output directory. Not sure if it was missing in your setup.

bioinformaticaomicalabs commented 2 months ago

Thank you very much, that finally resolved the issue. What happens now is that after performing the evaluation (the training, test, and validation sets are the sequencing of the NA12878 sample from the Coriell Institute sequenced three times), I’m getting low recall and precision values. For indels, recall is 0.41 and precision is 0.24, and for SNPs, recall is 0.57 and precision is 0.72. I tried the model you provided for exome sequencing, but I didn’t achieve better results (which is why I decided to create my own model). However, with typical tools (like GATK HaplotypeCaller), I get much better results (indels with 0.8 recall and 0.62 precision, and SNPs with 0.89 recall and 0.97 precision). Do you have any idea why this might be happening and any advice on how to solve it? I really believe that using a variant caller trained with my data should yield better results than GATK HaplotypeCaller

kishwarshafin commented 2 months ago

@bioinformaticaomicalabs , if you have a benchmarking bam file that you can share then I can try to run it on our default model and see if it matches your expectations. However, why training didn't work would require some more investigation on how exactly you are training and preparing your data.

bioinformaticaomicalabs commented 2 months ago

Hello, the bam file es 13 GB, do you know how can i send it to you?

kishwarshafin commented 2 months ago

@bioinformaticaomicalabs if you have GCP then that would be the best way for us. Any public bucket that you currently use to share data works for us. If the data is uploaded to SRA then you can link it to me but it might take a bit longer in that case. However you prefer, please send it to shafin@google.com

kishwarshafin commented 2 months ago

@bioinformaticaomicalabs , closing this due to no activity. If you think you need further help, please reopen or send an email.