Closed. crazysummerW closed this issue 9 months ago.
This looks like an OS error. It also looks like the error is raised while building the model graph (`backbone = add_l2_regularizers`). Could it be that you ran out of RAM?
Hi @akolesnikov, I have about 90 GB of memory and I only analyzed chromosome 20. I had no problem running DV1.6 on the same server. However, when I specified the T7 model parameters (`--customized_model model/weights-51-0.995354.ckpt`), an error occurred. What should I do about this?
Hi @pichuan, would you mind giving me some advice on the issue I encountered? Thank you very much.
@crazysummerW how was this model created? Did you follow the model training case study? Could you include the command line you used to build the model?
Hi @akolesnikov,
In the v1.6 release notes, I noticed that DV1.6 added new models trained with Complete Genomics data, along with case studies.
I followed your doc: https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-complete-t7-case-study.md
The model file was downloaded from here:
Hi @crazysummerW ,
Looking at your error, it seems like this might be relevant:
File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/files.py", line 241, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 122, in h5py.h5f.create
OSError: [Errno 5] Unable to synchronously create file (unable to lock file, errno = 5, error message = 'Input/output error')
This is because this logic in our code writes a temp file: https://github.com/google/deepvariant/blob/r1.6/deepvariant/keras_modeling.py#L97-L99
tmp_weights_dir = tempfile.gettempdir()
tmp_weights_path = os.path.join(tmp_weights_dir, 'tmp_weights.h5')
model.save_weights(tmp_weights_path)
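Since the snippet above always writes to the same fixed `tmp_weights.h5` path, one way to sidestep collisions (a hypothetical sketch, not the actual DeepVariant code) is to let `tempfile.mkstemp` pick a path that is guaranteed unique per process:

```python
import os
import tempfile

# Hypothetical alternative to the fixed 'tmp_weights.h5' path:
# mkstemp returns a path unique to this process, so two concurrent
# call_variants runs can never lock the same file.
fd, tmp_weights_path = tempfile.mkstemp(suffix=".h5", prefix="tmp_weights_")
os.close(fd)  # close our handle; Keras reopens the file itself
# model.save_weights(tmp_weights_path)  # as in keras_modeling.py
print(tmp_weights_path)
os.remove(tmp_weights_path)  # clean up once the weights are loaded
```

The trade-off is that each run leaves its own file behind unless it cleans up explicitly, which is why the explicit `os.remove` matters here.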
Can you check your setup and see whether your run was somehow unable to create a temp file?
I reran our setup in https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-complete-t7-case-study.md (using a GCP machine as an example) and wasn't able to reproduce the error. So it would be very helpful for me to understand your machine setup, so we can make our code more robust in the future.
Thank you!
@pichuan I tested the docker deepvariant:1.6 image on a CPU-only machine, and I changed the tmp dir:
mkdir -p output/intermediate_results_dir
mkdir -p output/tmp_dir
export TMPDIR="$PWD/output/tmp_dir"
Does this have any impact?
@crazysummerW,
Sorry for the late reply, but no, changing the temp directory will not have any effect.
@crazysummerW I had the same issue here.
It turned out to be a problem with the h5 file in the tmp dir: if multiple programs open the h5 file simultaneously, the error occurs. I avoided it by creating a unique tmp dir for each sample, though that uses a lot of file handles.
@pichuan @kishwarshafin Could you please take a look at this? Renaming the h5 file so it is unique would probably be a simple and easy solution.
Hi @ZuyaoLiu ,
Just to clarify: you mean that you're running multiple call_variants processes on the same machine at the same time, so they're all opening the same tmp file?
If that's the case, then I can see that being an issue. I'll file an internal issue to track it, and we'll name the h5 file uniquely. Our current code is https://github.com/google/deepvariant/blob/r1.6/deepvariant/keras_modeling.py#L98, and I can see this being a problem when multiple call_variants runs happen at once. We'll make sure to create a more unique filepath in the future to avoid this issue!
On the other hand, historically we don't recommend running multiple call_variants runs on the same machine, because TensorFlow already parallelizes across multiple CPUs.
@ZuyaoLiu @crazysummerW Just for my sanity check, can you confirm that if you run just one call_variants on the machine, then it worked? (I want to make sure there are no other issues)
@pichuan ,
Yes, you understood correctly. I run call_variants on a cluster where each node handles a single job. Since I had set a shared private tmp dir, the jobs all targeted the same h5 file, which caused the issue.
Currently I run the program with a unique tmp path per job, so the jobs never use the same h5 file simultaneously. They all worked well and finished with no errors.
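The per-job workaround described above can be sketched as follows (paths and naming are illustrative, not from the DeepVariant docs): give every job its own `TMPDIR` before anything calls `tempfile.gettempdir()`, so no two jobs write the same `tmp_weights.h5`.

```python
import os
import tempfile
import uuid

# Sketch of a per-job unique temp directory. uuid4 makes the path
# unique even when many jobs start on the same node at once.
job_tmp = os.path.join(os.getcwd(), "tmp_dir", uuid.uuid4().hex)
os.makedirs(job_tmp, exist_ok=True)
os.environ["TMPDIR"] = job_tmp

# tempfile caches its choice after the first lookup; resetting the
# cache forces it to re-read TMPDIR.
tempfile.tempdir = None
print(tempfile.gettempdir())  # now points at the per-job directory
```

On a cluster, the equivalent is usually done in the job script by exporting `TMPDIR` to a job-specific directory before launching call_variants.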
Hi @pichuan, I ran just one call_variants on the machine, but it did not work.
Hi @crazysummerW, it seems like you might have a different issue. Is it possible that, in your setup, you don't have write access to the directory that `tempfile.gettempdir()` gave you? I'll need more information from you to pinpoint the issue (because I can't reproduce it on my side yet).
For example, on my machine:
$ python -c 'import tempfile; foo=tempfile.gettempdir(); print(foo)'
/tmp
$ export TMPDIR=${HOME}; python -c 'import tempfile; foo=tempfile.gettempdir(); print(foo)'
/home/pichuan
@crazysummerW, I wonder if it's possible that you don't have write access to your /tmp? If so, can you try setting TMPDIR to a directory that you do have write access to?
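A quick diagnostic along these lines (illustrative, not part of DeepVariant) is to check whether the directory `tempfile` resolves to is actually writable by creating a throwaway file in it:

```python
import os
import tempfile

# Check where tempfile will write, and whether we can actually
# create and write a file there.
tmp_dir = tempfile.gettempdir()
print("tempdir:", tmp_dir, "| writable flag:", os.access(tmp_dir, os.W_OK))
try:
    with tempfile.NamedTemporaryFile(dir=tmp_dir) as f:
        f.write(b"ok")
    print("create/write test passed")
except OSError as e:
    print("cannot create files in", tmp_dir, "->", e)
```

If the create/write test fails here, `model.save_weights` would fail in the same way, which would explain an `Errno 5` at file-creation time.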
Hi @crazysummerW, I'm curious whether you were able to resolve this.
Given that there haven't been any updates for 2 months now, I'll close this for now. Please feel free to reopen if you still have issues, or to post updates if you have new findings. Thank you!
Hello, I tested the T7 model on WGS data using DV1.6, but I keep getting the following error message. I generated the test data using the T7 platform for sequencing. Could you please tell me what went wrong? My cmd:
Error message:
Looking forward to your reply. Thanks.