Closed. crazysummerW closed this issue 9 months ago.
This looks like an OS error. It also looks like the error is raised while building the model graph (`backbone = add_l2_regularizers`). Could it be that you ran out of RAM?
Hi @akolesnikov, I have about 90 GB of memory and I only analyzed chromosome 20. I had no problem running DV1.6 on the same server. However, when I specified the T7 model parameters (`--customized_model model/weights-51-0.995354.ckpt`), an error occurred. What should I do about this?
Hi @pichuan, would you mind giving me some advice on the issue I encountered? Thank you very much.
@crazysummerW how was this model created? Did you follow the model training case study? Could you include the command line you used to build the model?
Hi @akolesnikov,
In the v1.6 release notes, I noticed that DV1.6 added new models trained with Complete Genomics data, along with case studies.
I followed your doc: https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-complete-t7-case-study.md
The model file was downloaded from here:
Hi @crazysummerW ,
Looking at your error, it seems like this might be relevant:
File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/files.py", line 241, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 122, in h5py.h5f.create
OSError: [Errno 5] Unable to synchronously create file (unable to lock file, errno = 5, error message = 'Input/output error')
This is because this logic in our code writes a temp file: https://github.com/google/deepvariant/blob/r1.6/deepvariant/keras_modeling.py#L97-L99
tmp_weights_dir = tempfile.gettempdir()
tmp_weights_path = os.path.join(tmp_weights_dir, 'tmp_weights.h5')
model.save_weights(tmp_weights_path)
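Since the snippet above always writes to the same fixed `tmp_weights.h5` path, one way to sidestep collisions (a hypothetical sketch, not the actual DeepVariant code) is to let `tempfile.mkstemp` pick a path that is guaranteed unique per process:

```python
import os
import tempfile

# Hypothetical alternative to the fixed 'tmp_weights.h5' path:
# mkstemp returns a path unique to this process, so two concurrent
# call_variants runs can never lock the same file.
fd, tmp_weights_path = tempfile.mkstemp(suffix=".h5", prefix="tmp_weights_")
os.close(fd)  # close our handle; Keras reopens the file itself
# model.save_weights(tmp_weights_path)  # as in keras_modeling.py
print(tmp_weights_path)
os.remove(tmp_weights_path)  # clean up once the weights are loaded
```

The trade-off is that each run leaves its own file behind unless it cleans up explicitly, which is why the explicit `os.remove` matters here.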
Can you check your setup and see whether your run was somehow unable to create a temp file?
I reran our setup in https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-complete-t7-case-study.md (using a GCP machine as an example) and wasn't able to reproduce the error. So it would be very helpful for me to understand your machine setup, so we can make our code more robust in the future.
Thank you!
@pichuan I tested the docker deepvariant:1.6 image on a CPU-only machine, and I changed the tmp dir:
mkdir -p output/intermediate_results_dir
mkdir -p output/tmp_dir
export TMPDIR="$PWD/output/tmp_dir"
Does this have any impact?
@crazysummerW,
Sorry for the late reply, but no, changing the temp directory will not have any effect.
@crazysummerW I had the same issue here.
It turned out to be a problem with the h5 file in the tmp dir: if multiple programs open the h5 file simultaneously, the error occurs. I avoided it by creating a unique tmp dir for each sample, though that uses a lot of file handles.
@pichuan @kishwarshafin Could you please take a look at this? Renaming the h5 file so it is unique would probably be a simple and easy solution.
Hi @ZuyaoLiu ,
Just to clarify: you mean that you're running multiple call_variants processes on the same machine at the same time, so they're all opening the same tmp file?
If that's the case, then I can see that being an issue. I'll file an internal issue to track it, and we'll name the h5 file uniquely. Our current code is https://github.com/google/deepvariant/blob/r1.6/deepvariant/keras_modeling.py#L98, and I can see this being a problem when multiple call_variants runs happen at once. We'll make sure to create a more unique filepath in the future to avoid this issue!
On the other hand, historically we don't recommend running multiple call_variants runs on the same machine, because TensorFlow already parallelizes across multiple CPUs.
@ZuyaoLiu @crazysummerW Just for my sanity check, can you confirm that if you run just one call_variants on the machine, then it worked? (I want to make sure there are no other issues)
@pichuan ,
Yes, you understood correctly. I run call_variants on a cluster where each node handles a single job. Since I had set a shared private tmp dir, the jobs all targeted the same h5 file, which caused the issue.
Currently I run the program with a unique tmp path per job, so the jobs never use the same h5 file simultaneously. They all worked well and finished with no errors.
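The per-job workaround described above can be sketched as follows (paths and naming are illustrative, not from the DeepVariant docs): give every job its own `TMPDIR` before anything calls `tempfile.gettempdir()`, so no two jobs write the same `tmp_weights.h5`.

```python
import os
import tempfile
import uuid

# Sketch of a per-job unique temp directory. uuid4 makes the path
# unique even when many jobs start on the same node at once.
job_tmp = os.path.join(os.getcwd(), "tmp_dir", uuid.uuid4().hex)
os.makedirs(job_tmp, exist_ok=True)
os.environ["TMPDIR"] = job_tmp

# tempfile caches its choice after the first lookup; resetting the
# cache forces it to re-read TMPDIR.
tempfile.tempdir = None
print(tempfile.gettempdir())  # now points at the per-job directory
```

On a cluster, the equivalent is usually done in the job script by exporting `TMPDIR` to a job-specific directory before launching call_variants.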
Hi @pichuan, I ran just one call_variants on the machine, but it did not work.
Hi @crazysummerW, it seems like you might have a different issue. Is it possible that, in your setup, you don't have write access to the directory that `tempfile.gettempdir()` gave you? I'll need more information from you to pinpoint the issue (because I can't reproduce it on my side yet).
For example, on my machine:
$ python -c 'import tempfile; foo=tempfile.gettempdir(); print(foo)'
/tmp
$ export TMPDIR=${HOME}; python -c 'import tempfile; foo=tempfile.gettempdir(); print(foo)'
/home/pichuan
@crazysummerW, I wonder if it's possible that you don't have write access to your /tmp? If so, can you try setting TMPDIR to a directory that you do have write access to?
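A quick diagnostic along these lines (illustrative, not part of DeepVariant) is to check whether the directory `tempfile` resolves to is actually writable by creating a throwaway file in it:

```python
import os
import tempfile

# Check where tempfile will write, and whether we can actually
# create and write a file there.
tmp_dir = tempfile.gettempdir()
print("tempdir:", tmp_dir, "| writable flag:", os.access(tmp_dir, os.W_OK))
try:
    with tempfile.NamedTemporaryFile(dir=tmp_dir) as f:
        f.write(b"ok")
    print("create/write test passed")
except OSError as e:
    print("cannot create files in", tmp_dir, "->", e)
```

If the create/write test fails here, `model.save_weights` would fail in the same way, which would explain an `Errno 5` at file-creation time.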
Hi @crazysummerW, I'm curious whether you were able to resolve this.
Given that there haven't been any updates for 2 months now, I'll close this for now. Please feel free to reopen if you still have issues, or to post updates if you have new findings. Thank you!
Hello, I tested the T7 model on WGS data using DV1.6, but I keep getting the following error message. I generated the test data using the T7 platform for sequencing. Could you please tell me what went wrong? My cmd:
Error message:
Looking forward to your reply. Thanks.