google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Error during training with V1.6.0 #722

Closed ZuyaoLiu closed 11 months ago

ZuyaoLiu commented 1 year ago

Dear developers,

When trying to train on my own data with the latest v1.6.0, some error messages popped up. It seems that some necessary libraries are missing:

W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2023-10-25 17:00:55.064391: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/usr/local/lib/python3.8/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning:

TensorFlow Addons (TFA) has ended development and introduction of new features. TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024. Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP)

Then, when training finished, I got this error:

Saving model using saved_model format.
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. model.compile_metrics will be empty until you train or evaluate the model.
W1025 22:01:58.210216 140172092593984 saving_utils.py:359] Compiled the loaded model, but the compiled metrics have yet to be built. model.compile_metrics will be empty until you train or evaluate the model.
W1025 22:02:31.766536 140172092593984 save.py:271] Found untraced functions such as _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op while saving (showing 5 of 94). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: /home/train_new/checkpoints/ckpt-150/assets
I1025 22:02:39.405452 140172092593984 builder_impl.py:797] Assets written to: /home/train_new/checkpoints/ckpt-150/assets
WARNING:tensorflow:Detecting that an object or model or tf.train.Checkpoint is being deleted with unrestored values. See the following logs for the specific values in question. To silence these warnings, use status.expect_partial(). See https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint#restore for details about the status object returned by the restore function.
W1025 22:02:44.960290 140172092593984 checkpoint.py:205] Detecting that an object or model or tf.train.Checkpoint is being deleted with unrestored values. See the following logs for the specific values in question. To silence these warnings, use status.expect_partial(). See https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint#restore for details about the status object returned by the restore function.
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.iter
W1025 22:02:44.960591 140172092593984 checkpoint.py:214] Value in checkpoint could not be found in the restored object: (root).optimizer.iter
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.awg_optimizer.decay
W1025 22:02:44.960684 140172092593984 checkpoint.py:214] Value in checkpoint could not be found in the restored object: (root).optimizer.awg_optimizer.decay
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.awg_optimizer.momentum
W1025 22:02:44.960754 140172092593984 checkpoint.py:214] Value in checkpoint could not be found in the restored object: (root).optimizer.awg_optimizer.momentum
.....

In the final checkpoint folder, the assets directory is empty.

Thank you.

ZuyaoLiu commented 1 year ago

I solved the first error (libcublas.so.12) by creating a sandbox with Singularity and adding the location of libcublas.so.12 to the environment. I guess creating a soft link with ln -s would also work.
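For reference, a minimal sketch of what I did; the /opt/cuda-12/lib64 path and the image name are placeholders for wherever your host keeps its CUDA 12 libraries and however you built the container:

```bash
# Sketch only: expose a host directory containing libcublas.so.12 to the container
# and prepend it to the container's LD_LIBRARY_PATH (the default path from the
# error log above is kept after it).
export SINGULARITYENV_LD_LIBRARY_PATH=/opt/cuda-12/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs

singularity exec --nv \
  --bind /opt/cuda-12/lib64:/opt/cuda-12/lib64 \
  deepvariant-1.6.0-gpu.sif \
  ls -l /opt/cuda-12/lib64/libcublas.so.12   # confirm the library is visible inside

# Alternative (writable sandbox): symlink the library into a directory that is
# already on the container's LD_LIBRARY_PATH.
# ln -s /opt/cuda-12/lib64/libcublas.so.12 /usr/local/nvidia/lib64/libcublas.so.12
```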

pichuan commented 1 year ago

Thank you @ZuyaoLiu for the question and the solution. We'll give it a try and improve this.

One question for you @ZuyaoLiu , did you have any issues using Singularity + GPU to run variant calling as well, or are you seeing this issue with training only? (I'm curious because I've tried Singularity+GPU for variant calling, and that worked fine. But I haven't personally tried Singularity+GPU training yet. So I'll first want to see if I can reproduce that issue)

ZuyaoLiu commented 1 year ago


So far, I haven't tested the variant calling module. Version 1.6.0 was released when I was about to start the training step, so my SNP matrices and examples are from v1.5.0. But I will give it a try after finishing my training step and let you know if it works.

Best, Zuyao

ZuyaoLiu commented 1 year ago


Hi @pichuan ,

When I ran the GPU version, I got these error messages, but it still finished SNP calling and all the output files seemed fine.

== CUDA ==
CUDA Version 11.3.1

2023-10-30 00:30:39.544727: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1278] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error
2023-10-30 00:30:40.075465: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1278] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error
2023-10-30 00:30:40.617831: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1278] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error
(the same CUDA_ERROR_NOT_INITIALIZED line repeats 13 more times, through 2023-10-30 00:30:47.638081)

Do you have any idea?

Thank you

pichuan commented 1 year ago

Hi @ZuyaoLiu

When I tested calling and training, I also saw that message, but in both my calling and training runs the GPU was utilized.

We added an entry in the FAQ: https://github.com/google/deepvariant/blob/r1.6/docs/FAQ.md#why-am-i-seeing-cuda_error_not_initialized-initialization-error-while-running-on-gpu

and I mentioned that message in https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-training-case-study.md#test-the-model as well.

@ZuyaoLiu , can you help check whether the calling results are reasonable on your side, and whether the GPU is utilized or not?

And similarly for the training case: some of the warning messages you saw might not affect the results. Can you also check whether you can run through the steps (and whether the GPU is utilized or not)?
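(In case it helps, this is roughly how I watch utilization on the host while call_variants or training is running; it is plain nvidia-smi, nothing DeepVariant-specific:)

```bash
# Refresh nvidia-smi every 5 seconds while the run is in progress.
watch -n 5 nvidia-smi

# Or log utilization and memory to a CSV for later inspection.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 5 > gpu_usage.csv
```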

Thank you!

ZuyaoLiu commented 1 year ago

Hi @pichuan ,

Yes, the error messages do not seem to affect the run, and the GPU was utilized during it. I will run another round of the training step and check the GPU utilization and the messages.

Also, I have two questions about the training step.

  1. In the blog post "Improved non-human variant calling using species-specific DeepVariant models", within a trio, one offspring was used to establish the "silver dataset", and five other progenies were used for training and evaluation. So the individuals used for generating the truth dataset and for training are different. However, in the DeepVariant documentation, it seems that the truth dataset and the training and evaluation examples all come from HG001, the same individual.

Do the individuals used for generating the truth dataset and for training have to be different? If not, could you please explain why you used different individuals in the first case?

  2. If I have multiple trios, can I first call a "silver set" separately with one child from each trio, make examples for that same child against its silver set, and then shuffle all the examples together for training?
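Roughly what I have in mind, as a sketch only; the flags follow my reading of the shuffle_tfrecords_beam.py usage in the DeepVariant training case study, and the file patterns and dataset names are placeholders, so please correct me if this is not the intended way:

```bash
# Sketch: shuffle labeled examples from the children of several trios into one training set.
# Patterns and names are placeholders; check the script's --help for the exact flags.
python3 shuffle_tfrecords_beam.py \
  --input_pattern_list="${DIR}/trio1_child.with_label.tfrecord-?????-of-?????.gz,${DIR}/trio2_child.with_label.tfrecord-?????-of-?????.gz" \
  --output_pattern_prefix="${DIR}/all_trios.training_set.shuffled" \
  --output_dataset_name="all_trios" \
  --output_dataset_config_pbtxt="${DIR}/all_trios.dataset_config.pbtxt" \
  --runner=DirectRunner
```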

Thank you!

Zuyao

pichuan commented 1 year ago

Hi @ZuyaoLiu, in the training tutorial we used one individual as an example. Our release models are trained on more; you can see https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-details-training-data.md for more information.

The training data we're using are from NIST Genome in a Bottle (GIAB): https://www.nist.gov/programs-projects/genome-bottle. We train on data from HG001, HG002, HG004, HG005, HG006, and HG007, and leave HG003 out of training.

The "silver dataset" in the blog post you mentioned was used because there isn't high quality truth set for mosquitoes, like the ones for human from GIAB.

Hope this helps.

ZuyaoLiu commented 1 year ago


Hi @pichuan ,

I finished testing with a trio from a non-model species and found a huge difference. Basically, under the default human model, v1.6 produces more sites violating Mendelian rules than v1.5. The post is here: #726.
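(For reproducing this kind of comparison, one option is a trio Mendelian-consistency count with bcftools' +mendelian plugin; the sketch below uses placeholder sample names, and the plugin's options differ between bcftools versions, so check its usage output first.)

```bash
# Print the plugin usage first, since options vary across bcftools versions.
bcftools +mendelian

# Count Mendelian-inconsistent sites in the trio VCF (sample names are placeholders);
# on the bcftools versions I have used, this looks roughly like:
bcftools +mendelian trio.vcf.gz -t mother,father,child -c
```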

pichuan commented 11 months ago

Hi @ZuyaoLiu, I was out, so I didn't follow up in the past few weeks. Reading back through the previous comments, is my understanding correct that 1) the GPU was being used despite the messages, and 2) you have a separate question about Mendelian violations, which is covered in another GitHub issue?

If the remaining question is covered in the other issue, we can close this one. Otherwise, please remind me what the current GPU-related issue is. Thank you!