Closed: ZuyaoLiu closed this issue 11 months ago
I solved the first error (libcublas.so.12) by creating a sandbox with Singularity and adding the location of libcublas.so.12 to the environment. I guess creating a soft link with ln -s would also work.
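For anyone hitting the same libcublas.so.12 error, here is a minimal sketch of the workaround described above. The image tag, CUDA library path, and sandbox directory are assumptions and will differ between systems:

# Build a writable sandbox from the DeepVariant GPU image
# (image tag is an example; --fakeroot may be needed depending on your setup).
singularity build --sandbox deepvariant_sandbox docker://google/deepvariant:1.6.0-gpu

# Option 1: expose the host directory containing libcublas.so.12 and add it to
# the container's library path (the CUDA path below is an example).
export SINGULARITYENV_LD_LIBRARY_PATH=/usr/local/cuda-12/lib64
singularity exec --nv --bind /usr/local/cuda-12/lib64 deepvariant_sandbox nvidia-smi

# Option 2: copy (or soft-link, if the link target stays visible inside the
# container) the host library into a directory already on the container's
# default library path.
cp /usr/local/cuda-12/lib64/libcublas.so.12* deepvariant_sandbox/usr/local/lib/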
Thank you @ZuyaoLiu for the question and the solution. We'll give it a try and improve this.
One question for you @ZuyaoLiu , did you have any issues using Singularity + GPU to run variant calling as well, or are you seeing this issue with training only? (I'm curious because I've tried Singularity+GPU for variant calling, and that worked fine. But I haven't personally tried Singularity+GPU training yet. So I'll first want to see if I can reproduce that issue)
So far, I haven't tested the variant calling module. Version 1.6.0 was released when I was about to start the training step, so my SNP matrices and examples are from v1.5.0. But I will give it a try after finishing my training step and will let you know if it works.
Best, Zuyao
Hi @pichuan ,
When I ran the GPU version, I got these error messages, but it still finished SNP calling and all the output files seemed fine.
== CUDA ==
CUDA Version 11.3.1

2023-10-30 00:30:39.544727: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1278] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error
2023-10-30 00:30:40.075465: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1278] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error
2023-10-30 00:30:40.617831: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1278] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error
[the same message repeats roughly every 0.5 s, up to 2023-10-30 00:30:47.638081]
Do you have any idea?
Thank you
Hi @ZuyaoLiu
When I tested calling and training, I also saw that message. But in both my calling and training runs, the GPU was utilized.
We added an entry in FAQ: https://github.com/google/deepvariant/blob/r1.6/docs/FAQ.md#why-am-i-seeing-cuda_error_not_initialized-initialization-error-while-running-on-gpu
and I mentioned that message in https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-training-case-study.md#test-the-model as well.
@ZuyaoLiu , can you help check whether the results of calling are reasonable on your side, and whether the GPU is utilized or not?
And, similarly in the training case, some of the warning messages you have might not affect the results. Can you also check whether you can run through the steps (and whether the GPU is utilized or not; see the sketch below)?
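For reference, a minimal sketch of how GPU utilization can be monitored during a run (the refresh interval and queried fields below are arbitrary choices, not something DeepVariant requires):

# Watch GPU utilization and memory interactively while the job runs.
watch -n 2 nvidia-smi

# Or log utilization to a CSV file every 5 seconds for later inspection.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 5 > gpu_usage.csv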
Thank you!
Hi @pichuan ,
Yes, the error messages don't seem to affect the run, and the GPU was utilized. I will run another round of the training step and check the GPU utilization and the messages.
Also, I have two questions about the training step.
Do the individuals used for generating the truth dataset and for training have to be different? If not, could you please explain why you used different individuals in the first case?
Thank you!
Zuyao
Hi @ZuyaoLiu , In the training tutorial, we used one individual as an example. Our release models are trained on more. You can see https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-details-training-data.md for more information.
The training data we're using are from NIST Genome in a Bottle (GIAB): https://www.nist.gov/programs-projects/genome-bottle We train on data from HG001, HG002, HG004, HG005, HG006, and HG007, and leave HG003 out of training.
The "silver dataset" in the blog post you mentioned was used because there isn't a high-quality truth set for mosquitoes like the ones from GIAB for human.
Hope this helps.
Hi @pichuan ,
I finished testing with a trio from a non-model species and found a huge difference. Basically, under the default human model, v1.6 produces more sites violating Mendelian rules than v1.5. The post is here: #726.
Hi @ZuyaoLiu , I was out, so I didn't follow up in the past few weeks. Reading back through the previous comments, is my understanding correct that 1) the GPU was being used despite the messages, and 2) you have a separate question about Mendelian violations, which is in another GitHub issue?
If the remaining question is covered in another issue, we can close this issue. Otherwise, please remind me again what the current issue related to GPU is. Thank you!
Dear developers,
When trying to train on my own data with the latest v1.6.0, some error messages popped up; it seems like some necessary libraries are missing:
W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2023-10-25 17:00:55.064391: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/usr/local/lib/python3.8/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning:
TensorFlow Addons (TFA) has ended development and introduction of new features. TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024. Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP)
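As a first diagnostic step, one can check whether libcublas.so.12 exists on the host at all and, if so, where. This is only a sketch; the search paths are typical examples and actual CUDA install locations vary:

# Check whether libcublas.so.12 is known to the host's dynamic linker.
ldconfig -p | grep libcublas

# Fall back to searching common CUDA install locations (example paths only).
find /usr/local/cuda* /usr/lib /usr/lib64 -name 'libcublas.so.12*' 2>/dev/null

# Once located, the directory can be exposed to the container via
# SINGULARITYENV_LD_LIBRARY_PATH (plus --bind) or a soft link, as described
# in the sandbox workaround earlier in the thread.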
Then, when it finished, I got this error:
Saving model using saved_model format.
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. model.compile_metrics will be empty until you train or evaluate the model.
W1025 22:01:58.210216 140172092593984 saving_utils.py:359] Compiled the loaded model, but the compiled metrics have yet to be built. model.compile_metrics will be empty until you train or evaluate the model.
W1025 22:02:31.766536 140172092593984 save.py:271] Found untraced functions such as _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op while saving (showing 5 of 94). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: /home/train_new/checkpoints/ckpt-150/assets
I1025 22:02:39.405452 140172092593984 builder_impl.py:797] Assets written to: /home/train_new/checkpoints/ckpt-150/assets
WARNING:tensorflow:Detecting that an object or model or tf.train.Checkpoint is being deleted with unrestored values. See the following logs for the specific values in question. To silence these warnings, use status.expect_partial(). See https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint#restore for details about the status object returned by the restore function.
W1025 22:02:44.960290 140172092593984 checkpoint.py:205] Detecting that an object or model or tf.train.Checkpoint is being deleted with unrestored values. See the following logs for the specific values in question. To silence these warnings, use status.expect_partial(). See https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint#restore for details about the status object returned by the restore function.
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.iter
W1025 22:02:44.960591 140172092593984 checkpoint.py:214] Value in checkpoint could not be found in the restored object: (root).optimizer.iter
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.awg_optimizer.decay
W1025 22:02:44.960684 140172092593984 checkpoint.py:214] Value in checkpoint could not be found in the restored object: (root).optimizer.awg_optimizer.decay
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.awg_optimizer.momentum
W1025 22:02:44.960754 140172092593984 checkpoint.py:214] Value in checkpoint could not be found in the restored object: (root).optimizer.awg_optimizer.momentum
.....

In the final checkpoint folder, there is nothing in the assets folder.
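To verify whether the export itself is usable, here is a brief sketch (the checkpoint path is taken from the log above, and saved_model_cli is assumed to be available from the TensorFlow installation):

# List what was actually written for this checkpoint.
ls -R /home/train_new/checkpoints/ckpt-150

# An empty assets/ directory is typically not a problem: the graph and weights
# live in saved_model.pb and variables/. Inspect the exported signatures with:
saved_model_cli show --dir /home/train_new/checkpoints/ckpt-150 --all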
Thank you.