google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected #820

Closed kunmonster closed 3 months ago

kunmonster commented 4 months ago

Hi, when I run call_variants it raises this warning, which means it can't use the GPU, but I can confirm that TensorFlow can use the GPU. Here is the output showing both the warning and the GPU being detected:

```
tensorflow.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2024-05-12 21:36:00.744470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /device:GPU:0 with 15089 MB memory:  -> device: 0, name: Vega 20, pci bus id: 0000:26:00.0
True
  warnings.warn(
2024-05-12 21:43:29.067332: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
```
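As a side note for anyone debugging this class of error: the `cuInit` failure can be reproduced independently of TensorFlow by calling the CUDA driver API directly through `ctypes`. This is a generic diagnostic sketch (the function name is mine, not part of DeepVariant); error code 100 corresponds to `CUDA_ERROR_NO_DEVICE`.

```python
import ctypes
import ctypes.util


def probe_cuinit():
    """Call cuInit(0) directly and report what the CUDA driver says.

    Returns a short status string; a return code of 100 from cuInit
    means CUDA_ERROR_NO_DEVICE.
    """
    # libcuda is the driver API library that TensorFlow also loads.
    libname = ctypes.util.find_library("cuda")
    if libname is None:
        return "driver library (libcuda) not found"
    try:
        cuda = ctypes.CDLL(libname)
    except OSError:
        return "failed to load " + libname
    rc = cuda.cuInit(0)
    return "cuInit OK" if rc == 0 else "cuInit failed with code %d" % rc


print(probe_cuinit())
```

If this fails outside of any TensorFlow or Docker context, the problem lies with the driver or device visibility, not with DeepVariant itself.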
akolesnikov commented 4 months ago

Hi @kunmonster , Could you please provide a command line that you use to run DeepVariant?

kunmonster commented 4 months ago

> Hi @kunmonster , Could you please provide a command line that you use to run DeepVariant?

Sorry, the command line is at the top of the second picture. Actually, I run the Docker container in interactive mode and then run the call_variants command inside the container.

akolesnikov commented 4 months ago

Warning messages from TensorFlow can sometimes be misleading. Could you please try running call_variants while monitoring the GPU load, to make sure the GPU really isn't used? You can use `watch -n0.5 nvidia-smi` to check the GPU load in real time.

kunmonster commented 4 months ago

Happy to see the reply. Actually, I run this on an HPC system, and the GPU usage is very low while the CPU and memory usage is extremely high. I then ran the same command on my laptop with a GTX 1050 Ti and compared the time to predict one batch: the HPC took longer than my laptop, even though the HPC's GPU outperforms a GTX 1050 Ti. So the GPU doesn't work. I will post what you asked for later. Thanks!

kunmonster commented 4 months ago

Here are the screenshots of the GPU usage: image image

akolesnikov commented 4 months ago

Did you run tensorflow.test.is_gpu_available() from the DeepVariant docker?

Could you try the suggestion from this thread?

kunmonster commented 4 months ago

> Did you run tensorflow.test.is_gpu_available() from the DeepVariant docker?

Yes, I did. I posted the result in the first comment, which shows that in a Python shell the GPU can be identified by TensorFlow.

> Could you try the suggestion from this thread

Actually, I have tried setting CUDA_VISIBLE_DEVICES=0 in the system environment, but it didn't work. So I tried to find the place in your code where that environment variable is set, intending to set CUDA_VISIBLE_DEVICES=0 there, but I couldn't find it. So I turned to you for help.

I think the error may occur because the value of CUDA_VISIBLE_DEVICES used in your code doesn't match my device.
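For reference, the CUDA runtime resolves device visibility from `CUDA_VISIBLE_DEVICES` roughly as sketched below: an unset variable exposes all devices, while an empty value or an invalid first entry exposes none, which is exactly the `CUDA_ERROR_NO_DEVICE` situation. This is a simplified model for illustration (real CUDA also accepts GPU UUIDs), and the helper name is mine, not part of DeepVariant:

```python
def visible_devices(env_value, num_physical):
    """Rough model of how CUDA interprets CUDA_VISIBLE_DEVICES.

    - unset (None): all physical devices are visible
    - empty string or no valid leading entry: no devices visible
      (applications then see CUDA_ERROR_NO_DEVICE)
    - entries after an invalid index are ignored
    """
    if env_value is None:
        return list(range(num_physical))
    visible = []
    for token in env_value.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= num_physical:
            break  # an invalid entry hides everything after it
        visible.append(int(token))
    return visible


print(visible_devices(None, 2))   # [0, 1]
print(visible_devices("", 2))     # []
print(visible_devices("1,0", 2))  # [1, 0]
```

So setting `CUDA_VISIBLE_DEVICES=0` should expose the first GPU; if the driver still reports no device, the problem is below this layer (driver or hardware), not the environment variable.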

pichuan commented 3 months ago

Hi @kunmonster , if I understand correctly, we haven't been able to reproduce the issue on our side, therefore it has been difficult for us to help.

I know you've already run tensorflow.test.is_gpu_available(). Can you also try the following and see what you get?

```
sudo docker run --gpus all google/deepvariant:1.6.1-gpu python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

When I run this on my machine, I see:

```
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```

Given that your is_gpu_available was True, I think it will also list something. Just want to double-check here.

If there's any other information that you can provide, in order for us to reproduce on our side, please let us know.

kunmonster commented 3 months ago

Thanks @pichuan. I think this problem may be caused by the HPC system I am using: the GPUs on this HPC are actually not from NVIDIA; they are accelerators built specifically for deep learning within a specific framework, and I can't run Docker directly on the HPC. It doesn't matter; I will find another computer with an NVIDIA GPU to run this. Thanks.

louis-kento commented 1 month ago

```
I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-15 14:02:47.618984: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2024-08-15 14:02:47.619048: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2024-08-15 14:02:50.434353: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NOT_FOUND: named symbol not found
[]
```

I am encountering these errors while testing DeepVariant variant calling with GPU. It seems that some CUDA libraries are missing, causing it to run only on the CPU.
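The `dlerror` lines above name libraries the dynamic loader cannot find (`libcublas.so.12`, `libnvinfer_plugin.so.7`). As a quick check, one can ask the loader which of the libraries mentioned in the log are resolvable; this is a generic diagnostic sketch, not a DeepVariant command, and the function name is mine:

```python
from ctypes.util import find_library


def check_cuda_libs(names=("cublas", "nvinfer_plugin", "cuda")):
    """Return {library name: resolved soname or None} for the
    libraries mentioned in the TensorFlow warnings above."""
    return {name: find_library(name) for name in names}


for name, resolved in check_cuda_libs().items():
    print("%s: %s" % (name, resolved or "NOT FOUND on the loader path"))
```

If `cublas` is not found inside the container, the usual suspects are a host driver/CUDA mismatch or the container being started without `--gpus all` (so the NVIDIA libraries under `/usr/local/nvidia` are never mounted).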