Not running with GPU - GitHub issues

YunjieChang commented 2 years ago

Hi Tim,

I just installed cryoCARE on our HPC following the installation procedure "For CUDA 10" and did not encounter any errors during the installation.

However, I got the following message when I tried to run the training process (cryoCARE_train.py --conf train_config.json):

================================
2022-08-31 11:33:43.111390: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
0 1
1 72
2 72
3 72
4 1
2022-08-31 11:33:43.730687: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-08-31 11:33:43.731272: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2200000000 Hz
=================================

This output suggests that cryoCARE is training on the CPU instead of the GPU, so it is quite slow. My tomogram size is 672 x 672 x 200.

Any idea about this issue? Thanks! Yunjie

tibuch commented 1 year ago

Hi Yunjie,

Does TensorFlow see the GPU on the cluster node where you are running the training? I would recommend starting an interactive cluster session and checking whether the GPU is available with nvidia-smi. Then check that the installed CUDA version is compatible with your TensorFlow installation. Finally, I would run the TensorFlow installation verification code from their install instructions (https://www.tensorflow.org/install/pip):

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
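The three checks above can be bundled into one small script. This is only a sketch, not part of cryoCARE: the function name `gpu_report` and the report keys are made up for illustration, and it assumes `nvidia-smi` is on the PATH when a driver is installed. It does not check CUDA/TensorFlow version compatibility; that still has to be compared by hand against the TensorFlow install instructions.

```python
import shutil
import subprocess


def gpu_report():
    """Collect basic facts about GPU visibility on the current node.

    Hypothetical helper for illustration; the keys are not a cryoCARE API.
    """
    report = {"nvidia_smi_found": shutil.which("nvidia-smi") is not None}

    if report["nvidia_smi_found"]:
        # `nvidia-smi -L` prints one line per GPU the driver can see.
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True
        )
        report["driver_gpus"] = result.stdout.strip().splitlines()

    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        report["tensorflow"] = None
        return report

    report["tensorflow"] = tf.__version__
    # False here means a CPU-only TensorFlow build was installed.
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    # Same call as the one-liner above; an empty list means no usable GPU.
    report["tf_gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `driver_gpus` lists devices but `tf_gpus` is empty, the driver sees the card and the problem is more likely on the TensorFlow/CUDA library side.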

Cheers!

tailinhua16 commented 1 year ago

Hi Tim, I've encountered a similar issue where cryoCARE doesn't use the GPU. I'm using a workstation instead of a cluster. When I ran the verification code you mentioned, the output was:

2023-08-23 00:26:26.044988: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-08-23 00:26:27.342214: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2023-08-23 00:26:27.343332: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-08-23 00:26:27.374010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:1a:00.0 name: Quadro RTX 5000 computeCapability: 7.5 coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.374707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: pciBusID: 0000:1b:00.0 name: Quadro RTX 5000 computeCapability: 7.5 coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.375389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties: pciBusID: 0000:3d:00.0 name: Quadro RTX 5000 computeCapability: 7.5 coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.376012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties: pciBusID: 0000:3e:00.0 name: Quadro RTX 5000 computeCapability: 7.5 coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.376638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 4 with properties: pciBusID: 0000:88:00.0 name: Quadro RTX 5000 computeCapability: 7.5 coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.377264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 5 with properties: pciBusID: 0000:89:00.0 name: Quadro RTX 5000 computeCapability: 7.5 coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.377868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 6 with properties: pciBusID: 0000:b1:00.0 name: Quadro RTX 5000 computeCapability: 7.5 coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.378519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 7 with properties: pciBusID: 0000:b2:00.0 name: Quadro RTX 5000 computeCapability: 7.5 coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.378561: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-08-23 00:26:27.382425: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: /home/linhua/Programs/anaconda3/envs/cryocare_11/bin/../lib/libcublas.so.11: symbol free_gemm_select, version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-9.1/lib64:/usr/local/cuda-8.0/lib64:/usr/local/cuda/lib64:/usr/local/cuda-11.8/lib64:/opt/OpenMPI/lib:/opt/OpenMPI/lib::/usr/local/cuda-10.0/lib64
2023-08-23 00:26:27.384977: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-08-23 00:26:27.386268: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-08-23 00:26:27.386513: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-08-23 00:26:27.389644: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-08-23 00:26:27.390326: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2023-08-23 00:26:27.390447: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-08-23 00:26:27.390471: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
[]

I'm using cryocare_11. Any idea how to solve this problem? Thank you very much in advance! Yours, Linhua Tai
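In a log this long, the two lines that matter are easy to miss: the `Could not load dynamic library` warning and the `LD_LIBRARY_PATH` it reports. A minimal sketch for pulling them out of a saved log follows. The helper names `failed_libraries` and `cuda_dirs` are hypothetical, and the sample strings are shortened excerpts of the output above. It only summarizes what the log says; it is not a confirmed diagnosis of this issue.

```python
import re


def failed_libraries(log_text):
    """Return the library names TensorFlow reported it could not load."""
    return re.findall(r"Could not load dynamic library '([^']+)'", log_text)


def cuda_dirs(ld_library_path):
    """Return the distinct CUDA directories on a library path, in search order."""
    seen = []
    for entry in ld_library_path.split(":"):
        # Empty entries (from "::") and non-CUDA directories are skipped.
        if "cuda" in entry and entry not in seen:
            seen.append(entry)
    return seen


# Shortened excerpts from the log posted above.
sample_log = (
    "W tensorflow/stream_executor/platform/default/dso_loader.cc:60] "
    "Could not load dynamic library 'libcublas.so.11'; dlerror: ...\n"
    "I tensorflow/stream_executor/platform/default/dso_loader.cc:49] "
    "Successfully opened dynamic library libcublasLt.so.11\n"
)
sample_path = (
    "/usr/local/cuda/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-9.1/lib64:"
    "/usr/local/cuda-8.0/lib64:/usr/local/cuda/lib64:/usr/local/cuda-11.8/lib64:"
    "/opt/OpenMPI/lib:/opt/OpenMPI/lib::/usr/local/cuda-10.0/lib64"
)

print(failed_libraries(sample_log))
for directory in cuda_dirs(sample_path):
    print(directory)
```

On the posted log this reports `libcublas.so.11` as the only library that failed, and five distinct CUDA directories on the search path (unversioned, 10.0, 9.1, 8.0 and 11.8). Several toolkit versions on `LD_LIBRARY_PATH` alongside the conda environment's own libraries is one plausible source of the symbol mismatch in the dlerror message, though the thread does not confirm it.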

juglab / cryoCARE_pip

Not running with GPU #25