system76 / cuda

Packaging for NVIDIA's CUDA Toolkit

tensorflow-gpu warnings with cuda-11.2 #17

Open pkiri056 opened 3 years ago

pkiri056 commented 3 years ago

Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda-11.2/lib

Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/lib64

The warnings above appeared after installing the NVIDIA CUDA Toolkit with the following commands:

sudo apt install system76-cuda-latest
sudo apt install system76-cudnn-10.2
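Each warning names both the missing library and the directories that were searched, so one quick check is whether the file exists in any of those directories. The following is a minimal sketch of that check, not part of TensorFlow or the system76 packages. The helper names `parse_warning` and `find_library` are made up for illustration, and the sketch only looks at the `LD_LIBRARY_PATH` entries printed in the warning, whereas the dynamic loader also consults other locations.

```python
import os
import re

# Matches the warning format quoted in this issue (an assumption: other
# TensorFlow versions may word the message differently).
WARNING_RE = re.compile(
    r"Could not load dynamic library '(?P<lib>[^']+)'"
    r".*?LD_LIBRARY_PATH: (?P<path>\S+)"
)


def parse_warning(line):
    """Return (library_name, [search_dirs]), or None if the line does not match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    dirs = [d for d in match.group("path").split(":") if d]
    return match.group("lib"), dirs


def find_library(lib, dirs):
    """Return the subset of `dirs` that actually contains a file named `lib`."""
    return [d for d in dirs if os.path.exists(os.path.join(d, lib))]


warning = (
    "Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: "
    "cannot open shared object file: No such file or directory; "
    "LD_LIBRARY_PATH: /usr/lib/cuda/lib64"
)
lib, dirs = parse_warning(warning)
hits = find_library(lib, dirs)
print(lib, "found in:", hits if hits else "none of the listed directories")
```

An empty result suggests the package providing that exact library version is not installed in the searched directories, which matches the cuDNN mismatch discussed in the replies.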

Yiming-M commented 3 years ago

Why did you install an incompatible version of cuDNN? There is currently no cudnn-11.2 in the repository, so you might want to downgrade your CUDA version to 11.1 and install system76-cudnn-11.1.

Also, the newest stable version of TensorFlow is 2.4, which supports CUDA 11.0 (also not in the repository), so you may need to install that version from NVIDIA.

EDIT: I have tried TensorFlow 2.4 with system76-cuda-11.1 and system76-cudnn-11.1 on the Keras Simple MNIST Convnet example, and the code runs without error.

Epoch 13/15
422/422 [==============================] - 6s 14ms/step - loss: 0.0220 - accuracy: 0.9933 - val_loss: 0.0290 - val_accuracy: 0.9932
Epoch 14/15
422/422 [==============================] - 6s 15ms/step - loss: 0.0224 - accuracy: 0.9922 - val_loss: 0.0277 - val_accuracy: 0.9937
Epoch 15/15
422/422 [==============================] - 6s 14ms/step - loss: 0.0200 - accuracy: 0.9932 - val_loss: 0.0315 - val_accuracy: 0.9933

theofpa commented 3 years ago

CUDA 11.2 is now the latest version, and the problem persists because cudnn-11.2 is still not available:

$ dpkg -s system76-cuda-latest
Package: system76-cuda-latest
Status: install ok installed
Priority: optional
Section: metapackages
Installed-Size: 9
Maintainer: Michael Aaron Murphy <michael@system76.com>
Architecture: all
Multi-Arch: foreign
Version: 11.2~20.04
Depends: system76-cuda-11.2
Description: Metapackage for the latest version of the CUDA Toolkit
Homepage: https://developer.nvidia.com/cuda-downloads
$ dpkg -l|grep cud
ii  libcudart10.1:amd64                              10.1.243-3                                                amd64        NVIDIA CUDA Runtime Library
ii  system76-cuda                                    0pop1                                                     amd64        NVIDIA CUDA Compiler / Libraries / Toolkit Metapackage
ii  system76-cuda-11.1                               0pop1                                                     amd64        NVIDIA CUDA 11.1 Compiler / Libraries / Toolkit
ii  system76-cuda-11.2                               0pop1                                                     amd64        NVIDIA CUDA 11.2 Compiler / Libraries / Toolkit
ii  system76-cuda-latest                             11.2~20.04                                                all          Metapackage for the latest version of the CUDA Toolkit
ii  system76-cudnn-11.1                              8.0.4                                                     amd64        NVIDIA CUDA Deep Neural Network library (cuDNN) for CUDA 11.1
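The listing above shows two CUDA toolkit packages but only one cuDNN package. A small script can flag that kind of gap automatically. This is a rough sketch that assumes the `system76-cuda-X.Y` and `system76-cudnn-X.Y` naming seen in the listing. The function name `unmatched_cuda_versions` is hypothetical, and the sample text is abbreviated from the output above.

```python
import re

# Installed ("ii") packages named system76-cuda-X.Y or system76-cudnn-X.Y.
# Metapackages such as system76-cuda and system76-cuda-latest do not match.
PKG_RE = re.compile(r"^ii\s+system76-(cuda|cudnn)-(\d+\.\d+)\s")


def unmatched_cuda_versions(dpkg_output):
    """Return CUDA versions that have no cuDNN package with the same version."""
    cuda, cudnn = set(), set()
    for line in dpkg_output.splitlines():
        match = PKG_RE.match(line)
        if match:
            target = cuda if match.group(1) == "cuda" else cudnn
            target.add(match.group(2))
    # Sort numerically so that, for example, 9.2 would come before 10.1.
    return sorted(cuda - cudnn, key=lambda v: tuple(int(p) for p in v.split(".")))


sample = """\
ii  libcudart10.1:amd64   10.1.243-3  amd64  NVIDIA CUDA Runtime Library
ii  system76-cuda         0pop1       amd64  NVIDIA CUDA Compiler / Libraries / Toolkit Metapackage
ii  system76-cuda-11.1    0pop1       amd64  NVIDIA CUDA 11.1 Compiler / Libraries / Toolkit
ii  system76-cuda-11.2    0pop1       amd64  NVIDIA CUDA 11.2 Compiler / Libraries / Toolkit
ii  system76-cuda-latest  11.2~20.04  all    Metapackage for the latest version of the CUDA Toolkit
ii  system76-cudnn-11.1   8.0.4       amd64  NVIDIA CUDA Deep Neural Network library (cuDNN) for CUDA 11.1
"""
print(unmatched_cuda_versions(sample))  # prints ['11.2']
```

In practice the input could be the captured output of `dpkg -l`. Here 11.2 is reported because no cuDNN package with that version is installed.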

Removing the latest version fixes the error:

$ sudo apt-get remove system76-cuda-11.2
Python 3.8.8 (default, Apr 13 2021, 19:58:26) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-05-19 10:36:48.625315: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> tf.config.list_physical_devices('GPU')
2021-05-19 10:36:51.734239: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-05-19 10:36:51.762295: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 10:36:51.762755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:05:00.0 name: GeForce GTX 1050 Ti computeCapability: 6.1
coreClock: 1.4425GHz coreCount: 6 deviceMemorySize: 3.94GiB deviceMemoryBandwidth: 104.43GiB/s
2021-05-19 10:36:51.762797: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-05-19 10:36:51.766115: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-05-19 10:36:51.766185: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-05-19 10:36:51.767462: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-05-19 10:36:51.767767: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-05-19 10:36:51.771203: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-05-19 10:36:51.771963: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-05-19 10:36:51.772119: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-05-19 10:36:51.772257: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 10:36:51.772685: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-19 10:36:51.773004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
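To compare a working session like the one above against a failing one, it can help to pull the library names out of the log. The sketch below assumes the `Successfully opened dynamic library` message format shown in this log. The function names are illustrative and not part of TensorFlow.

```python
import re

OPENED_RE = re.compile(r"Successfully opened dynamic library (\S+)")


def opened_libraries(log_text):
    """Return the libraries reported as opened, in first-seen order, without duplicates."""
    seen = []
    for match in OPENED_RE.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


def not_opened(expected, log_text):
    """Return the entries of `expected` that the log never reports as opened."""
    opened = set(opened_libraries(log_text))
    return [name for name in expected if name not in opened]


log = """\
2021-05-19 10:36:51.771203: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-05-19 10:36:51.772119: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
"""
print(opened_libraries(log))
print(not_opened(["libcusolver.so.10", "libcudnn.so.8"], log))
```

With these two log lines, `libcusolver.so.10` from the original warning is reported as not opened, while `libcudnn.so.8` is present.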

Adapted the instructions in system76/docs#598