pop-os / nvidia-graphics-drivers

Pop!_OS NVIDIA Graphics Drivers

GPU Issues with Tensorflow After Latest System Updates #197

Closed ynusinovich closed 7 months ago

ynusinovich commented 7 months ago

These instructions say that CUDA and cuDNN are already installed on my Adder WS with Pop!_OS 22.04 LTS: https://support.system76.com/articlesf/cuda/

I followed these instructions to install TensorFlow with GPU support: https://www.tensorflow.org/install/pip

When I try to run TensorFlow with the GPU, I get the error message below. This is new since last week, when everything was working. I did not change the Python environment; I only installed system updates from the Pop!_Shop:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2023-11-24 22:38:07.847161: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-24 22:38:07.869081: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-24 22:38:08.248433: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-11-24 22:38:08.541664: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-11-24 22:38:08.558329: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
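The "Cannot dlopen some GPU libraries" warning in this log does not say which library failed to load. A minimal sketch of one way to narrow it down: try to load each library by soname and report the loader's error. The sonames listed are an assumption about what a CUDA 11 / cuDNN 8 build of TensorFlow looks for; they are not taken from this thread and should be adjusted to the TensorFlow version in use.

```python
import ctypes

# Assumed sonames (hypothetical for this setup); adjust to the CUDA/cuDNN
# versions that your TensorFlow build expects.
CANDIDATE_LIBS = [
    "libcuda.so.1",
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcufft.so.10",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def probe_libraries(names):
    """Return {soname: None if it loads, otherwise the loader's error text}."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # uses the same dynamic loader search as dlopen
            results[name] = None
        except OSError as exc:
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    for name, error in probe_libraries(CANDIDATE_LIBS).items():
        status = "ok" if error is None else f"not loadable ({error})"
        print(f"{name}: {status}")
```

Running it inside the same conda environment and shell that runs TensorFlow matters, since the search path can differ between environments.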

My questions:

  1. How do I fix this error and get the GPU working with TensorFlow again?
  2. Should I stop installing system updates from the Pop!_Shop as they come up? The only thing that changed since TensorFlow last ran on the GPU is that I installed system updates, including nvidia-driver-545.
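Since the driver update is the suspected change, it can help to record the driver version before and after each round of system updates. A minimal sketch, assuming the NVIDIA kernel module is loaded and exposes its banner at /proc/driver/nvidia/version; the function names are hypothetical and the parsing is a best-effort guess at the banner format:

```python
import re
from pathlib import Path

# Assumption: the loaded NVIDIA kernel module publishes its version banner here.
VERSION_FILE = Path("/proc/driver/nvidia/version")
_VERSION_RE = re.compile(r"\b(\d+\.\d+(?:\.\d+)*)\b")


def parse_driver_version(text):
    """Return the first dotted version on the 'NVRM version' line, or None."""
    for line in text.splitlines():
        if line.startswith("NVRM version"):
            match = _VERSION_RE.search(line)
            return match.group(1) if match else None
    return None


def read_driver_version(path=VERSION_FILE):
    """Read and parse the banner; return None if the file is not readable."""
    try:
        return parse_driver_version(Path(path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("NVIDIA driver:", read_driver_version() or "not detected")
```

Logging this value next to the date of each update makes it easier to tell whether a breakage lines up with a driver change or with something else.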

ynusinovich commented 7 months ago

Output of conda list, if it's helpful:

# packages in environment at /home/yannusinovich/anaconda3/envs/mlzc2:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
absl-py                   2.0.0                    pypi_0    pypi
asttokens                 2.0.5              pyhd3eb1b0_0  
astunparse                1.6.3                    pypi_0    pypi
backcall                  0.2.0              pyhd3eb1b0_0  
blas                      1.0                         mkl  
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2023.08.22           h06a4308_0  
cachetools                5.3.2                    pypi_0    pypi
certifi                   2023.11.17               pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
comm                      0.1.2           py311h06a4308_0  
debugpy                   1.6.7           py311h6a678d5_0  
decorator                 5.1.1              pyhd3eb1b0_0  
executing                 0.8.3              pyhd3eb1b0_0  
flatbuffers               23.5.26                  pypi_0    pypi
freetype                  2.12.1               h4a9f257_0  
gast                      0.4.0                    pypi_0    pypi
giflib                    5.2.1                h5eee18b_3  
google-auth               2.23.4                   pypi_0    pypi
google-auth-oauthlib      1.0.0                    pypi_0    pypi
google-pasta              0.2.0                    pypi_0    pypi
grpcio                    1.59.3                   pypi_0    pypi
h5py                      3.10.0                   pypi_0    pypi
idna                      3.5                      pypi_0    pypi
intel-openmp              2023.1.0         hdb19cb5_46306  
ipykernel                 6.25.0          py311h92b7b1e_0  
ipython                   8.15.0          py311h06a4308_0  
jedi                      0.18.1          py311h06a4308_1  
jpeg                      9e                   h5eee18b_1  
jupyter_client            8.6.0           py311h06a4308_0  
jupyter_core              5.5.0           py311h06a4308_0  
keras                     2.13.1                   pypi_0    pypi
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
lerc                      3.0                  h295c915_0  
libclang                  16.0.6                   pypi_0    pypi
libdeflate                1.17                 h5eee18b_1  
libffi                    3.4.4                h6a678d5_0  
libgcc-ng                 11.2.0               h1234567_1  
libgfortran-ng            11.2.0               h00389a5_1  
libgfortran5              11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libpng                    1.6.39               h5eee18b_0  
libsodium                 1.0.18               h7b6447c_0  
libstdcxx-ng              11.2.0               h1234567_1  
libtiff                   4.5.1                h6a678d5_0  
libuuid                   1.41.5               h5eee18b_0  
libwebp                   1.3.2                h11a3e52_0  
libwebp-base              1.3.2                h5eee18b_0  
lz4-c                     1.9.4                h6a678d5_0  
markdown                  3.5.1                    pypi_0    pypi
markupsafe                2.1.3                    pypi_0    pypi
matplotlib-inline         0.1.6           py311h06a4308_0  
mkl                       2023.1.0         h213fc3f_46344  
mkl-service               2.4.0           py311h5eee18b_1  
mkl_fft                   1.3.8           py311h5eee18b_0  
mkl_random                1.2.4           py311hdb19cb5_0  
ncurses                   6.4                  h6a678d5_0  
nest-asyncio              1.5.6           py311h06a4308_0  
numpy                     1.24.3                   pypi_0    pypi
oauthlib                  3.2.2                    pypi_0    pypi
openjpeg                  2.4.0                h3ad879b_0  
openssl                   3.0.12               h7f8727e_0  
opt-einsum                3.3.0                    pypi_0    pypi
packaging                 23.1            py311h06a4308_0  
parso                     0.8.3              pyhd3eb1b0_0  
pexpect                   4.8.0              pyhd3eb1b0_3  
pickleshare               0.7.5           pyhd3eb1b0_1003  
pillow                    10.0.1          py311ha6cbd5a_0  
pip                       23.3.1          py311h06a4308_0  
platformdirs              3.10.0          py311h06a4308_0  
prompt-toolkit            3.0.36          py311h06a4308_0  
protobuf                  4.25.1                   pypi_0    pypi
psutil                    5.9.0           py311h5eee18b_0  
ptyprocess                0.7.0              pyhd3eb1b0_2  
pure_eval                 0.2.2              pyhd3eb1b0_0  
pyasn1                    0.5.1                    pypi_0    pypi
pyasn1-modules            0.3.0                    pypi_0    pypi
pygments                  2.15.1          py311h06a4308_1  
python                    3.11.5               h955ad1f_0  
python-dateutil           2.8.2              pyhd3eb1b0_0  
pyzmq                     25.1.0          py311h6a678d5_0  
readline                  8.2                  h5eee18b_0  
requests                  2.31.0                   pypi_0    pypi
requests-oauthlib         1.3.1                    pypi_0    pypi
rsa                       4.9                      pypi_0    pypi
scipy                     1.11.3          py311h08b1b3b_0  
setuptools                68.0.0          py311h06a4308_0  
six                       1.16.0             pyhd3eb1b0_1  
sqlite                    3.41.2               h5eee18b_0  
stack_data                0.2.0              pyhd3eb1b0_0  
tbb                       2021.8.0             hdb19cb5_0  
tensorboard               2.13.0                   pypi_0    pypi
tensorboard-data-server   0.7.2                    pypi_0    pypi
tensorflow                2.13.1                   pypi_0    pypi
tensorflow-estimator      2.13.0                   pypi_0    pypi
tensorflow-io-gcs-filesystem 0.34.0                   pypi_0    pypi
termcolor                 2.3.0                    pypi_0    pypi
tflite-runtime            2.14.0                   pypi_0    pypi
tk                        8.6.12               h1ccaba5_0  
tornado                   6.3.3           py311h5eee18b_0  
traitlets                 5.7.1           py311h06a4308_0  
typing-extensions         4.5.0                    pypi_0    pypi
tzdata                    2023c                h04d1e81_0  
urllib3                   2.1.0                    pypi_0    pypi
wcwidth                   0.2.5              pyhd3eb1b0_0  
werkzeug                  3.0.1                    pypi_0    pypi
wheel                     0.41.2          py311h06a4308_0  
wrapt                     1.16.0                   pypi_0    pypi
xz                        5.4.2                h5eee18b_0  
zeromq                    4.3.4                h2531618_0  
zlib                      1.2.13               h5eee18b_0  
zstd                      1.5.5                hc292b87_0  

Tostino commented 7 months ago

545 also broke PyTorch multi-GPU. I'm at my wits' end with Pop!_OS; they have wasted a week of my life with their shitty driver packaging in the past 6 months.

The fact that I can't downgrade is really, really cool. Good job team.

mmstick commented 7 months ago

TF-TRT Warning: Could not find TensorRT

https://discuss.tensorflow.org/t/unable-to-get-tensorflow-working-correctly/18981/2

You are missing the TensorRT library path from your LD_LIBRARY_PATH. You should also look into switching to Docker, since these tools typically depend on specific versions of the NVIDIA driver and the CUDA toolkit.
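One way to check this is to look at each LD_LIBRARY_PATH entry for TensorRT's core library, libnvinfer. A minimal sketch; the helper name and the default file pattern are assumptions for illustration, and the exact soname TensorFlow asks for depends on the build:

```python
import glob
import os


def dirs_with_library(search_path, pattern="libnvinfer.so*"):
    """Return the entries of a path-separator-delimited search path that
    contain at least one file matching pattern."""
    hits = []
    for entry in search_path.split(os.pathsep):
        if entry and glob.glob(os.path.join(entry, pattern)):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    hits = dirs_with_library(current)
    print("LD_LIBRARY_PATH entries containing libnvinfer:", hits or "none")
```

An empty result suggests that no directory on the path holds the library, which would be consistent with the TF-TRT warning.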

@Tostino There are no issues with our packaging, and this issue has nothing to do with the NVIDIA driver. They're missing the TensorRT library path from their LD_LIBRARY_PATH.

NVIDIA provides the driver installer and we package that installer. You get precisely what NVIDIA has packaged in their installer. Our QA team tests every driver release, and that includes TensorFlow testing using Docker.

Tostino commented 7 months ago

I meant that the inability to downgrade, combined with automatic updates to newer drivers, has caused issues with the major ML packages.

Sorry if I piled on an unrelated issue, but there are widespread issues with the 545 drivers, and not being able to go back to 535 easily has been a serious hassle.

Similar problems with poor software support happened when 535 first came out and I was upgraded to it. Unless I'm just missing something here, downgrading works fine, and I'm doing something wrong, in which case, apologies, and please ignore me.

mmstick commented 7 months ago

nvidia-driver-535-server is an option, but we aren't testing these server packages from Ubuntu.

Tostino commented 7 months ago

Appreciate you mentioning that. I thought that "server" meant headless in this case, but it looks like I was wrong and these should work. Will give it a shot.

ynusinovich commented 7 months ago

@mmstick Thank you for your help.

The page I linked (https://www.tensorflow.org/install/pip) has updated its instructions. It now has a new first line with additional packages to install:

python3 -m pip install --extra-index-url https://pypi.nvidia.com tensorrt-bindings==8.6.1 tensorrt-libs==8.6.1
python3 -m pip install -U tensorflow[and-cuda]
# Verify the installation:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

And now TensorFlow works with the GPU.
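To confirm what those commands actually installed into the environment, the versions can be read from package metadata without importing TensorFlow. A minimal sketch; the distribution names come from the install commands above, and the helper name is hypothetical:

```python
from importlib import metadata

# Distribution names taken from the pip commands in this comment.
PACKAGES = ["tensorflow", "tensorrt-bindings", "tensorrt-libs"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Keeping this output alongside the driver version gives a record of a known-good combination to return to if a later update breaks GPU detection.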

FYI, TensorRT is still not found when I import TensorFlow, but I don't need it for this practice project. I tried adding export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/yannusinovich/anaconda3/envs/mlzc2/lib/python3.11/site-packages/tensorrt to .bashrc and setting it as an environment variable in my Jupyter notebook, but TensorFlow still can't find TensorRT.
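One possible explanation, offered as an assumption rather than a diagnosis, is that the shared libraries installed by the pip packages live in a different directory from the one added to LD_LIBRARY_PATH. A minimal sketch that walks the environment's site-packages and lists every directory holding a libnvinfer shared object; the helper name is hypothetical:

```python
import os
import sysconfig


def find_shared_libs(root, prefix="libnvinfer"):
    """Walk root and return the sorted directories that contain a file whose
    name starts with prefix and includes '.so'."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(f.startswith(prefix) and ".so" in f for f in filenames):
            found.add(dirpath)
    return sorted(found)


if __name__ == "__main__":
    # site-packages of the interpreter running this script
    site_packages = sysconfig.get_paths()["purelib"]
    for directory in find_shared_libs(site_packages):
        print(directory)
```

Any directory printed is a candidate to add to LD_LIBRARY_PATH. Note also that the dynamic loader generally reads LD_LIBRARY_PATH when a process starts, so setting it from inside an already-running notebook kernel is unlikely to take effect; it would need to be set before Jupyter is launched.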

I definitely concur with @Tostino that it would be nice if the Pop!_Shop had an option to quickly downgrade to a previous version of the NVIDIA drivers, for situations where an update temporarily breaks compatibility.