GPU Issues with Tensorflow After Latest System Updates #197

ynusinovich commented 7 months ago

These instructions say that CUDA and cuDNN are already installed on my Adder WS with Pop!_OS 22.04 LTS: https://support.system76.com/articlesf/cuda/ I followed these instructions to install TensorFlow with GPU support: https://www.tensorflow.org/install/pip When I tried running TensorFlow with the GPU, I got the error below. This is new since last week, when everything was working. I did not change the Python environment; I only installed system updates through the Pop!_Shop:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2023-11-24 22:38:07.847161: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-24 22:38:07.869081: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-24 22:38:08.248433: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-11-24 22:38:08.541664: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-11-24 22:38:08.558329: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
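
The warning above does not name the libraries that failed to load. One way to narrow it down, sketched here under assumptions, is to try opening the usual CUDA libraries directly with Python's ctypes. The sonames in CANDIDATE_LIBS are illustrative guesses for a CUDA 12 / cuDNN 8 setup and are not taken from this log, so adjust them to your install. The probe uses the default loader search path, so libraries that live only inside pip wheels may be reported as failed even when TensorFlow itself can find them.

```python
import ctypes

# Hypothetical list of sonames; the exact names and versions depend on the
# TensorFlow build, so adjust them to match your own installation.
CANDIDATE_LIBS = [
    "libcuda.so.1",
    "libcudart.so.12",
    "libcublas.so.12",
    "libcudnn.so.8",
]


def probe_libraries(names):
    """Try to dlopen each library; map name -> None on success, else the error text."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = None
        except OSError as exc:
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    for name, error in probe_libraries(CANDIDATE_LIBS).items():
        status = "ok" if error is None else f"FAILED ({error})"
        print(f"{name}: {status}")
```

A failure for libcuda.so.1 would point at the driver install, while failures for the others would point at the CUDA toolkit or cuDNN libraries.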

My questions:

  1. How do I fix this error and get the GPU working with TensorFlow again?
  2. Should I stop doing system updates in the Pop!_Shop as they come up? The only thing that changed since TensorFlow last ran on the GPU is that I installed system updates, including nvidia-driver-545.

ynusinovich commented 7 months ago

Output of conda list, if it's helpful:

Tostino commented 7 months ago

545 also broke PyTorch multi-GPU. I'm at my wits' end with Pop; they have wasted a week of my life with their shitty driver packaging in the past 6 months.

The fact that I can't downgrade is really, really cool. Good job team.

mmstick commented 7 months ago

TF-TRT Warning: Could not find TensorRT


You are missing the TensorRT library path from your LD_LIBRARY_PATH. But you should be looking into switching over to Docker as these tools typically depend on specific versions of NVIDIA drivers and the CUDA toolkit.
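
A quick way to check this claim is to look for TensorRT's core library, libnvinfer, in each directory currently listed in LD_LIBRARY_PATH. The sketch below is a minimal example under assumptions: the helper name find_on_ld_library_path is hypothetical, and the glob pattern assumes the library files are named libnvinfer*.so*.

```python
import glob
import os


def find_on_ld_library_path(pattern, ld_library_path=None):
    """Return files matching `pattern` in each LD_LIBRARY_PATH directory."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    matches = []
    for directory in ld_library_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        # Escape the directory so glob metacharacters in it are taken literally.
        full_pattern = os.path.join(glob.escape(directory), pattern)
        matches.extend(sorted(glob.glob(full_pattern)))
    return matches


if __name__ == "__main__":
    hits = find_on_ld_library_path("libnvinfer*.so*")
    print("\n".join(hits) if hits else "no libnvinfer found on LD_LIBRARY_PATH")
```

If nothing is printed for a shell in which TensorFlow is started, the TensorRT directory is not on the path that process sees.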

@Tostino There are no issues with our packaging, and this issue has nothing to do with the NVIDIA driver. They're missing the TensorRT library path from their LD_LIBRARY_PATH.

NVIDIA provides the driver installer and we package that installer. You get precisely what NVIDIA has packaged in their installer. Our QA team tests every driver release, and that includes Tensorflow testing using Docker.

Tostino commented 7 months ago

I meant that the inability to downgrade, combined with automatic updates to newer drivers, has caused issues with the major ML packages.

Sorry if I piled on an unrelated issue, but there are widespread issues with the 545 drivers, and not being able to go back to 535 easily has been a serious hassle.

Similar issues with poor software support happened when 535 originally came out and I was upgraded to it. Unless I am missing something here, and downgrading works fine and I'm just doing something wrong, in which case, apologies, and please ignore me.

mmstick commented 7 months ago

nvidia-driver-535-server is an option, but we aren't testing these server packages from Ubuntu.

Tostino commented 7 months ago

I appreciate you mentioning that. I thought that "server" meant headless in this case, but it looks like I was wrong and these should work. I'll give it a shot.

ynusinovich commented 7 months ago

@mmstick Thank you for your help.

The page that I had sent you (https://www.tensorflow.org/install/pip) has updated its instructions. It now has a new first line with additional packages to install:

python3 -m pip install --extra-index-url https://pypi.nvidia.com tensorrt-bindings==8.6.1 tensorrt-libs==8.6.1
python3 -m pip install -U tensorflow[and-cuda]
# Verify the installation:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

And now TensorFlow works with the GPU.

FYI, TensorRT is still not found when I import TensorFlow, but I don't need it for this practice project. I tried adding export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/yannusinovich/anaconda3/envs/mlzc2/lib/python3.11/site-packages/tensorrt to .bashrc and setting it as an environment variable in my Jupyter notebook, but it still can't find TensorRT.
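
Two things may be worth checking here. First, the dynamic loader reads LD_LIBRARY_PATH when a process starts, so setting it inside an already running notebook kernel generally does not change where libraries are looked up; it needs to be exported before Python or Jupyter is launched. Second, the directory that actually holds the shared libraries may not be the tensorrt package directory itself. The sketch below, with a hypothetical helper name find_library_dirs, walks the environment's site-packages and reports which directories contain files named libnvinfer*.so*, assuming that is how the TensorRT libraries are named in your install.

```python
import os
import site


def find_library_dirs(roots, prefix="libnvinfer"):
    """Return sorted directories under `roots` that hold files named prefix*.so*."""
    found = set()
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            if any(f.startswith(prefix) and ".so" in f for f in filenames):
                found.add(dirpath)
    return sorted(found)


if __name__ == "__main__":
    dirs = find_library_dirs(site.getsitepackages())
    if dirs:
        # Run the printed line in the shell before launching Python or Jupyter.
        print("export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:" + ":".join(dirs))
    else:
        print("no libnvinfer shared libraries found under site-packages")
```

Run it with the Python interpreter of the conda environment in question so that site.getsitepackages() points at the right site-packages directory.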

I definitely concur with @Tostino that it would be nice if there were an option in the Pop!_Shop to quickly downgrade to a previous version of the NVIDIA drivers, for situations where an update temporarily breaks compatibility.