pop-os / tensorman

Utility for easy management of Tensorflow containers
GNU General Public License v3.0

After upgrading Pop!_OS to 21.10, Tensorman can no longer access the GPU #34

Closed shawnmjones closed 2 years ago

shawnmjones commented 2 years ago

I have a workaround for this issue and it took some doing to get right. I felt the need to report this because Tensorman no longer works as expected out of the box. At a minimum, I hope reporting this issue might help someone else.

Hardware specs:

OS specs:

Description of problem: I upgraded from Pop!_OS 21.04 to 21.10 in December. Two weeks ago, I noticed that Pop!_Shop would not let me upgrade everything. It gave me the following error:

The following packages have unmet dependencies:
  nvidia-container-toolkit: Breaks: nvidia-container-runtime (<= 3.5.0-1) but 3.5.0-1~1626361786~21.04~b844140 is to be installed

I removed nvidia-container-toolkit and nvidia-container-runtime, and installed nvidia-container-toolkit and nvidia-docker2 in their place, as apt suggested. This got rid of the Pop!_Shop error.
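
For anyone following along, the package swap was roughly the following (a sketch; the exact package set and versions apt proposes may differ on your system):

# remove the packages with the broken dependency
sudo apt remove nvidia-container-toolkit nvidia-container-runtime
# install the replacements apt suggested
sudo apt install nvidia-container-toolkit nvidia-docker2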

This is my Tensorman test script:

#!/usr/bin/python3
import tensorflow as tf

print("just imported tensorflow")
print()

# needed per https://github.com/tensorflow/tensorflow/issues/42738
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
for device in gpu_devices:
    print("configuring device {} to set memory growth".format(device))
    tf.config.experimental.set_memory_growth(device, True)
# but this does not silence errors like:
# 2021-11-28 20:00:17.018124: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

print()
print("starting main script")

hello = tf.constant('Hello, TensorFlow!')
tf.print(hello)
tf.print('Using TensorFlow version: ' + tf.__version__)
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
tf.print(c)

I ran the Tensorman test script and got the following output:

# tensorman run --gpu python ./hello-world.py 
"docker" "run" "-u" "1000:1000" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/data/data1/smj-Unsynced-Projects/tensorman-learning:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu" "python" "./hello-world.py"
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

I reviewed this nvidia-docker issue and decided to change # no-cgroups = false to no-cgroups = true in /etc/nvidia-container-runtime/config.toml.
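
Concretely, the edit is just flipping that one flag (a sketch; on my system the setting lives under the [nvidia-container-cli] section, and I kept a backup first):

# keep a backup of the original config
sudo cp /etc/nvidia-container-runtime/config.toml /etc/nvidia-container-runtime/config.toml.bak
# uncomment the no-cgroups flag and flip it to true
sudo sed -i 's/^#\?\s*no-cgroups *= *false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml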

This got me closer.

# tensorman run --gpu python ./hello-world.py 
"docker" "run" "-u" "1000:1000" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/data/data1/smj-Unsynced-Projects/tensorman-learning:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu" "python" "./hello-world.py"
just imported tensorflow

2022-01-22 17:57:36.178216: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-01-22 17:57:36.178239: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 7f120b1f3deb
2022-01-22 17:57:36.178245: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 7f120b1f3deb
2022-01-22 17:57:36.178286: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 470.86.0
2022-01-22 17:57:36.178300: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.86.0
2022-01-22 17:57:36.178305: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 470.86.0

starting main script
2022-01-22 17:57:36.178509: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Hello, TensorFlow!
Using TensorFlow version: 2.7.0
[[22 28]
 [49 64]]

However, it still reports that no CUDA-capable device is detected.

From that same nvidia-docker issue, I found other docker commands that executed nvidia-smi in a container. I tried them on my Thelio and they did return information about the GPU from inside the container, so I surmised that Tensorman was not passing this device information through.
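
The sanity check was along these lines, running nvidia-smi in a plain Docker container without going through Tensorman (a sketch; the CUDA image tag is only an example, and the explicit --device flags are needed once no-cgroups is set to true):

docker run --rm --gpus all \
    --device /dev/nvidia0 --device /dev/nvidiactl \
    --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
    nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi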

Workaround: I placed a Tensorman.toml in the same directory as hello-world.py with the following content:

docker_flags = [ '--device', '/dev/nvidia0',  '--device', '/dev/nvidia-uvm', '--device', '/dev/nvidia-uvm-tools', '--device', '/dev/nvidiactl' ]

and now I get this output:

# tensorman run --gpu python ./hello-world.py
"docker" "run" "-u" "1000:1000" "--gpus=all" "-e" "HOME=/project" "--device" "/dev/nvidia0" "--device" "/dev/nvidia-uvm" "--device" "/dev/nvidia-uvm-tools" "--device" "/dev/nvidiactl" "-it" "--rm" "-v" "/data/data1/smj-Unsynced-Projects/tensorman-learning:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu" "python" "./hello-world.py"
just imported tensorflow

2022-01-22 19:03:13.135070: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-22 19:03:13.138300: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-22 19:03:13.138656: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
configuring device PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU') to set memory growth

starting main script
2022-01-22 19:03:13.139195: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-22 19:03:13.139929: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-22 19:03:13.140271: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-22 19:03:13.140607: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-22 19:03:13.436152: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-22 19:03:13.436505: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-22 19:03:13.436841: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-22 19:03:13.437150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10022 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3080 Ti, pci bus id: 0000:0a:00.0, compute capability: 8.6
Hello, TensorFlow!
Using TensorFlow version: 2.7.0
2022-01-22 19:03:13.772322: I tensorflow/stream_executor/cuda/cuda_blas.cc:1774] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
[[22 28]
 [49 64]]

In spite of the numerous "successful NUMA node read from SysFS had negative value (-1)" warnings, this is the same output I was getting before the upgrade, and it reports the expected information about the GPU.

I tried uninstalling and reinstalling tensorman and it did not fix the issue. I don't mind specifying this in Tensorman.toml, but it did not require this previously.

At a minimum, the documentation at https://support.system76.com/articles/tensorman/ should be updated for the new Nvidia packages.

Has anyone else noticed this? Is this the intended workaround going forward or do we need to fix the defaults to match what I've put in my Tensorman.toml?

mmstick commented 2 years ago

Running sudo kernelstub -a systemd.unified_cgroup_hierarchy=0 will revert to cgroups v1 and get the behavior nvidia-container-toolkit currently expects. That's the only change necessary at the moment, until NVIDIA releases 1.8.0 with the cgroupv2 fixes.
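
If you want to confirm which hierarchy is active after rebooting, checking the cgroup mount type should be enough (a sketch: cgroup2fs means the unified v2 hierarchy is still in use, tmpfs means you are back on cgroups v1):

# show the kernel command line that was actually booted
cat /proc/cmdline
# report the filesystem type mounted at /sys/fs/cgroup
stat -fc %T /sys/fs/cgroup/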

shawnmjones commented 2 years ago

Thanks for the insight! I put the config back and ran the kernelstub command that you recommended. That did the trick.

B3njimin commented 2 years ago

> Running sudo kernelstub -a systemd.unified_cgroup_hierarchy=0 will revert to cgroups v1 and get the behavior nvidia-container-toolkit currently expects. That's the only change necessary at the moment, until NVIDIA releases 1.8.0 with the cgroupv2 fixes.

Hello, I apologize, I am very naive, but I wonder at what stage I should run this command. I just tried to start a container with the command

tensorman run --gpu --python3 --jupyter bash

but I am faced with the errors Error response from daemon: failed to create shim: OCI runtime create failed: and nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

Apologies for my naivety; I'm pretty new to Linux and Docker.

B3njimin commented 2 years ago

Solved.

I am using the GRUB bootloader rather than kernelstub, because I am dual-booting Pop!_OS on a Razer laptop with an RTX 3070.

sudo nano /etc/default/grub
# add this line : GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"
sudo update-grub
sudo reboot

This solved it for me.