pop-os / tensorman

Utility for easy management of Tensorflow containers

Error: no capabilities with [[gpu]], but GPU is available #14

Closed: drscotthawley closed this issue 4 years ago

drscotthawley commented 4 years ago

Hi, I'm trying to invoke a basic tensorman run with the GPU as per the documentation, and Docker reports that it cannot select a device driver with GPU capabilities, even though nvidia-smi shows the GPU is available.

I'm running the latest Pop!_OS with all updates applied. Driver is nvidia-driver-440. Running in Hybrid Graphics mode.

~/Downloads/tensorman/examples$ tensorman run --gpu --python3 bash
"docker" "run" "-u" "1000:1000" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/home/shawley/Downloads/tensorman/examples:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu-py3" "bash"
Unable to find image 'tensorflow/tensorflow:latest-gpu-py3' locally
latest-gpu-py3: Pulling from tensorflow/tensorflow
7ddbc47eeb70: Pull complete 
c1bbdc448b72: Pull complete 
8c3b70e39044: Pull complete 
45d437916d57: Pull complete 
d8f1569ddae6: Pull complete 
85386706b020: Pull complete 
ee9b457b77d0: Pull complete 
bebfcc1316f7: Pull complete 
644140fd95a9: Pull complete 
d6c0f989e873: Pull complete 
7a8e64f26211: Pull complete 
c33b03e4dd22: Pull complete 
bca93af797c1: Pull complete 
47f6c197be35: Pull complete 
e5da48aa9554: Pull complete 
ca68d98a90c4: Pull complete 
Digest: sha256:1010e051dde4a9b62532a80f4a9a619013eafc78491542d5ef5da796cc2697ae
Status: Downloaded newer image for tensorflow/tensorflow:latest-gpu-py3
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

But a quick check shows that the GPU is available:

~/Downloads/tensorman/examples$ nvidia-smi
Thu Jan 23 22:01:11 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 207...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P8     6W /  N/A |     31MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1655      G   /usr/lib/xorg/Xorg                            14MiB |
|    0      2458      G   /usr/lib/xorg/Xorg                            14MiB |
+-----------------------------------------------------------------------------+

The same error occurs when I try to run the beginners example:

~/Downloads/tensorman/examples$ tensorman run --gpu --python3 python -- beginners/main.py
"docker" "run" "-u" "1000:1000" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/home/shawley/Downloads/tensorman/examples:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu-py3" "python" "beginners/main.py"
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

How should I resolve this? Thanks.
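
A quick way to narrow this down is to check which of the relevant tools are visible on the PATH. The sketch below is only illustrative: `report_tool` is a hypothetical helper, and `nvidia-container-runtime-hook` is assumed to be the binary installed by the nvidia-container-runtime package (the name may differ between package versions). A tool being found does not by itself prove that the Docker daemon has picked it up.

```shell
#!/bin/sh
# report_tool NAME: print whether NAME can be found on the current PATH.
# (Hypothetical helper for illustration; not part of tensorman or Docker.)
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# Tools involved in a GPU-enabled container run.
# The hook binary name is an assumption; adjust it to match your packages.
for tool in docker nvidia-smi nvidia-container-runtime-hook; do
    report_tool "$tool"
done
```

If everything is reported as found and the error persists, the daemon may simply have been started before the NVIDIA pieces were installed.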

drscotthawley commented 4 years ago

Just some extra checking to make sure everything is installed...
SPOILER: at the end of this, it works!


~/Downloads/tensorman/examples$ sudo apt update
Ign:1 http://linux.dropbox.com/ubuntu cosmic InRelease
Get:2 http://linux.dropbox.com/ubuntu cosmic Release [6,600 B]                                                        
Hit:3 http://us.archive.ubuntu.com/ubuntu eoan InRelease                                                              
Hit:4 https://brave-browser-apt-release.s3.brave.com stable InRelease                                                 
Hit:5 http://packages.microsoft.com/repos/vscode stable InRelease                                                     
Hit:6 http://apt.pop-os.org/proprietary eoan InRelease                                                                
Get:7 https://typora.io/linux ./ InRelease [793 B]                                                                    
Hit:8 http://us.archive.ubuntu.com/ubuntu eoan-security InRelease                                                     
Hit:9 http://ppa.launchpad.net/system76-dev/stable/ubuntu eoan InRelease                                              
Hit:10 http://us.archive.ubuntu.com/ubuntu eoan-updates InRelease                                                     
Hit:11 http://us.archive.ubuntu.com/ubuntu eoan-backports InRelease                                                   
Hit:13 http://ppa.launchpad.net/system76/pop/ubuntu eoan InRelease                                                    
Hit:12 https://packagecloud.io/slacktechnologies/slack/debian jessie InRelease
Fetched 7,393 B in 3s (2,271 B/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
All packages are up to date.

~/Downloads/tensorman/examples$ sudo apt upgrade
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

~/Downloads/tensorman/examples$ sudo apt install tensorman
Reading package lists... Done
Building dependency tree       
Reading state information... Done
tensorman is already the newest version (0.1.0~1576694038~19.10~b8da778).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

~/Downloads/tensorman/examples$ sudo apt install nvidia-container-runtime
Reading package lists... Done
Building dependency tree       
Reading state information... Done
nvidia-container-runtime is already the newest version (3.1.4-0pop1~1569270714~19.10~2ea45f8).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
~/Downloads/tensorman/examples$ sudo usermod -aG docker $USER
~/Downloads/tensorman/examples$ systemctl restart docker

~/Downloads/tensorman/examples$ tensorman run --gpu --python3 python -- beginners/main.py
....

...OK, now it works. Odd.
Closing.
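
For anyone landing here with the same error, the commands that preceded the working run can be collected as below. This is a recap of the log above, not a confirmed fix: the thread does not establish which step made the difference, and `thread_steps` is a hypothetical helper that only prints the commands without executing them.

```shell
#!/bin/sh
# thread_steps: print, in order, the commands that were run in this thread
# just before the GPU run succeeded. Nothing is executed here; review each
# line before running it yourself.
thread_steps() {
    printf '%s\n' \
        'sudo apt install nvidia-container-runtime' \
        'sudo usermod -aG docker $USER' \
        'systemctl restart docker' \
        'tensorman run --gpu --python3 python -- beginners/main.py'
}

thread_steps
```

Note that a group change made with `usermod -aG` only applies to new login sessions, so in the session above the daemon restart is the step most likely to have had an immediate effect.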