osrf / rocker

A tool to run docker containers with overlays and convenient options for things like GUIs etc.
Apache License 2.0

Set default NVIDIA_DRIVER_CAPABILITIES if it's not set #182

Closed kenji-miyake closed 2 years ago

kenji-miyake commented 2 years ago

According to the documentation, the default value of NVIDIA_DRIVER_CAPABILITIES is compute,utility.

Therefore, nvidia-smi works even in images that are not based on an NVIDIA image, such as ubuntu:22.04.

$ docker run --rm -it --gpus all ubuntu:22.04 nvidia-smi
Mon Jul  4 10:54:48 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 30%   41C    P0    70W / 290W |   1859MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

However, the same command fails under rocker, because rocker sets NVIDIA_DRIVER_CAPABILITIES to graphics alone when the variable is empty, which drops the utility capability that nvidia-smi needs. Our related issue: https://github.com/autowarefoundation/autoware/issues/2452

$ rocker --nvidia --x11 --user ubuntu:22.04
kenji@387c3886ec2d:~$ nvidia-smi
bash: nvidia-smi: command not found
kenji@85ca1235d989:~$ echo $NVIDIA_DRIVER_CAPABILITIES
graphics

This PR fixes the behavior.
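
As a rough illustration of the intended behavior (not the actual diff), the fallback could be sketched as below. The function name `with_graphics` and the choice to leave `all` untouched are assumptions made for this example; the only facts taken from this description are the documented default `compute,utility` and the resulting value `compute,utility,graphics`.

```python
# Documented default quoted at the top of this description.
DEFAULT_CAPABILITIES = "compute,utility"


def with_graphics(current):
    """Return a capability list that includes graphics.

    `current` is the value of NVIDIA_DRIVER_CAPABILITIES, or None if unset.
    Hypothetical helper for illustration only.
    """
    # Fall back to the documented default when the variable is unset or empty.
    base = current if current else DEFAULT_CAPABILITIES
    caps = [c for c in base.split(",") if c]
    # Assumption: `all` already covers graphics, so nothing is appended to it.
    if "graphics" not in caps and "all" not in caps:
        caps.append("graphics")
    return ",".join(caps)


print(with_graphics(None))       # compute,utility,graphics
print(with_graphics(""))         # compute,utility,graphics
print(with_graphics("compute"))  # compute,graphics
```

With the variable unset, this yields the same value that the session below prints.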

$ rocker --nvidia --x11 --user ubuntu:22.04
kenji@c6979daf4bad:~$ nvidia-smi
Mon Jul  4 20:01:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 30%   41C    P0    70W / 290W |   1874MiB /  8192MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
kenji@c6979daf4bad:~$ echo $NVIDIA_DRIVER_CAPABILITIES
compute,utility,graphics

kenji-miyake commented 2 years ago

@tfoote Hello, thank you for developing this tool. Could you take a look at this PR? :pray:

kenji-miyake commented 2 years ago

Alternatively, I had a thought that maybe we should just pass all by default? It's going to load more of the driver, but I don't see that as having a significant downside, and the user could still reduce the scope by setting it manually.

@tfoote Although I'm not very familiar with the CUDA specs, I personally think specifying all is acceptable given how rocker is used. In that case, though, we should set all only when the variable is empty rather than appending it, because a mixed value such as compute,utility,all is not valid.
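
To make the constraint concrete, here is a minimal sketch of a check for that one rule. `is_valid_capabilities` is a hypothetical name, not part of rocker or the NVIDIA tooling, and it only tests whether `all` is mixed with other names; it does not verify that the individual capability names are known.

```python
def is_valid_capabilities(value):
    """Return False when `all` is combined with other capability names.

    Hypothetical helper: it encodes only the rule discussed in this comment.
    """
    caps = [c.strip() for c in value.split(",") if c.strip()]
    # `all` is acceptable only when it is the single entry in the list.
    return "all" not in caps or len(caps) == 1


print(is_valid_capabilities("compute,utility,graphics"))  # True
print(is_valid_capabilities("compute,utility,all"))       # False
print(is_valid_capabilities("all"))                       # True
```

The two docker runs that follow show the same two cases against the real runtime.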

$ docker run --rm -it --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics ubuntu:22.04
root@f4f80c3232f8:/# exit

$ docker run --rm -it --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,all ubuntu:22.04
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
unsupported capabilities found in 'compute,utility,all' (allowed 'compute,utility'): unknown.

tfoote commented 2 years ago

Yeah, I think we can then simplify the logic to just set all if it's not previously set. Otherwise it will get whatever is set in the environment.
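
A minimal sketch of that simplified logic, assuming the value is looked up in a plain mapping of environment variables. `resolve_capabilities` is a hypothetical name, and treating an empty string the same as an unset variable is an assumption taken from the earlier "only when it's empty" comment.

```python
import os


def resolve_capabilities(environ=None):
    """Pick the NVIDIA_DRIVER_CAPABILITIES value to use.

    Hypothetical helper: returns the existing value unchanged when one is
    set, and falls back to `all` when it is unset or empty.
    """
    env = os.environ if environ is None else environ
    value = env.get("NVIDIA_DRIVER_CAPABILITIES", "")
    # Never append to an existing value, so a mixed list like
    # compute,utility,all can not be produced here.
    return value if value else "all"


print(resolve_capabilities({}))                                                # all
print(resolve_capabilities({"NVIDIA_DRIVER_CAPABILITIES": "compute,utility"}))  # compute,utility
```

Because the existing value is passed through untouched, users can still narrow the scope by setting the variable themselves.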