Closed: Riebart closed this issue 2 years ago.
@Riebart, thanks for this issue. Can you share with me the output of nvidia-smi for each one of the GPUs?
pynvml is an external library, so it may be good to send the details of this issue to gpuopenanalytics who owns pynvml: https://github.com/gpuopenanalytics/pynvml
This problem is already reported in two issues on pynvml: here and here.
Output of nvidia-smi on the A100:
# nvidia-smi
Mon Jul 19 20:02:06 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84 Driver Version: 460.84 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB Off | 00000000:0B:00.0 Off | Off |
| N/A 30C P0 34W / 250W | N/A | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 11 0 0 | 244MiB / 4864MiB | 14 N/A | 1 0 0 0 0 |
| | 4MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Output on the RTX3080 Mobile (I'll update this comment later and include it).
Awesome. I'll track this. As this is a pynvml issue, would you be able to remove that cell from your workflow, or would you prefer that we comment out or remove that cell and replace it with the standard !nvidia-smi? For the intro notebooks, this is more decoration.
For our use case (training and hands-on workshops), we can comment out/remove that cell, as we're doing other automated transformations on the notebooks to change sample counts to match the size of the MIG slices we're using anyway (since we usually don't have 16GB of VRAM per participant).
It might be worth commenting it out until pynvml fixes the issue to avoid confusing new users at the very beginning of the very first intro notebook.
Sorry for the late response here, but I'd like to raise a bit of a warning that the pynvml.smi module is probably not something you should be using. The NVML bindings are effectively kept up to date by the official NVML team, but the smi bindings are not. There is currently only one person who maintains pynvml.smi, and not on a regular basis. Therefore, my personal vote is actually to deprecate the module from this repository. You should be able to use NVML to get the information/metrics you need anyway. If enough community members want to maintain an smi Python API, then it may make more sense to do this in a separate project.
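To make that concrete, here is a minimal sketch (not from the original thread) of pulling the same details through the NVML bindings directly. It assumes a single non-MIG GPU at index 0; as the comments further down show, MIG instances additionally need a MIG-aware handle.

# Sketch: query device name, driver version, and total memory via NVML bindings
# instead of pynvml.smi. Assumes a non-MIG GPU visible at index 0.
import pynvml

pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()    # may be bytes on older pynvml versions
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)         # may also be bytes on older versions
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(name, "| driver", driver, "|", round(mem.total / 1024**3), "GB")
pynvml.nvmlShutdown()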
@rjzamora That makes sense to me.
It's important to note that the dask-cuda portions of Intro01 are also broken with MIG partitions, again because of pynvml-related issues. There's an open PR for that waiting for review, but it seems like MIG is breaking a lot of downstream projects due to what amount to namespace and scoping changes.
Thanks @rjzamora for the information. I'll refactor and update the affected notebooks. When I get the solution PRed, I'll reply back and close this issue. Thanks again @Riebart!
@Riebart , can you test my PR for Intro notebooks on your GPU? https://github.com/rapidsai-community/notebooks-contrib/pull/339. I don't have either GPU to test.
@taureandyernv Still no joy, but a different error this time. This is related to issues we've observed in other areas, such as dask-cuda (ref, and what I believe to be the related PR):
import pynvml
pynvml.nvmlInit()
gpu_mem = round(pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).total/1024**3)
print("your GPU has", gpu_mem, "GB")
---------------------------------------------------------------------------
NVMLError_NoPermission Traceback (most recent call last)
<ipython-input-5-2daaad25a9ae> in <module>
2 pynvml.nvmlInit()
3
----> 4 gpu_mem = round(pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).total/1024**3)
5 print("your GPU has", gpu_mem, "GB")
/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py in nvmlDeviceGetMemoryInfo(handle)
1982 fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
1983 ret = fn(handle, byref(c_memory))
-> 1984 _nvmlCheckReturn(ret)
1985 return c_memory
1986
/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py in _nvmlCheckReturn(ret)
741 def _nvmlCheckReturn(ret):
742 if (ret != NVML_SUCCESS):
--> 743 raise NVMLError(ret)
744 return ret
745
NVMLError_NoPermission: Insufficient Permissions
Aww man. Okay, I'll check that out Monday, unless it's P0. Does this NVMLError_NoPermission occur with the Ampere Mobile GPU or the MIG partitions? Can you try it with the Mobile if you haven't? I'll check with @pentschev about some of the subtleties of that PR that I might be missing. Can you send me your environment?
Just to clarify, https://github.com/rapidsai/dask-cuda/pull/674 has been merged and MIG devices should now be supported by Dask-CUDA. However, it's still not the most user-friendly interface; the only way to enable MIG devices at this time is to specify each MIG instance in CUDA_VISIBLE_DEVICES via their UUIDs, similar to what's shown in the GPU Instances doc. The UUIDs can be queried with nvidia-smi -L.
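As a hedged illustration of that workflow (not part of the original comment), with placeholder UUIDs that would need to be replaced by the values reported by nvidia-smi -L, and assuming LocalCUDACluster's CUDA_VISIBLE_DEVICES argument accepts MIG UUIDs as described in the PR above:

# Sketch only: start a Dask-CUDA cluster on two MIG instances by passing their
# UUIDs through CUDA_VISIBLE_DEVICES. The UUID strings below are placeholders.
from dask_cuda import LocalCUDACluster
from distributed import Client

mig_uuids = [
    "MIG-GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/11/0",  # from `nvidia-smi -L`
    "MIG-GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/12/0",
]

cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=",".join(mig_uuids))
client = Client(cluster)
print(client)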
With MIG, you can't use pynvml.nvmlDeviceGetHandleByIndex(0); this is the cause of the NVMLError_NoPermission error. The easiest way is to get the handle by its UUID instead, with pynvml.nvmlDeviceGetHandleByUUID(str.encode("MIG-GPU-...")). From that handle, you can then query things like memory, just as you would have done with a physical GPU device handle. You can also get the handle by specifying the device and MIG instance indices with nvmlDeviceGetMigDeviceHandleByIndex(device=0, index=0) (get the handle for MIG instance index=0 from physical GPU device=0).
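A small sketch combining both approaches (not part of the original comment; the UUID is a placeholder, and nvmlDeviceGetMigDeviceHandleByIndex is shown here taking the parent device handle rather than a bare index):

# Sketch: get a MIG device handle with pynvml, then query it like a normal GPU.
import pynvml

pynvml.nvmlInit()

# By UUID (placeholder value; take the real one from `nvidia-smi -L`):
mig = pynvml.nvmlDeviceGetHandleByUUID(str.encode("MIG-GPU-..."))

# Or by parent GPU handle plus MIG instance index:
parent = pynvml.nvmlDeviceGetHandleByIndex(0)
mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, 0)

# Query the MIG slice just as you would a physical device handle.
mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
print("this MIG slice has", round(mem.total / 1024**3), "GB")

pynvml.nvmlShutdown()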
Describe the bug
When using rapidsai/rapidsai:21.06-cuda11.2-runtime-ubuntu20.04-py3.8 on either an RTX 3080 Mobile or an A100 MIG partition and running the intro_01 notebook, the nvmlDeviceGetBrand() call invoked from the pynvml library returns an unknown device brand.

Steps/Code to reproduce bug
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 rapidsai/rapidsai:21.06-cuda11.2-runtime-ubuntu20.04-py3.8
Run the intro_01 notebook up to the cell that uses pynvml.smi.

Expected behavior
The .DeviceQuery() call succeeds without error.

Environment details (please complete the following information):
Docker container (rapidsai/rapidsai:21.06-cuda11.2-runtime-ubuntu20.04-py3.8) using nvidia-docker2 installed from the official repo as the runtime.
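For reference, a minimal sketch of the kind of pynvml.smi call the intro cell makes; the exact query string in the notebook may differ.

# Illustrative reproducer of the failing pattern: pynvml.smi's high-level
# DeviceQuery on a MIG partition or other device it does not recognize.
from pynvml.smi import nvidia_smi

nvsmi = nvidia_smi.getInstance()
print(nvsmi.DeviceQuery("memory.free, memory.total"))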