rapidsai-community / notebooks-contrib

RAPIDS Community Notebooks

[BUG] pynvml.smi.DeviceQuery() errors when run in the Intro01 demo notebook due to bad device brand (10) returned #338

Closed: Riebart closed this issue 2 years ago

Riebart commented 3 years ago

Describe the bug When using rapidsai/rapidsai:21.06-cuda11.2-runtime-ubuntu20.04-py3.8 on either an RTX 3080 Mobile or an A100 MIG partition, running the intro_01 notebook fails because the nvmlDeviceGetBrand() call invoked from the pynvml library returns an unknown device brand (10).

[screenshot: traceback from pynvml.smi.DeviceQuery() showing the unknown device brand error]

Steps/Code to reproduce bug
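A minimal sketch of the failing call, assuming the notebook uses the standard pynvml.smi entry point (the exact cell in intro_01 may differ):

from pynvml.smi import nvidia_smi

# Query device info the way the notebook does; on the RTX 3080 Mobile / A100 MIG
# partition this raises because the brand id returned by nvmlDeviceGetBrand()
# is not in pynvml.smi's lookup table
nvsmi = nvidia_smi.getInstance()
print(nvsmi.DeviceQuery())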

Expected behavior The .DeviceQuery() succeeds without error.

Environment details (please complete the following information):

taureandyernv commented 3 years ago

@Riebart , thanks for this issue. Can you share with me the output of nvidia-smi for each one of the GPUs?

pynvml is an external library, so it may be good to send the details of this issue to gpuopenanalytics, which owns pynvml: https://github.com/gpuopenanalytics/pynvml

Riebart commented 3 years ago

This problem is already reported in two issues on pynvml: here and here.

Output of nvidia-smi on the A100:

# nvidia-smi
Mon Jul 19 20:02:06 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:0B:00.0 Off |                  Off |
| N/A   30C    P0    34W / 250W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0   11   0   0  |    244MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      4MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Output of nvidia-smi on the RTX 3080 Mobile to follow (I'll update this comment later to include it).

taureandyernv commented 3 years ago

Awesome. I'll track this. As this is a pynvml issue, would you be able to remove that cell from your workflow, or would you prefer that we comment it out or remove it and replace it with the standard !nvidia-smi? For the intro notebooks, this cell is mostly decoration.

Riebart commented 3 years ago

For our use case (training and hands-on workshops), we can comment out/remove that cell, as we're doing other automated transformations on the notebooks to change sample counts to match the size of the MIG slices we're using anyway (since we usually don't have 16GB of VRAM per participant).

It might be worth commenting it out until pynvml fixes the issue to avoid confusing new users at the very beginning of the very first intro notebook.

rjzamora commented 3 years ago

It might be worth commenting it out until pynvml fixes the issue to avoid confusing new users at the very beginning of the very first intro notebook.

Sorry for the late response here, but I'd like to raise a bit of a warning that the pynvml.smi module is probably not something you should be using. The NVML bindings are effectively kept up to date by the official NVML team, but the smi bindings are not. There is currently only one person who maintains pynvml.smi, and not on a regular basis. Therefore, my personal vote is actually to deprecate the module from this repository. You should be able to use NVML directly to get the information/metrics you need anyway. If enough community members want to maintain an smi Python API, it may make more sense to do that in a separate project.
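For reference, a minimal sketch of pulling the same kind of information straight from the NVML bindings instead of pynvml.smi (return types such as bytes vs. str for the name and driver version vary between pynvml versions, and on a MIG partition the handle lookup needs the MIG-specific calls discussed further down):

import pynvml

pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()       # e.g. "460.84" (bytes on older pynvml)
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)        # product name string
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)   # .total / .free / .used, in bytes
    print(driver, name, mem.total // 1024**2, "MiB total")
pynvml.nvmlShutdown()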

Riebart commented 3 years ago

@rjzamora That makes sense to me.

It's important to note that the dask-cuda portions of Intro01 are also broken with MIG partitions, again because of pynvml-related issues. There's an open PR for that waiting for review, but it seems like MIG is breaking a lot of downstream projects due to what amount to namespace and scoping changes.

taureandyernv commented 3 years ago

Thanks @rjzamora for the information. I'll refactor and update the affected notebooks. When I get the solution PRed, I'll reply back and close this issue. Thanks again @Riebart!

taureandyernv commented 3 years ago

@Riebart , can you test my PR for Intro notebooks on your GPU? https://github.com/rapidsai-community/notebooks-contrib/pull/339. I don't have either GPU to test.

Riebart commented 3 years ago

@Riebart , can you test my PR for Intro notebooks on your GPU? #339. I don't have either GPU to test.

@taureandyernv Still no joy, but a different error this time. This is related to issues we've observed in other areas, such as dask-cuda (ref and what I believe to be the related PR)

import pynvml
pynvml.nvmlInit()

gpu_mem = round(pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).total/1024**3)
print("your GPU has", gpu_mem, "GB")

---------------------------------------------------------------------------

NVMLError_NoPermission                    Traceback (most recent call last)
<ipython-input-5-2daaad25a9ae> in <module>
      2 pynvml.nvmlInit()
      3 
----> 4 gpu_mem = round(pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).total/1024**3)
      5 print("your GPU has", gpu_mem, "GB")

/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py in nvmlDeviceGetMemoryInfo(handle)
   1982     fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
   1983     ret = fn(handle, byref(c_memory))
-> 1984     _nvmlCheckReturn(ret)
   1985     return c_memory
   1986 

/opt/conda/envs/rapids/lib/python3.7/site-packages/pynvml/nvml.py in _nvmlCheckReturn(ret)
    741 def _nvmlCheckReturn(ret):
    742     if (ret != NVML_SUCCESS):
--> 743         raise NVMLError(ret)
    744     return ret
    745 

NVMLError_NoPermission: Insufficient Permissions

taureandyernv commented 3 years ago

Aww man. Okay, I'll check that out Monday, unless it's P0. Does this NVMLError_NoPermission occur with the Ampere Mobile GPU or the MIG partitions? Can you try it with the Mobile if you haven't? I'll check with @pentschev about some of the subtleties of that PR that I might be missing.

Can you send me your environment?

pentschev commented 3 years ago

Just to clarify, https://github.com/rapidsai/dask-cuda/pull/674 has been merged and MIG devices should now be supported by Dask-CUDA. However, it's still not the most user-friendly interface: the only way to enable MIG devices at this time is to specify each MIG instance to CUDA_VISIBLE_DEVICES via its UUID, similar to what's shown in the GPU Instances doc. The UUIDs can be queried with nvidia-smi -L.
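A rough sketch of what that looks like in practice, assuming Dask-CUDA's LocalCUDACluster and its CUDA_VISIBLE_DEVICES argument; the UUID below is a placeholder, substitute the values printed by nvidia-smi -L:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Placeholder MIG instance UUID -- use the real ones from `nvidia-smi -L`,
# comma-separated; one Dask-CUDA worker is started per entry
mig_uuids = "MIG-GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/11/0"

cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=mig_uuids)
client = Client(cluster)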

With MIG, you can't use pynvml.nvmlDeviceGetHandleByIndex(0); that is the cause of the NVMLError_NoPermission. The easiest way is to get the handle by its UUID instead with pynvml.nvmlDeviceGetHandleByUUID(str.encode("MIG-GPU-...")). From that handle, you can then query things like memory, just as you would with a physical GPU device handle. You can also get the handle by specifying the device and MIG instance indices with nvmlDeviceGetMigDeviceHandleByIndex(device=0, index=0) (get the handle for MIG instance index=0 from physical GPU device=0).
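A small sketch of both approaches (the UUID is a placeholder; in the pynvml versions I've seen, nvmlDeviceGetMigDeviceHandleByIndex takes the parent device handle rather than a bare index):

import pynvml

pynvml.nvmlInit()

# Option 1: look the MIG instance up by its UUID (placeholder value; see `nvidia-smi -L`)
mig = pynvml.nvmlDeviceGetHandleByUUID(str.encode("MIG-GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/11/0"))

# Option 2: index into the MIG instances of physical GPU 0
parent = pynvml.nvmlDeviceGetHandleByIndex(0)
mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, 0)

# From here on, the MIG handle behaves like a regular device handle
gpu_mem = round(pynvml.nvmlDeviceGetMemoryInfo(mig).total / 1024**3)
print("your MIG slice has", gpu_mem, "GB")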