gpuopenanalytics / pynvml

Provide Python access to the NVML library for GPU diagnostics
BSD 3-Clause "New" or "Revised" License

nvmlDeviceGetName throws UnicodeDecodeError invalid start byte #53

Open jsoft88 opened 1 month ago

jsoft88 commented 1 month ago

Running the following code on WSL2 throws the error mentioned in the title:

from pynvml import *

nvmlInit()  # NVML must be initialized before any device query
handle = nvmlDeviceGetHandleByIndex(0)
print(nvmlDeviceGetName(handle))

Stacktrace:

File "<stdin>", line 1, in <module>
  File "/home/user/.local/lib/python3.9/site-packages/pynvml/nvml.py", line 1744, in wrapper
    return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte
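
For context, 0xf8 cannot begin a valid UTF-8 sequence under the current definition (RFC 3629 limits sequences to four bytes), so Python's strict decoder rejects it at position 0. A minimal sketch that reproduces the same message without touching NVML is shown here. The sample bytes are the first five bytes of the value reported further down in this thread, and the variable names are only illustrative.

```python
# First five bytes of the garbled name reported on WSL2 in this thread.
raw = bytes([0xF8, 0x95, 0xA0, 0x81, 0x8E])

try:
    raw.decode()  # default codec is utf-8, default error handling is strict
    message = None
except UnicodeDecodeError as exc:
    # Keep the message so it can be inspected after the handler exits.
    message = str(exc)

print(message)
```

This only shows that the failure comes from the bytes NVML returned, not from the Python wrapper's call to decode().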

By contrast, the nvidia-smi command returns the device info without issues:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.03              Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0  On |                  N/A |
|  0%   35C    P8             16W /  370W |     947MiB /  24576MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

If I decode the output of nvmlDeviceGetName with the utf-16 codec instead, I get this string: '闸膠\uf88e肑要郸膐\uf889낑ꂀ釸膠\uf8a5ꂜ꾁駸膐\uf8a3ꂔꂀ雸膀\uf894낌ꂀ軸肐グ'

pynvml version 11.5.0

UlionTse commented 1 month ago

Same error in WSL2. @rjzamora @XuehaiPan

wookayin commented 1 month ago

This repository is a wrong place. It's not where NVIDIA's pynvml lives.

mattip commented 1 month ago

This is weird. I have reproduced this with the latest pynvml and the latest NVIDIA drivers on WSL2. This is what I get for the c_name.value returned from the call:

-> return c_name.value
(Pdb) p [x for x in c_name.value]
[248, 149, 160, 129, 142, 248, 145, 128, 129, 137, 248, 144, 144, 129, 137, 248, 145, 176, 128, 160, 248, 145, 160, 129, 165, 248, 156, 160, 129, 175, 248, 153, 144, 129, 163, 248, 145, 176, 128, 160, 248, 150, 128, 129, 148, 248, 140, 144, 128, 160, 248, 141, 160, 128, 182, 248, 136, 128, 128, 176]
(Pdb) p c_name.value
b'\xf8\x95\xa0\x81\x8e\xf8\x91\x80\x81\x89\xf8\x90\x90\x81\x89\xf8\x91\xb0\x80\xa0\xf8\x91\xa0\x81\xa5\xf8\x9c\xa0\x81\xaf\xf8\x99\x90\x81\xa3\xf8\x91\xb0\x80\xa0\xf8\x96\x80\x81\x94\xf8\x8c\x90\x80\xa0\xf8\x8d\xa0\x80\xb6\xf8\x88\x80\x80\xb0'
(Pdb) len(c_name.value)
60

Note the repeating 5-byte pattern; the string is 60 bytes long. On the Windows host I get:

(Pdb) [x for x in c_name.value]
[78, 86, 73, 68, 73, 65, 32, 71, 101, 70, 111, 114, 99, 101, 32, 71, 84, 88, 32, 49, 54, 54, 48, 32, 83, 85, 80, 69, 82]
(Pdb) c_name.value
b'NVIDIA GeForce GTX 1660 SUPER'
(Pdb) len(c_name.value)
29

I don't see the connection between the two results. Maybe a bug in the NVIDIA driver v555.85?
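
One possible connection, offered as a hypothesis rather than a confirmed diagnosis: each 5-byte group starting with 0xf8 has the shape of an obsolete 5-byte UTF-8 sequence. If each group is decoded to an integer and that integer is split into two little-endian 16-bit units, the units spell the device name as UTF-16. That would be consistent with UTF-16 text being read as 32-bit wide characters (wchar_t is 16 bits on Windows and typically 32 bits on Linux) and then re-encoded, but nothing in this thread confirms that this is what the driver does. The helper names below are hypothetical.

```python
import struct

# Bytes reported for c_name.value on WSL2 in the comment above.
WSL_NAME = bytes([
    248, 149, 160, 129, 142, 248, 145, 128, 129, 137,
    248, 144, 144, 129, 137, 248, 145, 176, 128, 160,
    248, 145, 160, 129, 165, 248, 156, 160, 129, 175,
    248, 153, 144, 129, 163, 248, 145, 176, 128, 160,
    248, 150, 128, 129, 148, 248, 140, 144, 128, 160,
    248, 141, 160, 128, 182, 248, 136, 128, 128, 176,
])


def decode_5byte_group(group: bytes) -> int:
    """Decode one obsolete 5-byte UTF-8 sequence (lead byte 0xF8..0xFB)."""
    if len(group) != 5 or group[0] & 0xFC != 0xF8:
        raise ValueError("not a 5-byte sequence with a 0xF8..0xFB lead byte")
    value = group[0] & 0x03  # two payload bits in the lead byte
    for byte in group[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)  # six payload bits each
    return value


def recover_name(raw: bytes) -> str:
    """Reinterpret each decoded 32-bit value as two UTF-16-LE code units."""
    if len(raw) % 5 != 0:
        raise ValueError("length is not a multiple of 5")
    packed = b"".join(
        struct.pack("<I", decode_5byte_group(raw[i:i + 5]))
        for i in range(0, len(raw), 5)
    )
    return packed.decode("utf-16-le")


recovered = recover_name(WSL_NAME)
print(repr(recovered))
```

On the bytes above this yields the first 24 characters of the name seen on the Windows host, ending just before "SUPER". The sketch does not explain why the tail is missing.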

mattip commented 1 month ago

nvidia-smi on WSL somehow gets the name right:

$ nvidia-smi
Wed May 29 11:59:40 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.03              Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1660 ...    On  |   00000000:08:00.0  On |                  N/A |
| 28%   39C    P8             16W /  125W |    1945MiB /   6144MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        35      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

mattip commented 1 month ago

"This repository is a wrong place. It's not where NVIDIA's pynvml lives."

Right. I can confirm this also happens in gpustat with nvidia-ml-py-12.550.52. Is there a place to get NVIDIA's attention?

$ python -m gpustat --debug
Error on querying NVIDIA devices. Use --debug flag to see more details.
'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

Traceback (most recent call last):
  File "/tmp/venv310/lib/python3.10/site-packages/gpustat/cli.py", line 58, in print_gpustat
    gpu_stats = GPUStatCollection.new_query(debug=debug, id=id)
  File "/tmp/venv310/lib/python3.10/site-packages/gpustat/core.py", line 603, in new_query
    gpu_info = get_gpu_info(handle)
  File "/tmp/venv310/lib/python3.10/site-packages/gpustat/core.py", line 456, in get_gpu_info
    name = _decode(N.nvmlDeviceGetName(handle))
  File "/tmp/venv310/lib/python3.10/site-packages/pynvml.py", line 2094, in wrapper
    return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

mattip commented 1 month ago

I posted to an NVIDIA forum, https://forums.developer.nvidia.com/t/nvmldevicegetname-problem-in-wsl-on-windows/294491, but am not optimistic. The other posts there do not get much traffic.

rjzamora commented 1 month ago

Thanks all for engaging. I'll do my best to find someone who can help. Sorry for the delay.

rjzamora commented 1 month ago

Small update: this issue has been escalated to the NVML team, and the fix has been merged into the upcoming r560 driver branch. I do not believe there are plans to re-release the short-lived r555 branch.
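
Until a fixed driver is installed, a tool could decide whether to trust the device name based on the driver branch. The following is a rough sketch under the assumption, taken from this thread only, that the problem is limited to the r555 branch and is fixed from r560 onward. The function names are hypothetical and are not part of pynvml, and the version string is passed in by the caller.

```python
def driver_branch(version: str) -> int:
    """Return the leading branch number of a driver version such as '555.85'."""
    return int(version.strip().split(".", 1)[0])


def name_query_may_be_garbled(version: str) -> bool:
    # Assumption from this thread: only the r555 branch is affected.
    return driver_branch(version) == 555


print(name_query_may_be_garbled("555.85"))
```

A caller could use the result to fall back to a placeholder name instead of letting the UnicodeDecodeError propagate.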