Open erikhuck opened 8 months ago
See this: https://github.com/wookayin/gpustat/issues/161#issuecomment-1784007533. NVIDIA driver versions 535.43~86 are broken and report incorrect process information.
Also, this is not the right place for the `pynvml` package. I recommend you use the official bindings, `nvidia-ml-py`.
Description
When using `nvmlDeviceGetComputeRunningProcesses` to get the `usedGpuMemory` of all the processes using a particular GPU (in this case, GPU 0), I saw that erroneous results appeared to be reported. When compared with `nvidia-smi` in the terminal, the `usedGpuMemory` field contained the process ID, while the `pid` field contained the used GPU memory rather than the process ID. So the values were swapped. Sometimes other fields in the process object contained the process ID or GPU memory values, so the field values of the process objects output by `nvmlDeviceGetComputeRunningProcesses` were shuffled overall. Investigation is warranted to ensure `nvmlDeviceGetComputeRunningProcesses` consistently provides correct output.
Code for reproducing the bug
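The reporter's script is not shown here. As a stand-in, the following is a minimal sketch of the kind of query described above, with a rough plausibility check for swapped fields. The helper names `find_implausible` and `query_gpu` and the `MAX_PLAUSIBLE_PID` bound are hypothetical and not part of `pynvml`; the bound assumes a Linux system, and the check is a heuristic that can miss some shuffled values.

```python
from types import SimpleNamespace

# Assumption: Linux, where pid_max cannot be configured above 2**22.
MAX_PLAUSIBLE_PID = 2 ** 22


def find_implausible(procs, total_memory_bytes):
    """Return process entries whose pid or usedGpuMemory cannot be valid."""
    suspicious = []
    for p in procs:
        used = p.usedGpuMemory  # may be None when NVML cannot report it
        pid_bad = not (0 < p.pid <= MAX_PLAUSIBLE_PID)
        mem_bad = used is not None and used > total_memory_bytes
        if pid_bad or mem_bad:
            suspicious.append(p)
    return suspicious


def query_gpu(index=0):
    """Return (processes, total memory in bytes) for one GPU.

    Needs an NVIDIA GPU, a driver, and the nvidia-ml-py package.
    """
    import pynvml  # imported lazily so the check above works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        return procs, total
    finally:
        pynvml.nvmlShutdown()


# Made-up entries illustrating the swap described in this issue.
total = 16 * 1024 ** 3
ok = SimpleNamespace(pid=12345, usedGpuMemory=2 * 1024 ** 3)
swapped = SimpleNamespace(pid=2 * 1024 ** 3, usedGpuMemory=12345)
flagged = find_implausible([ok, swapped], total)
```

On a machine with a GPU, `find_implausible(*query_gpu(0))` would apply the same check to live NVML output; comparing against `nvidia-smi` remains the more reliable test.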
Environment
torch==2.0.1
pynvml==11.5.0
CUDA version: 12.2
GPU Model: NVIDIA GeForce RTX 4080
Driver Version: 535.54.03
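
To check whether a given driver falls in the range mentioned in the comment above, a small sketch like the following may help. It assumes that "535.43~86" means versions 535.43 through 535.86 inclusive, which is an interpretation rather than something stated by NVIDIA; `is_affected` and `installed_driver_version` are hypothetical helper names.

```python
def parse_driver_version(version):
    """Turn '535.54.03' (str or bytes) into a tuple of ints, e.g. (535, 54, 3)."""
    if isinstance(version, bytes):  # older pynvml releases return bytes
        version = version.decode()
    return tuple(int(part) for part in version.strip().split("."))


def is_affected(version):
    """True if the version lies in the assumed broken range 535.43 to 535.86."""
    parts = parse_driver_version(version)
    major = parts[0]
    minor = parts[1] if len(parts) > 1 else 0
    return major == 535 and 43 <= minor <= 86


def installed_driver_version():
    """Read the driver version through NVML (needs a GPU and nvidia-ml-py)."""
    import pynvml

    pynvml.nvmlInit()
    try:
        return pynvml.nvmlSystemGetDriverVersion()
    finally:
        pynvml.nvmlShutdown()


# The driver listed in the environment above.
reported = is_affected("535.54.03")
```

Under that assumption, the driver in this report (535.54.03) is inside the affected range.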