Closed: lromor closed this issue 2 years ago
Thanks for opening this. Which specific devices are you referring to? Is it an older NVIDIA GPU? An AMD GPU? Something else? EDIT: This seems to be a somewhat known issue: https://forums.developer.nvidia.com/t/bug-nvml-incorrectly-detects-certain-gpus-as-unsupported/30165. We can produce a better workaround.
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100DX-40C     On   | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
This is a virtual GPU. It seems that some features like temperature monitoring might not be supported for these virtual devices. See for instance page 118 of https://docs.nvidia.com/grid/latest/pdf/grid-vgpu-user-guide.pdf.
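For reference, here is a minimal standalone sketch (not part of TorchServe; it only assumes pynvml is installed and that device 0 is the vGPU) showing how such a query surfaces the "Not Supported" error on these devices:

import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    # Temperature is one of the queries a vGPU may not expose.
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    print(f"GPU temperature: {temp} C")
except pynvml.nvml.NVMLError_NotSupported:
    # Raised on devices (e.g. GRID vGPUs) where the query is unsupported.
    print("This query is not supported on this device")
finally:
    pynvml.nvmlShutdown()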
@msaroufim If you approve for an upstream bug-fix I'd be happy to help.
@msaroufim any update on this?
Hi @lromor, I'm not sure what the right fix is yet. It does seem like this is a problem introduced by NVIDIA: pynvml.nvml.NVMLError_NotSupported: Not Supported
so I believe your best bet is commenting on https://forums.developer.nvidia.com/t/bug-nvml-incorrectly-detects-certain-gpus-as-unsupported/30165 which will give someone on the team some buffer to take a look.
Hi @msaroufim , I've opened an issue here: https://forums.developer.nvidia.com/t/nvml-issue-with-virtual-a100/220718?u=lromor
In case anyone runs into a similar issue and would like a quick fix, I patched the code with:
diff --git a/ts/metrics/system_metrics.py b/ts/metrics/system_metrics.py
index c7aaf6a..9915c9e 100644
--- a/ts/metrics/system_metrics.py
+++ b/ts/metrics/system_metrics.py
@@ -7,6 +7,7 @@ from builtins import str
import psutil
from ts.metrics.dimension import Dimension
from ts.metrics.metric import Metric
+import pynvml
system_metrics = []
dimension = [Dimension('Level', 'Host')]
@@ -69,7 +70,11 @@ def gpu_utilization(num_of_gpu):
system_metrics.append(Metric('GPUMemoryUtilization', value['mem_used_percent'], 'percent', dimension_gpu))
system_metrics.append(Metric('GPUMemoryUsed', value['mem_used'], 'MB', dimension_gpu))
- statuses = list_gpus.device_statuses()
+ try:
+ statuses = list_gpus.device_statuses()
+ except pynvml.nvml.NVMLError_NotSupported:
+ statuses = []
+
for idx, status in enumerate(statuses):
dimension_gpu = [Dimension('Level', 'Host'), Dimension("device_id", idx)]
system_metrics.append(Metric('GPUUtilization', status['utilization'], 'percent', dimension_gpu))
I think this is the right solution. Wanna make a PR for it? May just need to add a logging warning as well
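For reference, a sketch of what the patched block could look like with the suggested warning added. The logger setup and the exact message are illustrative only, and it assumes list_gpus is imported from nvgpu as in the existing module (the diff above calls list_gpus.device_statuses() without adding an import):

import logging

import pynvml
from nvgpu import list_gpus

logger = logging.getLogger(__name__)

try:
    statuses = list_gpus.device_statuses()
except pynvml.nvml.NVMLError_NotSupported:
    # Some devices (e.g. GRID vGPUs) reject per-device status queries;
    # skip those metrics instead of failing startup.
    logger.warning("NVML does not support per-device status queries on this GPU; "
                   "skipping per-device GPU metrics")
    statuses = []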
🐛 Describe the bug
Sometimes NVML does not support monitoring queries for specific devices. Currently, this causes the startup phase to fail.
Error logs
Installation instructions
pytorch/torchserve:latest-gpu
Model Packaging
N/A
config.properties
No response
Versions
Repro instructions
run:
Possible Solution
Handle those NVML exceptions gracefully instead of failing at startup.