pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

NVML_ERROR_NOT_SUPPORTED exception #1722

Closed lromor closed 2 years ago

lromor commented 2 years ago

🐛 Describe the bug

NVML sometimes does not support monitoring queries for specific devices. Currently this causes TorchServe to fail during the startup phase.
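For context, a minimal sketch (not part of the original report) of how such an unsupported query surfaces through pynvml; the device index 0 is only illustrative:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    try:
        # On some devices (e.g. GRID vGPUs) this query is not supported
        # and raises NVMLError_NotSupported instead of returning a value.
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU temperature: {temp} C")
    except pynvml.nvml.NVMLError_NotSupported:
        print("Temperature query not supported on this device")
    finally:
        pynvml.nvmlShutdown()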

Error logs

2022-07-04T12:33:15,023 [ERROR] Thread-20 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
  File "ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/usr/local/lib/python3.6/dist-packages/ts/metrics/system_metrics.py", line 91, in collect_all
    value(num_of_gpu)
  File "/usr/local/lib/python3.6/dist-packages/ts/metrics/system_metrics.py", line 72, in gpu_utilization
    statuses = list_gpus.device_statuses()
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 67, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 67, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 26, in device_status
    temperature = nv.nvmlDeviceGetTemperature(handle, nv.NVML_TEMPERATURE_GPU)
  File "/usr/local/lib/python3.6/dist-packages/pynvml/nvml.py", line 1956, in nvmlDeviceGetTemperature
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.6/dist-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported

Installation instructions

pytorch/torchserve:latest-gpu

Model Packaging

N/A

config.properties

No response

Versions

------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.6.0
torch-model-archiver==0.6.0

Python version: 3.6 (64-bit runtime)
Python executable: /usr/bin/python3

Versions of relevant python libraries:
future==0.18.2
numpy==1.19.5
nvgpu==0.9.0
psutil==5.9.1
requests==2.27.1
torch-model-archiver==0.6.0
torch-workflow-archiver==0.2.4
torchserve==0.6.0
wheel==0.30.0
**Warning: torch not present ..
**Warning: torchtext not present ..
**Warning: torchvision not present ..
**Warning: torchaudio not present ..

Java Version:

OS: N/A
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: N/A
CMake version: N/A

Repro instructions

run:

torchserve --start --foreground --model-store model-store/ 

Possible Solution

Catch the NVMLError_NotSupported exception in the metrics collector so that unsupported queries are skipped instead of aborting startup.

msaroufim commented 2 years ago

Thanks for opening this. Which specific devices are you referring to? Is it an older NVIDIA GPU? An AMD GPU? Something else? EDIT: This seems to be a somewhat known issue: https://forums.developer.nvidia.com/t/bug-nvml-incorrectly-detects-certain-gpus-as-unsupported/30165. We can provide a better workaround.

lromor commented 2 years ago

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100DX-40C     On   | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is a virtual GPU. It seems that some features like temperature monitoring might not be supported for these virtual devices. See for instance page 118 of https://docs.nvidia.com/grid/latest/pdf/grid-vgpu-user-guide.pdf.

lromor commented 2 years ago

@msaroufim If you approve of an upstream bug fix, I'd be happy to help.

lromor commented 2 years ago

@msaroufim any update on this?

msaroufim commented 2 years ago

Hi @lromor, I'm not sure what the right fix is yet. It does seem like this is a problem introduced on NVIDIA's side (pynvml.nvml.NVMLError_NotSupported: Not Supported), so I believe your best bet is commenting on https://forums.developer.nvidia.com/t/bug-nvml-incorrectly-detects-certain-gpus-as-unsupported/30165, which will give someone on their team a nudge to take a look.

lromor commented 2 years ago

Hi @msaroufim , I've opened an issue here: https://forums.developer.nvidia.com/t/nvml-issue-with-virtual-a100/220718?u=lromor

lromor commented 2 years ago

In case anyone runs into a similar issue and wants a quick fix, I patched the code with:

diff --git a/ts/metrics/system_metrics.py b/ts/metrics/system_metrics.py
index c7aaf6a..9915c9e 100644
--- a/ts/metrics/system_metrics.py
+++ b/ts/metrics/system_metrics.py
@@ -7,6 +7,7 @@ from builtins import str
 import psutil
 from ts.metrics.dimension import Dimension
 from ts.metrics.metric import Metric
+import pynvml

 system_metrics = []
 dimension = [Dimension('Level', 'Host')]
@@ -69,7 +70,11 @@ def gpu_utilization(num_of_gpu):
         system_metrics.append(Metric('GPUMemoryUtilization', value['mem_used_percent'], 'percent', dimension_gpu))
         system_metrics.append(Metric('GPUMemoryUsed', value['mem_used'], 'MB', dimension_gpu))

-    statuses = list_gpus.device_statuses()
+    try:
+        statuses = list_gpus.device_statuses()
+    except pynvml.nvml.NVMLError_NotSupported:
+        statuses = []
+
     for idx, status in enumerate(statuses):
         dimension_gpu = [Dimension('Level', 'Host'), Dimension("device_id", idx)]
         system_metrics.append(Metric('GPUUtilization', status['utilization'], 'percent', dimension_gpu))

msaroufim commented 2 years ago

I think this is the right solution. Wanna make a PR for it? It may just need a logging warning added as well.
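
For reference, a minimal sketch of what that could look like, with a logging warning added on top of the patch above (the helper name and logger setup are illustrative, not TorchServe's actual code):

    import logging

    import pynvml
    from nvgpu import list_gpus

    logger = logging.getLogger(__name__)

    def safe_device_statuses():
        """Return per-device statuses, or an empty list when NVML reports
        that the underlying queries are not supported (e.g. on some vGPUs)."""
        try:
            return list_gpus.device_statuses()
        except pynvml.nvml.NVMLError_NotSupported:
            logger.warning(
                "NVML monitoring queries are not supported on this device; "
                "skipping GPU status metrics."
            )
            return []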