🐛 Describe the bug
I'm encountering an issue when starting a container from the pytorch/torchserve:0.12.0-gpu image. The container starts, but it then fails to collect system metrics related to GPU utilization, and model inference runs only on the CPU rather than on the GPU.
Error logs
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-10-19T06:43:12,922 [DEBUG] main org.pytorch.serve.util.ConfigManager - xpu-smi not available or failed: Cannot run program "xpu-smi": error=2, No such file or directory
2024-10-19T06:43:12,928 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-10-19T06:43:12,941 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-10-19T06:43:12,999 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
2024-10-19T06:43:13,194 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.12.0
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 2
Number of CPUs: 48
Max heap size: 30688 M
Python executable: /home/venv/bin/python
Config file: /home/model-server/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://0.0.0.0:8082
Model Store: /home/model-server/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 32
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.|http(s)?://.]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /home/model-server/wf-store
CPP log config: N/A
Model config: N/A
System metrics command: default
Model API enabled: true
2024-10-19T06:43:13,209 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2024-10-19T06:43:13,233 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-10-19T06:43:13,293 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2024-10-19T06:43:13,294 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-10-19T06:43:13,295 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2024-10-19T06:43:13,295 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-10-19T06:43:13,296 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2024-10-19T06:43:14,224 [ERROR] Thread-1 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
_nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
File "/usr/lib/python3.9/ctypes/__init__.py", line 387, in __getattr__
func = self.__getitem__(name)
File "/usr/lib/python3.9/ctypes/__init__.py", line 392, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/venv/lib/python3.9/site-packages/ts/metrics/metric_collector.py", line 27, in <module>
system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all
value(num_of_gpu)
File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 90, in gpu_utilization
statuses = list_gpus.device_statuses()
File "/home/venv/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 75, in device_statuses
return [device_status(device_index) for device_index in range(device_count)]
File "/home/venv/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 75, in <listcomp>
return [device_status(device_index) for device_index in range(device_count)]
File "/home/venv/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 19, in device_status
nv_procs = nv.nvmlDeviceGetComputeRunningProcesses(handle)
File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
return nvmlDeviceGetComputeRunningProcesses_v3(handle);
File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
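For context on the traceback: the AttributeError says that the driver library mounted into the container (/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1) does not export nvmlDeviceGetComputeRunningProcesses_v3, which the pynvml version bundled in the image calls. A small standalone ctypes probe (my own diagnostic sketch, not part of TorchServe) can confirm whether the host driver simply lacks that _v3 entry point:

```python
# Probe whether the NVML shared library visible to the container exports the
# _v3 symbol that newer pynvml releases call. A missing symbol reproduces the
# NVMLError_FunctionNotFound above without going through TorchServe at all.
import ctypes


def has_nvml_symbol(symbol: str, lib_path: str = "libnvidia-ml.so.1") -> bool:
    """Return True if the NVML shared library exports `symbol`.

    Returns False both when the symbol is absent and when the library itself
    cannot be loaded (e.g. no NVIDIA driver installed).
    """
    try:
        lib = ctypes.CDLL(lib_path)
    except OSError:
        return False  # driver library not present at all
    # CDLL.__getattr__ raises AttributeError for missing symbols, exactly as
    # seen in the traceback, so hasattr() is a safe availability check.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Older drivers export only the _v2 entry point; recent pynvml calls _v3.
    for sym in ("nvmlDeviceGetComputeRunningProcesses_v2",
                "nvmlDeviceGetComputeRunningProcesses_v3"):
        print(sym, "->", has_nvml_symbol(sym))
```

If the probe reports only the _v2 symbol, the host NVIDIA driver predates the _v3 API, and either upgrading the driver or pinning an older pynvml inside the container would be the likely fix (assumption based on the symbol name; I have not bisected driver versions).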
Installation instructions
docker pull pytorch/torchserve:0.12.0-gpu
Model Packaging
No Packaging
config.properties
disable_token_authorization=true
enable_model_api=true
service_envelope=body
install_py_dep_per_model=true
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
grpc_inference_address=0.0.0.0
grpc_management_address=0.0.0.0
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store
Versions
Docker version 26.1.3
Repro instructions
docker run --rm -it --gpus all -d -p 28380:8080 -p 28381:8081 --name torch-server-g -v ./config.properties:/home/model-server/config.properties pytorch/torchserve:0.12.0-gpu
Possible Solution
No response