pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Error in MetricCollector when starting pytorch/torchserve:0.12.0-gpu container #3349

Open Hspix opened 1 month ago


🐛 Describe the bug

I'm encountering an issue when starting a container from the pytorch/torchserve:0.12.0-gpu image. The container starts but then fails to collect system metrics, specifically GPU utilization. During actual model inference, only the CPU is used rather than the GPU.

Error logs

```
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-10-19T06:43:12,922 [DEBUG] main org.pytorch.serve.util.ConfigManager - xpu-smi not available or failed: Cannot run program "xpu-smi": error=2, No such file or directory
2024-10-19T06:43:12,928 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-10-19T06:43:12,941 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-10-19T06:43:12,999 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
2024-10-19T06:43:13,194 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.12.0
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 2
Number of CPUs: 48
Max heap size: 30688 M
Python executable: /home/venv/bin/python
Config file: /home/model-server/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://0.0.0.0:8082
Model Store: /home/model-server/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 32
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /home/model-server/wf-store
CPP log config: N/A
Model config: N/A
System metrics command: default
Model API enabled: true
2024-10-19T06:43:13,209 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2024-10-19T06:43:13,233 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-10-19T06:43:13,293 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2024-10-19T06:43:13,294 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-10-19T06:43:13,295 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2024-10-19T06:43:13,295 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-10-19T06:43:13,296 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2024-10-19T06:43:14,224 [ERROR] Thread-1 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
  File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/usr/lib/python3.9/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib/python3.9/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3
```

During handling of the above exception, another exception occurred:

```
Traceback (most recent call last):
  File "/home/venv/lib/python3.9/site-packages/ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all
    value(num_of_gpu)
  File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 90, in gpu_utilization
    statuses = list_gpus.device_statuses()
  File "/home/venv/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 75, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/home/venv/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 75, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "/home/venv/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 19, in device_status
    nv_procs = nv.nvmlDeviceGetComputeRunningProcesses(handle)
  File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v3(handle);
  File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
  File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
```
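The last frame shows the root cause: the pynvml shipped in the container requests `nvmlDeviceGetComputeRunningProcesses_v3`, but the `libnvidia-ml.so.1` injected from the host driver does not export that symbol. A minimal probe for this kind of library/symbol mismatch (illustrative sketch, not TorchServe code; demonstrated with libc so it runs without a GPU — substitute `"nvidia-ml"` and the `_v3` symbol name on the affected host):

```python
import ctypes
import ctypes.util

def has_symbol(libname: str, symbol: str) -> bool:
    """Return True if the shared library exports the given symbol."""
    path = ctypes.util.find_library(libname)
    if path is None:
        return False  # library not found at all
    lib = ctypes.CDLL(path)
    try:
        getattr(lib, symbol)  # raises AttributeError if symbol is absent,
        return True           # exactly like the NVML failure above
    except AttributeError:
        return False

# On the affected host one would check:
#   has_symbol("nvidia-ml", "nvmlDeviceGetComputeRunningProcesses_v3")
print(has_symbol("c", "printf"))            # True on typical Linux
print(has_symbol("c", "no_such_fn_xyz"))    # False
```

If the probe returns False for the `_v3` symbol, the host driver is older than the NVML API pynvml expects, and updating the driver (or pinning an older pynvml/nvidia-ml-py) is the usual remedy.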

Installation instructions

```
docker pull pytorch/torchserve:0.12.0-gpu
```

Model Packaging

No Packaging

config.properties

```properties
disable_token_authorization=true
enable_model_api=true
service_envelope=body
install_py_dep_per_model=true
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
grpc_inference_address=0.0.0.0
grpc_management_address=0.0.0.0
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store
```
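Editor's note, not part of the original report: the startup log above prints "Disable system metrics: false", and TorchServe accepts a corresponding `disable_system_metrics` property. Assuming GPU telemetry can be sacrificed, adding it to config.properties is one way to stop the collector from crashing on every poll (it does not fix GPU inference itself):

```properties
# Workaround sketch: skip system-metric collection entirely,
# so the MetricCollector never touches NVML.
disable_system_metrics=true
```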

Versions

Docker version 26.1.3

Repro instructions

```
docker run --rm -it --gpus all -d -p 28380:8080 -p 28381:8081 --name torch-server-g -v ./config.properties:/home/model-server/config.properties pytorch/torchserve:0.12.0-gpu
```
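For context, the error fires inside TorchServe's metric collector thread. A defensive pattern like the following sketch (illustrative only, not TorchServe's actual code; names are hypothetical) would let GPU metric collection degrade to an empty reading instead of raising on every poll:

```python
import logging

def collect_gpu_metrics(device_statuses):
    """Return per-GPU status dicts from the given provider, or [] when
    the NVML binding cannot resolve a function in the driver library."""
    try:
        return device_statuses()
    except Exception as exc:  # e.g. pynvml's NVMLError_FunctionNotFound
        logging.warning("Skipping GPU metrics: %s", exc)
        return []

# Simulate the failing nvgpu.list_gpus.device_statuses() call:
def broken_statuses():
    raise RuntimeError("Function Not Found")

print(collect_gpu_metrics(broken_statuses))  # -> []
```

A healthy provider passes through unchanged, e.g. `collect_gpu_metrics(lambda: [{"index": 0}])` returns `[{"index": 0}]`.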

Possible Solution

No response