So the logic for whether to use the ONNX CUDA execution provider is controlled here:
https://github.com/pytorch/serve/blob/master/ts/torch_handler/base_handler.py#L80-L84
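Roughly, that selection works like the sketch below (a simplified illustration, not a copy of the linked code; the function name and condition are approximations):

```python
# Simplified sketch of the provider selection: the ONNX session only requests
# the CUDA provider when map_location resolves to "cuda".
import onnxruntime as ort

def setup_ort_session(model_path, map_location):
    providers = (
        ["CUDAExecutionProvider", "CPUExecutionProvider"]
        if map_location == "cuda"
        else ["CPUExecutionProvider"]
    )
    # onnxruntime logs the warning you posted and falls back to CPU if the
    # CUDA provider's libraries cannot be loaded.
    return ort.InferenceSession(model_path, providers=providers)
```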
Our tests were passing on both CPU and GPU, so I don't suspect a bug in how this is set.
What's confusing me about your setup is that it uses both the ONNX GPU runtime and CPU-only torch.
[W:onnxruntime:Default, onnxruntime_pybind_state.cc:578 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
It seems to me like what's happening is that in your handler map_location is cuda, so make sure that's not the case: https://github.com/pytorch/serve/blob/master/ts/torch_handler/base_handler.py#L112
Are you using the same dependencies for onnx and onnxruntime when measuring the extra memory overhead? torch 2.0 in general has way more dependencies, but the overhead you're seeing is significant.
I suspect you should be able to repro your errors without torchserve in the loop, which will make debugging this a bit easier. Let me know if this all makes sense.
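For example, a standalone check along these lines (a sketch; "model.onnx" is a placeholder for your exported model) should reproduce the CUDAExecutionProvider failure without torchserve:

```python
# Standalone repro sketch, no torchserve involved.
import onnxruntime as ort

# Should include "CUDAExecutionProvider" if onnxruntime-gpu and its CUDA
# dependencies are installed correctly.
print(ort.get_available_providers())

sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# If the CUDA provider failed to initialize, only the CPU provider shows up here.
print(sess.get_providers())
```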
Hello @msaroufim ,
Following your comments, I simplified the requirements and the Dockerfile to make sure there's nothing wrong with my setup.
I don't understand why I should set map_location=None when I want my model to use CUDA.
Summary: With the updated setup I still get the error (Failed to create CUDAExecutionProvider).
My test setup is available here: https://github.com/dt-subaandh-krishnakumar/pytorch_issue. I attached the logs, Dockerfiles, and GPU info. This issue occurs with all ONNX models (this is the one used for testing: https://huggingface.co/docs/transformers/serialization).
During my test I observed that the following libraries are missing in the torchserve 0.8.1 docker image.
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
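A quick way to confirm this from inside the container is to check whether the CUDA shared libraries those wheels provide can be loaded at all (a sketch; the library names are the ones the ONNX CUDA provider typically needs for CUDA 11):

```python
# Sketch: run inside the container to see whether the CUDA shared libraries
# shipped by the nvidia-* wheels (or the base image) are on the loader path.
import ctypes

for lib in ["libcublas.so.11", "libcudnn.so.8", "libcufft.so.10", "libcurand.so.10"]:
    try:
        ctypes.CDLL(lib)
        print(f"{lib}: found")
    except OSError:
        print(f"{lib}: MISSING")
```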
I believe if Bug 1 is fixed this won't be a problem, as I won't need to install torch 2.0, which has a lot of other dependencies.
Please let me know if you need any more information.
Possible Solution: During my test I observed that the following libraries are missing in the torchserve 0.8.1 docker image.
This is interesting, tagging @agunapal since a similar issue came up with deepspeed - I was not aware ONNX depends on all of this. In that case you can check whether your issue goes away if you build a new Docker image with the NVIDIA runtime base image, like so: https://github.com/pytorch/serve/blob/master/docker/Dockerfile#L6C3-L6C143
There were some errors with this: https://github.com/pytorch/serve/blob/master/docker/Dockerfile#L6C3-L6C143
In my case, this worked:
docker build --file Dockerfile --build-arg BASE_IMAGE=nvidia/cuda:11.7.0-cudnn8-runtime-ubuntu20.04 --build-arg PYTHON_VERSION=3.8 -t torchserve:0.8.1 .
But I have a new issue: the metrics API returns nothing. Any idea why?
File "/home/venv/lib/python3.8/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
_nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
File "/usr/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
func = self.__getitem__(name)
File "/usr/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/local/nvidia/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ts/metrics/metric_collector.py", line 27, in <module>
system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
File "/home/venv/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all
value(num_of_gpu)
File "/home/venv/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 90, in gpu_utilization
statuses = list_gpus.device_statuses()
File "/home/venv/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 75, in device_statuses
return [device_status(device_index) for device_index in range(device_count)]
File "/home/venv/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 75, in <listcomp>
return [device_status(device_index) for device_index in range(device_count)]
File "/home/venv/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 19, in device_status
nv_procs = nv.nvmlDeviceGetComputeRunningProcesses(handle)
File "/home/venv/lib/python3.8/site-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
return nvmlDeviceGetComputeRunningProcesses_v3(handle);
File "/home/venv/lib/python3.8/site-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
File "/home/venv/lib/python3.8/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
2023-06-27T14:49:12,149 [ERROR] Thread-1 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
File "/home/venv/lib/python3.8/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
_nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
File "/usr/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
func = self.__getitem__(name)
File "/usr/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/local/nvidia/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ts/metrics/metric_collector.py", line 27, in <module>
system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
File "/home/venv/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all
value(num_of_gpu)
File "/home/venv/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 90, in gpu_utilization
statuses = list_gpus.device_statuses()
File "/home/venv/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 75, in device_statuses
return [device_status(device_index) for device_index in range(device_count)]
File "/home/venv/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 75, in <listcomp>
return [device_status(device_index) for device_index in range(device_count)]
File "/home/venv/lib/python3.8/site-packages/nvgpu/list_gpus.py", line 19, in device_status
nv_procs = nv.nvmlDeviceGetComputeRunningProcesses(handle)
File "/home/venv/lib/python3.8/site-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
return nvmlDeviceGetComputeRunningProcesses_v3(handle);
File "/home/venv/lib/python3.8/site-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
File "/home/venv/lib/python3.8/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
Seems like an NVIDIA driver issue - see this for example https://github.com/NVIDIA/k8s-device-plugin/issues/331
Try updating this line https://github.com/pytorch/serve/blob/master/docker/build_image.sh#L46 to nvidia/cuda:11.7.0-cudnn8-runtime-ubuntu20.04
and then run the build_image.sh script
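To confirm it really is the driver's NVML missing the _v3 entry point (independent of torchserve), a standalone check along these lines might help (a sketch using pynvml, which the metrics collector relies on):

```python
# Sketch: exercise the failing NVML call directly, without torchserve.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    # On older drivers this raises "Function Not Found" because libnvidia-ml.so
    # does not export nvmlDeviceGetComputeRunningProcesses_v3.
    print(pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
except pynvml.NVMLError as err:
    print("NVML call failed:", err)
pynvml.nvmlShutdown()
```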
I updated build_image.sh and rebuilt the image (./build_image.sh -g). I'm able to initialize the models, but the Metrics API isn't working (curl http://127.0.0.1:8082/metrics) and returns an empty response.
Thanks @dt-subaandh-krishnakumar, I believe that sounds like a separate issue - tagging @namannandan who owns this.
Might make sense to open a separate issue for this though so we don't lose it.
Fixed in this PR https://github.com/pytorch/serve/pull/2435
I faced this issue; you need to install CUDA 11.8 and the corresponding torch build with CUDA 11.8 support.
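A quick sanity check that the installed torch build matches (a sketch; 11.8 is the version suggested above):

```python
# Verify the torch build was compiled against the expected CUDA version.
import torch

print(torch.__version__)         # e.g. "2.x.x+cu118"
print(torch.version.cuda)        # expected: "11.8"
print(torch.cuda.is_available())
```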
🐛 Describe the bug
I recently updated the torchserve version from 0.7.1-gpu to 0.8.1-gpu.
Current setup
I used torchserve:0.7.1-gpu from source and built a docker image with torch 2.0+cpu. The ONNX GPU models were running, and the models used ~8.5GB memory and ~4GB GPU (CUDA 11.7).
Bug
With torchserve 0.8.1, the torch 2.0+cpu setup no longer worked and failed with the error below (Failed to create CUDAExecutionProvider). It works if I install torch 2.0 with GPU dependencies, but doing so increased the memory (~13GB) and GPU (~6GB) consumption. The models were not updated. I built torchserve 0.8.1 with ./build_image.sh -py 3.8 -cv cu117 -g -t torchserve_py38
Error logs
[W:onnxruntime:Default, onnxruntime_pybind_state.cc:578 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
Installation instructions
Yes, I ran ./build_image.sh -py 3.8 -cv cu117 -g -t torchserve_py38
Model Packaging
I converted PyTorch models to ONNX and served them using a custom handler.
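For reference, an export along these lines is what precedes the archiving step (an illustrative sketch with a placeholder model and shapes, not the author's actual packaging code):

```python
# Illustrative sketch: convert a PyTorch model to ONNX before packaging it
# with torch-model-archiver and the custom handler.
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
example_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```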
config.properties
No response
Versions
Working
Bug
Repro instructions
Issue 1
1. Run an ONNX model with python3.8, onnxruntime-gpu==1.13.1, torchserve 0.7.1-gpu and torch 2.0.0+cpu. (Take note of the GPU and memory consumption.)
2. Build torchserve 0.8.1 with ./build_image.sh -py 3.8 -cv cu117 -g -t torchserve_py38 and run the same model with onnxruntime-gpu==1.13.1 and torch 2.0.0+cpu.
Issue 2
If you use torch 2.0 instead of torch 2.0+cpu, the memory and GPU consumption will increase.
Possible Solution
No response