Open khelkun opened 1 year ago
Yes, you need to install cuda dependency if you want to use GPU. https://github.com/pytorch/serve#-quick-start-with-torchserve. Please try it and let us know.
@agunapal, I did:
By the way, I tried those 2 options:
- re-installing dependencies with `python ./ts_scripts/install_dependencies.py --environment=prod --cuda=cu102`
- installing the CUDA 10.2 Toolkit
But I observe the same result:
Number of GPUs: 0
in the log, and a slow inference of more than 300 ms for the densenet_161 demo model.
Did I miss something?
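A quick way to narrow down whether the problem is in TorchServe or in the PyTorch install is to ask PyTorch directly how many CUDA devices it can see. Here is a minimal sketch; `visible_gpu_count` is my own helper name, not a TorchServe API, and it simply reports 0 when torch is missing or has no working CUDA support:

```python
def visible_gpu_count() -> int:
    """Report how many CUDA devices this Python environment can see.

    Returns 0 when torch is not installed, when the installed wheel is
    CPU-only, or when the CUDA runtime/driver does not match the wheel,
    which are the usual causes of a "Number of GPUs: 0" symptom.
    """
    try:
        import torch  # present only if PyTorch is installed
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.device_count()


print(f"Number of GPUs: {visible_gpu_count()}")
```

If this prints 0 while `nvidia-smi` sees the GPU, the installed torch wheel (or its CUDA runtime) is the thing to fix, not TorchServe.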
@khelkun What's your version of CUDA? Please note that the nightly version of torchserve uses PyTorch 2.0, which is built against CUDA 11.7, so you need CUDA 11.7 installed. Also, the `--cuda` value passed to the install dependencies script should be `cu117`.
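As a side note, the `--cuda` value is just the CUDA version with the dot removed and a `cu` prefix. A tiny helper illustrating the mapping (`cuda_flag` is a hypothetical name, not part of the TorchServe scripts):

```python
def cuda_flag(cuda_version: str) -> str:
    """Map a CUDA runtime version like "10.2" or "11.7.64" to the
    cuNNN string expected by install_dependencies.py's --cuda option."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"


print(cuda_flag("10.2"))     # → cu102
print(cuda_flag("11.7.64"))  # → cu117
```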
@agunapal CUDA toolkit 10.2, torchserve 0.7.1, so I don't use the nightly version of torchserve.
However, it seems I finally set things up correctly. The torchserve log was printing this:
2023-02-10T15:34:43,369 [INFO ] W-9000-coral_best_0.1-stdout MODEL_LOG - dynamo/inductor are not installed.
2023-02-10T15:34:43,369 [INFO ] W-9000-coral_best_0.1-stdout MODEL_LOG - For GPU please run pip3 install numpy --pre torch[dynamo] --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117
2023-02-10T15:34:43,369 [INFO ] W-9000-coral_best_0.1-stdout MODEL_LOG - for CPU please run pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu
So I ran `pip3 install numpy --pre torch[dynamo] --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117`. This installed (among other packages) the torch 2.0.0.dev20230210+cu117 python package.
So I've just installed the CUDA toolkit 11.7 (which installs display driver 516.01). Please note that I've not re-installed the torchserve dependencies with cu117 yet. Now the torchserve log prints:
Number of GPUs: 1
The `python serve/ts_scripts/print_env_info.py` command output is now:
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch:
torchserve==0.7.1
torch-model-archiver==0.7.1
Python version: 3.9 (64-bit runtime)
Python executable: C:\Users\3dverse\anaconda3\python.exe
Versions of relevant python libraries:
numpy==1.24.2
numpydoc==1.4.0
torch==2.0.0.dev20230210+cu117
torch-model-archiver==0.7.1
torchaudio==0.13.1
torchserve==0.7.1
torchtext==0.14.1
torchvision==0.14.1
Java Version:
OS: Microsoft Windows Server 2019 Datacenter
GCC version: N/A
Clang version: N/A
CMake version: N/A
Is CUDA available: Yes
CUDA runtime version: 11.7.64
GPU models and configuration:
GPU 0: Tesla T4
Nvidia driver version: 516.01
cuDNN version: None
The inference output of the "kitten_small.jpg" is:
2023-02-14T10:08:05,017 [INFO ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 31
2023-02-14T10:08:05,031 [INFO ] W-9000-densenet161_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:31.25|#ModelName:densenet161,Level:Model|#hostname:GPU-EU-West,requestID:2fa258a7-85fa-4d80-8fda-cf51ee79549a,timestamp:1676369285
2023-02-14T10:08:05,031 [INFO ] W-9000-densenet161_1.0 ACCESS_LOG - /127.0.0.1:52870 "PUT /predictions/densenet161 HTTP/1.1" 200 45
2023-02-14T10:08:05,034 [INFO ] W-9000-densenet161_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:GPU-EU-West,timestamp:1676368987
2023-02-14T10:08:05,034 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.job.Job - Waiting time ns: 83200, Backend time ns: 45748100
2023-02-14T10:08:05,034 [INFO ] W-9000-densenet161_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:GPU-EU-West,timestamp:1676369285
2023-02-14T10:08:05,034 [INFO ] W-9000-densenet161_1.0 TS_METRICS - WorkerThreadTime.ms:17|#Level:Host|#hostname:GPU-EU-West,timestamp:1676369285
So `PredictionTime.Milliseconds:31.25` is way faster and proves the GPU is used, imho!
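Those MODEL_METRICS lines follow a StatsD-like `Name.Unit:value|#dimensions` shape, so they are easy to post-process. Below is a small sketch of a parser for the metric payload (everything after `MODEL_METRICS - `); it is my own reading of the format from the lines above, not an official TorchServe parser:

```python
import re


def parse_ts_metric(payload: str):
    """Parse a TorchServe metric payload of the shape
    Name.Unit:value|#dim1:v1,dim2:v2|#hostname:...,requestID:...,timestamp:...
    Returns a dict with the metric name, numeric value, and first
    dimension group, or None when the payload does not match.
    """
    m = re.match(r"(?P<name>[\w.]+):(?P<value>[\d.]+)\|#(?P<dims>[^|]*)", payload)
    if not m:
        return None
    dims = dict(d.split(":", 1) for d in m.group("dims").split(",") if ":" in d)
    return {"name": m.group("name"), "value": float(m.group("value")), "dims": dims}


metric = parse_ts_metric(
    "PredictionTime.Milliseconds:31.25|#ModelName:densenet161,Level:Model"
)
print(metric)  # → {'name': 'PredictionTime.Milliseconds', 'value': 31.25, 'dims': {'ModelName': 'densenet161', 'Level': 'Model'}}
```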
The last issue is that the torchserve server prints this every minute:
2023-02-14T10:08:52,685 [WARN ] pool-3-thread-2 org.pytorch.serve.metrics.MetricCollector - Parse metrics failed: NumExpr defaulting to 4 threads.
2023-02-14T10:08:53,076 [ERROR] Thread-7 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
File "C:\Users\3dverse\anaconda3\Lib\site-packages\ts\metrics\metric_collector.py", line 27, in <module>
system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
File "C:\Users\3dverse\anaconda3\lib\site-packages\ts\metrics\system_metrics.py", line 119, in collect_all
value(num_of_gpu)
File "C:\Users\3dverse\anaconda3\lib\site-packages\ts\metrics\system_metrics.py", line 71, in gpu_utilization
info = nvgpu.gpu_info()
File "C:\Users\3dverse\anaconda3\lib\site-packages\nvgpu\__init__.py", line 15, in gpu_info
mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
File "C:\Users\3dverse\anaconda3\lib\site-packages\nvgpu\__init__.py", line 15, in <listcomp>
mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
ValueError: invalid literal for int() with base 10: '00000001:00:00.0 Off'
So I re-installed the torchserve dependencies with `python ./ts_scripts/install_dependencies.py --environment=prod --cuda=cu117`, but the previous exception remains. This error is not a big deal and may disappear if I properly re-install torchserve and all its python dependencies.
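For context, the traceback shows `nvgpu` calling `int()` on a whitespace-split field of the `nvidia-smi` table, and on this driver a PCI bus id ends up where a memory value is expected. Below is a defensive rewrite of that parse, using a regex that only accepts real `NNNNMiB` tokens; the sample row is a made-up illustration of the failing layout, not actual `nvidia-smi` output:

```python
import re

# Made-up row mimicking the nvidia-smi layout that broke nvgpu: a PCI bus
# id sits near the whitespace-split position where a memory value is expected.
ROW = "| 00000001:00:00.0 Off |  1234MiB / 15109MiB |"


def memory_fields(row: str):
    """Return (used_mib, total_mib) from an nvidia-smi table row.

    Unlike int(field.replace('MiB', '')), this only accepts tokens that
    actually look like memory values, so a field such as
    '00000001:00:00.0 Off' can no longer raise ValueError.
    """
    values = [int(v) for v in re.findall(r"(\d+)MiB", row)]
    return (values[0], values[1]) if len(values) >= 2 else None


print(memory_fields(ROW))  # → (1234, 15109)
```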
My guess is that the issue came from the NVIDIA display driver, which was an old 451.82, although that is the one recommended by the Standard NC4as T4 v3 Azure documentation. However, the display driver available from the NVIDIA portal for CUDA 11.7 is 517.88, which is even more recent than the one installed by the CUDA 11.7 toolkit installer (display driver 516.01).
N.B.: the display driver available from the NVIDIA portal for CUDA 10.2 is 443.66, which is older than the one I installed in the first place following the Azure recommendations (451.82).
Thanks for your help @agunapal.
@khelkun Great. Glad it worked. I am also curious about how you installed TorchServe initially.
If possible, do you mind pasting the logs of the install dependencies script, `python ./ts_scripts/install_dependencies.py --environment=prod --cuda=cu102`, run in a fresh env?
The logs you first posted don't look right. It seemed you had PyTorch with cu117, which should not happen. I am trying to figure out if there is a bug.
@agunapal I'll do a fresh install asap and get back to you.
I think there's no bug; I'm pretty sure I ran `pip3 install numpy --pre torch[dynamo] --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117` before pasting the logs (sorry about that, but because the GPU was not detected I was messing around).
Actually, serving the model does not work after this "dynamo/inductor" installation, because it complains about an incompatibility between the torch and torchvision versions. So I had to re-install torch==1.13.1, and the following log came back:
2023-02-10T15:34:43,369 [INFO ] W-9000-coral_best_0.1-stdout MODEL_LOG - dynamo/inductor are not installed.
2023-02-10T15:34:43,369 [INFO ] W-9000-coral_best_0.1-stdout MODEL_LOG - For GPU please run pip3 install numpy --pre torch[dynamo] --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117
2023-02-10T15:34:43,369 [INFO ] W-9000-coral_best_0.1-stdout MODEL_LOG - for CPU please run pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu
🐛 Describe the bug
First, thanks for this great tool. This is my first attempt at deploying TorchServe, and it does work on Windows Server 2019.
However, the GPU does not seem to be detected by TorchServe on an Azure Windows Server 2019 VM (Standard NC4as T4 v3). The GPU driver is correctly installed and detected by "GPU-Z":
The `nvidia-smi` output:

Error logs

See the line
Number of GPUs: 0
in the output of the `torchserve --start --ncs --model-store model_store --models densenet161.mar` command:

This is the inference output of the "kitten_small.jpg":

Where 296 ms of inference time seems to confirm the GPU is not used.
Installation instructions
I just followed the TorchServe on Windows tutorial: "Install from binaries".
Model Packaging
It's the densenet_161 model from the "Serve Model" tutorial
config.properties
No response
Versions
python serve/ts_scripts/print_env_info.py
Repro instructions
I could probably write a step by step repro, but the issue is about the VM running TorchServe.
Still I just followed the TorchServe on Windows tutorial.
Possible Solution
I may have missed something that may not be mentioned in the Windows installation procedure.
Should I have executed
`python ./ts_scripts/install_dependencies.py --environment=prod --cuda=cu102`
instead of `python ./ts_scripts/install_dependencies.py --environment=prod`?
Should I have installed CUDA 10.2 for Windows first?

By the way, I tried those 2 options:
- re-installing dependencies with `python ./ts_scripts/install_dependencies.py --environment=prod --cuda=cu102`
- installing the CUDA 10.2 Toolkit

But I observe the same result:
Number of GPUs: 0
in the log, and a slow inference of more than 300 ms for the densenet_161 demo model.

Thanks for the help and advice you could give me.