sachanub opened this issue 1 year ago
Thanks for reporting. This is interesting. I haven't tried this before. Will try it out and get back to you.
Thanks @agunapal . Please let me know if you require any more information from my end.
Hi @sachanub This is an issue in Torch-TensorRT: https://github.com/pytorch/TensorRT/issues/2319 cc: @lxning
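For context on why this manifests as everything landing on GPU 0: a serialized TensorRT engine is deserialized onto whichever CUDA device is current at load time, so `map_location` alone doesn't relocate it. A possible workaround in a custom handler (a minimal, untested sketch; `load_trt_model` and the `gpu_id` plumbing are illustrative, not TorchServe API) is to select the worker's device before loading:

```python
import torch

def load_trt_model(model_path: str, gpu_id: int):
    # TensorRT engines deserialize onto the *current* CUDA device,
    # so select the worker's GPU before torch.jit.load runs.
    torch.cuda.set_device(gpu_id)
    device = torch.device(f"cuda:{gpu_id}")
    # map_location handles the ordinary tensors in the TorchScript
    # module; set_device above handles the embedded TRT engine.
    return torch.jit.load(model_path, map_location=device).eval()
```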
🐛 Describe the bug
I am following this example to perform inference on TorchServe with a torch-tensorrt model: https://github.com/pytorch/serve/tree/master/examples/torch_tensorrt
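For reference, the linked example produces the serialized model with Torch-TensorRT roughly like this (a minimal sketch; the FP16 settings match the example, but the exact file name, batch size, and `ir` argument are assumptions about my setup):

```python
import torch
import torch_tensorrt
import torchvision.models as models

# Compile ResNet-50 to a Torch-TensorRT TorchScript module in FP16.
model = models.resnet50(weights="DEFAULT").eval().cuda()
trt_model = torch_tensorrt.compile(
    model,
    ir="ts",  # TorchScript frontend, so the result can be torch.jit.save'd
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.half)],
    enabled_precisions={torch.half},
)
torch.jit.save(trt_model, "res50_trt_fp16.ts")
```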
I am using a custom container (adapted from an existing TorchServe container) which has the following:
I am running this example on a g5dn.24xlarge EC2 instance. The expectation is that the model gets loaded on all 4 GPUs, with one worker each. Upon starting TorchServe, the model is loaded successfully and I can get the following inference output:
When I run `curl -X GET http://localhost:8081/models/res50-trt-fp16`, I get the following output:

From the above output, it appears that a worker is created on each GPU. However, the `memory.used` field is 5 MB for all GPUs except the one with id 9003 (which has `memory.used = 2152 MB`).

Running `nvidia-smi` leads to the following output:

As can be seen above, the memory usage is 5 MB for all GPUs except GPU 0.
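For completeness, the per-GPU worker assignment can be forced/confirmed through the standard TorchServe scale-workers call (model name as above):

```bash
# Ask TorchServe to keep 4 workers for the model and wait for them to spin up.
curl -X PUT "http://localhost:8081/models/res50-trt-fp16?min_worker=4&synchronous=true"
```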
Also, when I send an inference request, I see the following in the model server logs:
The warning here suggests that although the inference request is sent to the worker on GPU 1, it eventually gets redirected to GPU 0, which further suggests that the model is only loaded on 1 GPU, not all 4. Could you please investigate this issue? Thanks!
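To illustrate where the redirection likely comes from: TorchServe's default handler assigns each worker a device from the `gpu_id` it receives in the context and loads the model with `map_location`, roughly like the simplified sketch below (not the exact TorchServe source; the model file name is assumed). Per the Torch-TensorRT issue linked above, the embedded TRT engine is not relocated by `map_location`, so every worker ends up executing on the engine's original device:

```python
import os
import torch

class HandlerSketch:
    def initialize(self, context):
        # TorchServe passes the worker's assigned GPU id via system_properties.
        props = context.system_properties
        gpu_id = props.get("gpu_id")
        if torch.cuda.is_available() and gpu_id is not None:
            self.device = torch.device(f"cuda:{gpu_id}")
        else:
            self.device = torch.device("cpu")
        model_path = os.path.join(props.get("model_dir"), "res50_trt_fp16.ts")
        # map_location moves plain tensors, but not the serialized TensorRT
        # engine inside the TorchScript module, hence the fallback to GPU 0.
        self.model = torch.jit.load(model_path, map_location=self.device)
```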
Error logs
Pasted relevant logs above.
Installation instructions
Provided relevant information above.
Model Packaging
Followed this example: https://github.com/pytorch/serve/tree/master/examples/torch_tensorrt
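The packaging step is the standard `torch-model-archiver` call from that example (file names are the example's; yours may differ):

```bash
torch-model-archiver --model-name res50-trt-fp16 \
  --version 1.0 \
  --serialized-file res50_trt_fp16.ts \
  --handler image_classifier \
  --extra-files index_to_name.json
```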
config.properties
No response
Versions
Repro instructions
To reproduce, please follow this example: https://github.com/pytorch/serve/tree/master/examples/torch_tensorrt
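Concretely, after packaging as above, starting the server and querying worker placement looks roughly like this (model-store path and `.mar` name are assumptions matching the packaging step):

```bash
mkdir -p model_store
mv res50-trt-fp16.mar model_store/
torchserve --start --ncs --model-store model_store --models res50-trt-fp16.mar
# Check worker-to-GPU assignment:
curl -X GET http://localhost:8081/models/res50-trt-fp16
```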
Possible Solution
No response