triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

[Question] Tensor parallelism for tensorrt_llm #79

Open JoeLiu996 opened 5 months ago

JoeLiu996 commented 5 months ago

Is your feature request related to a problem? Please describe. I am aware that PyTriton already has an example of using PyTriton with tensorrt_llm, but I noticed that the example only supports single-GPU inference. Are there any other examples or reference docs that use tensorrt_llm with PyTriton and support tensor parallelism?

Describe the solution you'd like The current example is excellent, but the docs would be more comprehensive with a multi-GPU (tensor-parallel) inference example added, since that is one of the most common use cases. A rough sketch of what I mean follows below.
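For illustration, here is a minimal sketch of the pattern the TensorRT-LLM multi-GPU examples typically follow, adapted to PyTriton: the engine is built with tensor parallelism (e.g. `--tp_size 2`), the script is launched under `mpirun`, rank 0 binds the PyTriton endpoint, and each request is broadcast so every rank joins the tensor-parallel forward pass. The engine path, model name, and generation parameters are placeholders, and the `ModelRunner` usage assumes the `tensorrt_llm` Python runtime; this is not an official PyTriton example.

```python
# Hypothetical launch: mpirun -n 2 python server.py
import numpy as np
import torch
from mpi4py import MPI

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

from tensorrt_llm.runtime import ModelRunner  # assumes the tensorrt_llm Python runtime

COMM = MPI.COMM_WORLD
RANK = COMM.Get_rank()

ENGINE_DIR = "trt_llm_engine_tp2"  # placeholder: engine built with --tp_size 2

# Each MPI rank loads its own shard of the tensor-parallel engine.
runner = ModelRunner.from_dir(engine_dir=ENGINE_DIR, rank=RANK)


def _generate(input_ids_batch):
    """Run generation collectively; all ranks must call this together."""
    batch_input_ids = [torch.tensor(ids, dtype=torch.int32) for ids in input_ids_batch]
    outputs = runner.generate(batch_input_ids, max_new_tokens=64)  # placeholder params
    # Output is [batch, beams, seq]; keep the first beam for a greedy setup.
    return outputs[:, 0, :].cpu().numpy()


if RANK == 0:
    @batch
    def infer_fn(input_ids):
        # Broadcast the request so worker ranks join the tensor-parallel pass.
        work = COMM.bcast(input_ids.tolist(), root=0)
        return {"output_ids": _generate(work)}

    with Triton() as triton:
        triton.bind(
            model_name="tensorrt_llm_tp",  # placeholder name
            infer_func=infer_fn,
            inputs=[Tensor(name="input_ids", dtype=np.int32, shape=(-1,))],
            outputs=[Tensor(name="output_ids", dtype=np.int32, shape=(-1,))],
            config=ModelConfig(max_batch_size=8),
        )
        triton.serve()
else:
    # Worker ranks wait for broadcast requests and execute the same generate call.
    while True:
        work = COMM.bcast(None, root=0)
        _generate(work)
```

The key design point is that only rank 0 owns the HTTP/gRPC endpoint, while the broadcast keeps all tensor-parallel ranks in lockstep for every `generate` call, which is how the standalone TensorRT-LLM multi-GPU run scripts behave as well.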

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days.