triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

TensorRT-LLM support? #41

Closed LouisCastricato closed 3 months ago

LouisCastricato commented 8 months ago

Is your feature request related to a problem? Please describe.

I can't seem to find any examples of how to serve models that are built for TensorRT-LLM. Is this possible, and am I just missing the documentation for it?

Describe the solution you'd like

Either improve the documentation on how to use PyTriton with TensorRT-LLM, or explain why such a combination is undesirable or ill-formed.

Describe alternatives you've considered

I've looked at the OPT-Jax example and have begun experimenting with using a Jax port of LLaMA 2 with that example.

pziecina-nv commented 7 months ago

Hi, thank you for your feature request.

The TensorRT-LLM team's current guidance is to use the TensorRT-LLM Backend to run these models on Triton.

We're currently developing an example to demonstrate integrating TensorRT-LLM with PyTriton, which should help clarify this process.
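Pending that official example, here is a minimal sketch of what such an integration could look like using PyTriton's standard binding API. The `generate` function standing in for the TensorRT-LLM engine call is hypothetical, as are the model, tensor, and parameter names; only the PyTriton calls (`Triton`, `bind`, `serve`, `@batch`, `Tensor`, `ModelConfig`) reflect the documented interface.

```python
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


def generate(prompts):
    # Hypothetical stand-in for a call into a TensorRT-LLM runtime/engine.
    # The real example would load a built engine and return generated text.
    return [p + " [generated]" for p in prompts]


@batch
def infer_fn(prompts: np.ndarray):
    # PyTriton delivers inputs as batched numpy arrays; text arrives as bytes.
    decoded = [p[0].decode("utf-8") for p in prompts]
    generated = generate(decoded)
    encoded = np.char.encode(np.array([[g] for g in generated]), "utf-8")
    return {"outputs": encoded}


with Triton() as triton:
    # Bind the Python callable as a Triton model served over HTTP/gRPC.
    triton.bind(
        model_name="TensorRT_LLM_Model",
        infer_func=infer_fn,
        inputs=[Tensor(name="prompts", dtype=bytes, shape=(1,))],
        outputs=[Tensor(name="outputs", dtype=bytes, shape=(1,))],
        config=ModelConfig(max_batch_size=8),
    )
    triton.serve()
```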

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 6 months ago

This issue was closed because it has been stalled for 7 days with no activity.

piotrm-nvidia commented 3 months ago

TensorRT-LLM support is available starting from release 0.5.0. An example was created to showcase PyTriton usage with NVIDIA TensorRT-LLM.
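For reference, once such a model is bound and served, it can be queried with PyTriton's `ModelClient`. The host, model name, and tensor names below are assumptions matching the sketch above, not values from the official example.

```python
import numpy as np

from pytriton.client import ModelClient

# Host, model name, and tensor names are assumptions matching the sketch above.
with ModelClient("localhost", "TensorRT_LLM_Model") as client:
    prompts = np.char.encode(np.array([["What is PyTriton?"]]), "utf-8")
    result = client.infer_batch(prompts=prompts)
    print(result["outputs"][0, 0].decode("utf-8"))
```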