pommedeterresautee opened 6 months ago
Currently, the backend only supports decoder models.
Thank you a lot @byshiue for your answer. Is encoder support planned?
Our use case for encoder models is RAG-related. Vectorization is heavy on compute at indexation time, and reranking is quite heavy at inference time (depending on how many docs you rerank, obviously). I guess in 2024 there are plenty of companies building a RAG.
FWIW, on A10 GPUs we got a 2.2× speedup on batch 64 / seqlen 430 (on average) compared to PyTorch FP16 on reranking (cross-encoder), and, for our data, a 3.1× speedup on indexation (bi-encoder setup). So TRT-LLM in RAG (meaning support of encoder-only models) makes a lot of sense, and direct support from the backend would help.
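For context, the cross-encoder rerank step described above boils down to scoring (query, doc) pairs and sorting by score. A minimal sketch, where `score_pair` is a hypothetical stand-in for the actual model/engine call (here a toy word-overlap scorer, not the real cross-encoder):

```python
def rerank(query, docs, score_pair, top_k=5):
    """Score each (query, doc) pair and return the top_k docs by score."""
    scored = [(score_pair(query, d), d) for d in docs]
    scored.sort(key=lambda t: t[0], reverse=True)  # stable sort keeps ties in order
    return [d for _, d in scored[:top_k]]

# Toy scorer: word overlap between query and doc (stand-in for the cross-encoder).
def score_pair(q, d):
    return len(set(q.split()) & set(d.split()))

docs = ["apple pie recipe", "banana bread", "apple tart recipe"]
print(rerank("apple recipe", docs, score_pair, top_k=2))
# → ['apple pie recipe', 'apple tart recipe']
```

In a real setup, `score_pair` would batch pairs through the compiled engine; the speedup numbers above come from that scoring step dominating the latency.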
@pommedeterresautee Why don't you use TensorRT for the embedding model instead of TensorRT-LLM?
@byshiue Can't we just use chained models (ensemble) for any encoder-decoder model? I mean, the encoder's output serves as the input for the decoder, and this applies to the cross-attention layer as well, I guess. What constraints prevent us from using an encoder-decoder model here? Thanks in advance.
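The chaining being described would look like a standard Triton ensemble, where one step's output tensor feeds the next step's input. A hypothetical config.pbtxt sketch (all model and tensor names are made up for illustration, and this does not imply the backend currently supports it):

```
name: "encdec_ensemble"
platform: "ensemble"
max_batch_size: 8
input [ { name: "input_ids", data_type: TYPE_INT32, dims: [ -1 ] } ]
output [ { name: "output_ids", data_type: TYPE_INT32, dims: [ -1 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "encoder"
      model_version: -1
      input_map { key: "input_ids" value: "input_ids" }
      output_map { key: "encoder_hidden_states" value: "enc_out" }
    },
    {
      model_name: "decoder"
      model_version: -1
      input_map { key: "encoder_hidden_states" value: "enc_out" }
      output_map { key: "output_ids" value: "output_ids" }
    }
  ]
}
```

The catch with autoregressive decoding is that the decoder is called many times per request while the encoder runs once, which a plain feed-forward ensemble like this does not capture.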
@robosina It is not supported yet; it's not that it cannot be supported.
Hi @byshiue, is sequence classification with T5 models not supported yet?
I'd love to see this feature - is there anywhere I can track it?
@pommedeterresautee did you notice speedups when comparing TensorRT-LLM vs TensorRT (from transformer-deploy) or kernl?
On large batches, yes, but we are using custom code to reach peak performance.
System Info
Who can help?
As it s not obvious if this is a doc issue or a feature request: @ncomly-nvidia @juney-nvidia
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I have compiled a RoBERTa model for sequence classification with TensorRT-LLM. Accuracy is good, and so is performance. It follows the code from the examples folder of the TensorRT-LLM repo.
If I follow the recipe from https://github.com/NVIDIA/TensorRT-LLM/pull/778, Triton serves the model with the expected performance.
However, this PR relies on the tensorrt-llm package, which means using either a custom Python environment that is quite slow to load, or a custom image. If possible, I would prefer to use the vanilla image for maintenance reasons.
I tried to use the tensorrtllm backend directly, but it crashes whatever I tried. The /engines/model-ce/config.json contains:

However, it crashes (see below).
Is it even possible to use this backend for a BERT-like model?
With FasterTransformer development stopped, and the vanilla TensorRT example of BERT deployment being 2 years old, TensorRT-LLM seems to be the most up-to-date option for NLP models.
Expected behavior
It prints the IP and the port, and it serves the model.
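For reference, once the server is up, a classification request would go through Triton's KServe v2 HTTP inference API. A hypothetical payload (the tensor names, shape, and token ids are assumptions for illustration, not taken from the actual model config; "model-ce" matches the engine directory above):

```
POST /v2/models/model-ce/infer
{
  "inputs": [
    {
      "name": "input_ids",
      "datatype": "INT32",
      "shape": [1, 6],
      "data": [0, 713, 16, 10, 1296, 2]
    }
  ],
  "outputs": [ { "name": "logits" } ]
}
```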
actual behavior
Trying to load the server produces the following logs:
additional notes
N/A