michaelthreet opened this issue 2 months ago
Hi @michaelthreet 👋
Very good questions! Indeed, we haven't yet documented well how the new backend design works; for now, the best guide is the information in the Dockerfile.
I'll loop in @mfuntowicz, who can point you in the right direction and explain the system requirements 👍
Hi @michaelthreet - thanks for your interest in the TRTLLM backend.
The overall backend is pretty new and might suffer from unhandled edge cases, but it should be usable. I would advise moving to this branch, which refactors the backend to avoid all the locks and significantly improves overall throughput:
➡️ https://github.com/huggingface/text-generation-inference/pull/2357
As I mentioned, the overall backend is still a work in progress and I would not qualify it as "stable", so we do not offer prebuilt images yet. Still, it should be fairly easy to build the Docker container locally from the TGI repository:
docker build -t huggingface/text-generation-inference-trtllm:v2.1.1 -f backends/trtllm/Dockerfile .
Let us know if you encounter any issues while building 😊.
Finally, when you've got the container ready, you should be able to deploy it using the following:
docker run --gpus all --shm-size=16gb -v <host/path/to/engines/folder>:/repository --tokenizer-name <model_id_or_path> /repository
Please let us know if you encounter any blockers - more than happy to help and to get your feedback.
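Once the container is up, a quick sanity check could look like the following - this assumes the router listens on its default port 3000 and that the port is reachable from the host (e.g. via -p 3000:3000 or --network host), so adapt as needed:
curl -s http://localhost:3000/info
curl -s http://localhost:3000/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "My name is Olivier and I"}'
The first call returns the model/router metadata, the second one a first generation.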
Thanks @mfuntowicz, that's all great info! I was able to build the image and run it, although with a modified command to account for the required args. I have a directory within the engine directory that contains the tokenizer, hence using /repository/tokenizer for the --tokenizer-name arg:
docker run --gpus all --shm-size=16gb -v </host/path/to/engines/folder>:/repository huggingface/text-generation-inference-trtllm:v2.1.1 --tokenizer-name /repository/tokenizer --model-id /repository --executor-worker /usr/local/tgi/bin/executorWorker
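For context, the mounted engines folder looks roughly like this (just my local layout, so file names may differ):
</host/path/to/engines/folder>
├── config.json
├── rank0.engine
└── tokenizer/
    ├── tokenizer.json
    └── tokenizer_config.json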
I'm seeing this error, however, and I'm assuming it's due to a mismatch in the TRT-LLM version the engine was compiled with and the version running in this TGI image.
[2024-08-29 16:11:07.548] [info] [ffi.cpp:75] Creating TensorRT-LLM Backend
[2024-08-29 16:11:07.548] [info] [backend.cpp:11] Initializing Backend...
[2024-08-29 16:11:07.631] [info] [backend.cpp:15] Backend Executor Version: 0.12.0.dev2024073000
[2024-08-29 16:11:07.631] [info] [backend.cpp:18] Detected 4 Nvidia GPU(s)
[2024-08-29 16:11:07.639] [info] [hardware.h:38] Detected sm_90 compute capabilities
[2024-08-29 16:11:07.639] [info] [backend.cpp:33] Detected single engine deployment, using leader mode
[TensorRT-LLM][INFO] Engine version 0.11.1.dev20240720 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 512
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 131072
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 1024
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 131071 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 9996 MiB
[TensorRT-LLM][ERROR] IRuntime::deserializeCudaEngine: Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 238, Serialized Engine Version: 237)
Error: Runtime("[TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/usr/src/text-generation-inference/target/release/build/text-generation-backends-trtllm-27bca2115f4a55c3/out/build/_deps/trtllm-src/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)")
Is there a recommended TRT-LLM version? Or a way to make it compatible?
Awesome to hear it built successfully, and cool that you were able to figure out the required adaptations 😍.
Indeed, TensorRT-LLM engines are not necessarily compatible from one release to another 🤐.
You can find the exact TRTLLM version we are building against here: https://github.com/huggingface/text-generation-inference/blob/main/backends/trtllm/cmake/trtllm.cmake#L26 - we should document this more clearly and potentially emit a warning if a discrepancy is detected when loading the engine, to better inform the user - adding it to my todo.
The commit a681853d3803ee5893307e812530b5e7004bb6e1 might correspond to TRTLLM 0.12.0.dev2024073000, if I'm not mistaken.
Please let me know if you need any additional follow-up.
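In the meantime, a quick way to spot a mismatch is to compare the version recorded in the engine's config.json against the "Backend Executor Version" line the backend prints at startup. For example, assuming jq is installed (the exact key layout may vary between TRTLLM releases):
jq -r '.version' </host/path/to/engines/folder>/config.json
If the two versions differ, rebuilding the engine with the matching TRTLLM release is the safest path.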
I was able to get it to load the model by building a TensorRT-LLM engine (Llama 3.1 8B Instruct, for reference) using that matched TRT-LLM version (0.12.0.dev2024073000) and the TRT-LLM llama example.
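For reference, the engine was built roughly along these lines with the llama example scripts - the flags below are from memory and the paths are placeholders, so they may differ depending on the TRT-LLM release:
python convert_checkpoint.py --model_dir ./Meta-Llama-3.1-8B-Instruct \
    --output_dir ./llama_ckpt --dtype float16
trtllm-build --checkpoint_dir ./llama_ckpt \
    --output_dir ./llama_engine --gemm_plugin float16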
When I send requests to the /generate endpoint, however, I'm getting some odd behavior. For example:
curl -X 'POST' \
'http://localhost:3000/generate' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"inputs": "My name is Olivier and I"
}'
{
"generated_text": " them"
}
Argh, interesting... I'm developing with the same model and haven't gotten this output.
Anyway, I'm going to dig in tomorrow morning and will report back here. Sorry for the inconvenience, @michaelthreet.
No worries! If you could share the model you're using (or commands you used to convert it) that might help as well. It could be that I missed a flag/parameter in the conversion process.
Some (hopefully useful) followup: it looks like the /generate path is only returning the final token in the generated_text field. The same thing also happens when using the /v1/chat/completions path with the stream parameter set to false instead of true. I've also noticed that the finish_reason is always set to eos_token, even when hitting the max_new_tokens limit. Some examples below:
/generate with details: true and then details: false
curl -X 'POST' \
'http://localhost:3000/generate' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"inputs": "Why is the sky blue?",
"parameters": {
"details": true,
"max_new_tokens": 5,
"temperature": 0.01
}
}'
{
"generated_text": " that",
"details": {
"finish_reason": "eos_token",
"generated_tokens": 1,
"seed": null,
"prefill": [],
"tokens": [
{
"id": 1115,
"text": " This",
"logprob": -3.9258082,
"special": false
},
{
"id": 374,
"text": " is",
"logprob": -3.9258082,
"special": false
},
{
"id": 264,
"text": " a",
"logprob": -3.9258082,
"special": false
},
{
"id": 3488,
"text": " question",
"logprob": -3.9258082,
"special": false
},
{
"id": 430,
"text": " that",
"logprob": -3.9258082,
"special": false
}
]
}
}
curl -X 'POST' \
'http://localhost:3000/generate' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"inputs": "Why is the sky blue?",
"parameters": {
"details": false,
"max_new_tokens": 5,
"temperature": 0.01
}
}'
{
"generated_text": " that"
}
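Interestingly, the full sequence does show up under details.tokens, so as a temporary client-side workaround something like this (assuming jq is available) reassembles the text:
curl -s 'http://localhost:3000/generate' \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Why is the sky blue?", "parameters": {"details": true, "max_new_tokens": 5, "temperature": 0.01}}' \
  | jq -r '[.details.tokens[].text] | join("")'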
/v1/chat/completions with stream: true and then stream: false
curl -X 'POST' \
'http://localhost:3000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"logprobs": false,
"max_tokens": 5,
"messages": [
{
"role": "user",
"content": "Why is the sky blue?"
}
],
"model": "tgi",
"stop": null,
"temperature": 0.01,
"stream": true
}'
data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":"The"},"logprobs":null,"finish_reason":null}]}
data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":" sky"},"logprobs":null,"finish_reason":null}]}
data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":" appears"},"logprobs":null,"finish_reason":null}]}
data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":" blue"},"logprobs":null,"finish_reason":null}]}
data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":" because"},"logprobs":null,"finish_reason":"eos_token"}]}
data: [DONE]
curl -X 'POST' \
'http://localhost:3000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"logprobs": false,
"max_tokens": 5,
"messages": [
{
"role": "user",
"content": "Why is the sky blue?"
}
],
"model": "tgi",
"stop": null,
"temperature": 0.01,
"stream": false
}'
{
"object": "chat.completion",
"id": "",
"created": 1725029551,
"model": "/repository/tokenizer",
"system_fingerprint": "2.2.1-dev0-native",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " because"
},
"logprobs": null,
"finish_reason": "eos_token"
}
],
"usage": {
"prompt_tokens": 0,
"completion_tokens": 1,
"total_tokens": 1
}
}
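Likewise, since the streaming path does return every token, concatenating the SSE deltas works as a stopgap for chat completions - again assuming jq, this just strips the data: prefixes and joins the content fields:
curl -sN 'http://localhost:3000/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "Why is the sky blue?"}], "max_tokens": 5, "temperature": 0.01, "stream": true}' \
  | sed -n 's/^data: //p' | grep -v '^\[DONE\]' \
  | jq -rj '.choices[0].delta.content // empty'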
Sorry for the delay @michaelthreet, I got sidetracked by something else.
Going to take a look tomorrow - thanks a ton for the additional inputs. Will report back here shortly 🤗
Feature request
Does any documentation exist, or would it be possible to add documentation, on how to use the TensorRT-LLM backend? #2458 mentions that the TRT-LLM backend exists, and I can see that there's a Dockerfile for TRT-LLM, but I don't see any guides on how to build or use it.
Motivation
I would like to run TensorRT-LLM models using TGI.
Your contribution
I'm willing to test any builds/processes/pipelines that are available.