huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Guide on how to use TensorRT-LLM Backend #2466

Open michaelthreet opened 2 months ago

michaelthreet commented 2 months ago

Feature request

Does any documentation exist, or could documentation be added, on how to use the TensorRT-LLM backend? #2458 mentions that the TRT-LLM backend exists, and I can see that there's a Dockerfile for TRT-LLM, but I don't see any guides on how to build or use it.

Motivation

I would like to run TensorRT-LLM models using TGI.

Your contribution

I'm willing to test any builds/processes/pipelines that are available.

ErikKaum commented 2 months ago

Hi @michaelthreet 👋

Very good question. Indeed, we haven't yet documented well how the new backend design works. For now, the best guide is the info in the Dockerfile.

But I'll loop in @mfuntowicz; he can better point you in the right direction and lay out the system requirements 👍

mfuntowicz commented 2 months ago

Hi @michaelthreet - thanks for your interest in the TRTLLM backend.

The overall backend is pretty new and might suffer from unhandled edge cases, but it should be usable. I would advise moving to this branch, which refactors the backend to avoid all the locks and significantly improves overall throughput:

➡️ https://github.com/huggingface/text-generation-inference/pull/2357

As I mentioned, the overall backend is still WIP and I would not qualify it as "stable", so we do not offer prebuilt images yet. Still, it should be fairly easy to build the Docker container locally from the TGI repository:

docker build -t huggingface/text-generation-inference-trtllm:v2.1.1 -f backends/trtllm/Dockerfile .

Let us know if you encounter any issues while building 😊.

Finally, when you've got the container ready, you should be able to deploy it using the following:

docker run --gpus all --shm-size=16gb -v <host/path/to/engines/folder>:/repository huggingface/text-generation-inference-trtllm:v2.1.1 --tokenizer-name <model_id_or_path> /repository
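Once the container is up, a quick smoke test against the router's default port (3000) should confirm the server is responding - note you may need -p 3000:3000 or --network host on the docker run above for the port to be reachable from the host:

curl http://localhost:3000/info
# returns model/runtime metadata as JSON if the server started correctly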

Please let us know if you encounter any blockers; we're more than happy to help and to get your feedback.

michaelthreet commented 2 months ago

Thanks @mfuntowicz, that's all great info! I was able to build the image and run it, although with a modified command to account for the required args. I have a directory within the engine directory that contains the tokenizer, hence using /repository/tokenizer for the --tokenizer-name arg:

docker run --gpus all --shm-size=16gb -v </host/path/to/engines/folder>:/repository huggingface/text-generation-inference-trtllm:v2.1.1 --tokenizer-name /repository/tokenizer --model-id /repository --executor-worker /usr/local/tgi/bin/executorWorker
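For context, the mounted engines folder looks roughly like this (file names come from a typical trtllm-build output plus the copied tokenizer; the exact contents may differ):

ls <host/path/to/engines/folder>
# config.json  rank0.engine  tokenizer/
ls <host/path/to/engines/folder>/tokenizer
# tokenizer.json  tokenizer_config.json  special_tokens_map.json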

I'm seeing this error, however, and I'm assuming it's due to a mismatch between the TRT-LLM version the engine was compiled with and the version running in this TGI image.

[2024-08-29 16:11:07.548] [info] [ffi.cpp:75] Creating TensorRT-LLM Backend
[2024-08-29 16:11:07.548] [info] [backend.cpp:11] Initializing Backend...
[2024-08-29 16:11:07.631] [info] [backend.cpp:15] Backend Executor Version: 0.12.0.dev2024073000
[2024-08-29 16:11:07.631] [info] [backend.cpp:18] Detected 4 Nvidia GPU(s)
[2024-08-29 16:11:07.639] [info] [hardware.h:38] Detected sm_90 compute capabilities
[2024-08-29 16:11:07.639] [info] [backend.cpp:33] Detected single engine deployment, using leader mode
[TensorRT-LLM][INFO] Engine version 0.11.1.dev20240720 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 512
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 131072
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 1024
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 131071  = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 9996 MiB
[TensorRT-LLM][ERROR] IRuntime::deserializeCudaEngine: Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 238, Serialized Engine Version: 237)
Error: Runtime("[TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/usr/src/text-generation-inference/target/release/build/text-generation-backends-trtllm-27bca2115f4a55c3/out/build/_deps/trtllm-src/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)")

Is there a recommended TRT-LLM version? Or a way to make it compatible?

mfuntowicz commented 2 months ago

Awesome to hear it built successfully, and cool that you were able to figure out the required adaptations 😍.

Indeed, TensorRT-LLM engines are not necessarily compatible from one release to another 🤐

You can find the exact TRTLLM version we are building against here: https://github.com/huggingface/text-generation-inference/blob/main/backends/trtllm/cmake/trtllm.cmake#L26 - we should document this more clearly and potentially warn the user if a discrepancy is detected when loading the engine - adding it to my todo.

The commit a681853d3803ee5893307e812530b5e7004bb6e1 might correspond to TRTLLM 0.12.0.dev2024073000 if I'm not mistaken
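If it helps, you can double-check which TRTLLM version an engine was built with by looking at the version field the builder writes into the engine's config.json (path is a placeholder):

grep '"version"' <host/path/to/engines/folder>/config.json
# e.g. "version": "0.11.1.dev20240720" - as in your log above, which is older than the version we pin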

Please let me know if you need any additional follow up

michaelthreet commented 2 months ago

I was able to get the model to load by building a TensorRT-LLM engine (Llama 3.1 8B Instruct, for reference) with that matching TRTLLM version (0.12.0.dev2024073000) and the TRT-LLM llama example.
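For reference, I followed the standard llama example flow from the TRT-LLM repo, roughly like this (paths are placeholders and the exact flags can vary between TRT-LLM versions):

python convert_checkpoint.py --model_dir <path/to/Meta-Llama-3.1-8B-Instruct> --output_dir /tmp/llama_ckpt --dtype float16
trtllm-build --checkpoint_dir /tmp/llama_ckpt --output_dir <host/path/to/engines/folder> --gemm_plugin float16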

When I send requests to the /generate endpoint, however, I'm getting some odd behavior. For example:

curl -X 'POST' \
  'http://localhost:3000/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "My name is Olivier and I"
}'

{
  "generated_text": " them"
}
mfuntowicz commented 2 months ago

Argh, interesting... I'm developing with the same model and haven't seen this output.

Anyway, I'm going to dig into it tomorrow morning and will report back here. Sorry for the inconvenience @michaelthreet

michaelthreet commented 2 months ago

No worries! If you could share the model you're using (or commands you used to convert it) that might help as well. It could be that I missed a flag/parameter in the conversion process.

michaelthreet commented 2 months ago

Some (hopefully useful) follow-up: it looks like the /generate path only returns the final token in the generated_text field. The same thing also happens with the /v1/chat/completions path when the stream parameter is set to false instead of true. I've also noticed that finish_reason is always set to eos_token, even when hitting the max_new_tokens limit. Some examples below:

/generate with details: true and then details: false

curl -X 'POST' \
  'http://localhost:3000/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "Why is the sky blue?",
  "parameters": {
    "details": true,
    "max_new_tokens": 5,
    "temperature": 0.01
  }
}'

{
  "generated_text": " that",
  "details": {
    "finish_reason": "eos_token",
    "generated_tokens": 1,
    "seed": null,
    "prefill": [],
    "tokens": [
      {
        "id": 1115,
        "text": " This",
        "logprob": -3.9258082,
        "special": false
      },
      {
        "id": 374,
        "text": " is",
        "logprob": -3.9258082,
        "special": false
      },
      {
        "id": 264,
        "text": " a",
        "logprob": -3.9258082,
        "special": false
      },
      {
        "id": 3488,
        "text": " question",
        "logprob": -3.9258082,
        "special": false
      },
      {
        "id": 430,
        "text": " that",
        "logprob": -3.9258082,
        "special": false
      }
    ]
  }
}
curl -X 'POST' \
  'http://localhost:3000/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "Why is the sky blue?",
  "parameters": {
    "details": false,
    "max_new_tokens": 5,
    "temperature": 0.01
  }
}'

{
  "generated_text": " that"
}

/v1/chat/completions with stream: true and then stream: false

curl -X 'POST' \
  'http://localhost:3000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "logprobs": false,
  "max_tokens": 5,
  "messages": [
    {
      "role": "user",
      "content": "Why is the sky blue?"
    }
  ],
  "model": "tgi",
  "stop": null,
  "temperature": 0.01,
  "stream": true
}'

data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":"The"},"logprobs":null,"finish_reason":null}]}

data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":" sky"},"logprobs":null,"finish_reason":null}]}

data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":" appears"},"logprobs":null,"finish_reason":null}]}

data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":" blue"},"logprobs":null,"finish_reason":null}]}

data: {"object":"chat.completion.chunk","id":"","created":1725029518,"model":"/repository/tokenizer","system_fingerprint":"2.2.1-dev0-native","choices":[{"index":0,"delta":{"role":"assistant","content":" because"},"logprobs":null,"finish_reason":"eos_token"}]}

data: [DONE]
curl -X 'POST' \
  'http://localhost:3000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "logprobs": false,
  "max_tokens": 5,
  "messages": [
    {
      "role": "user",
      "content": "Why is the sky blue?"
    }
  ],
  "model": "tgi",
  "stop": null,
  "temperature": 0.01,
  "stream": false
}'

{
  "object": "chat.completion",
  "id": "",
  "created": 1725029551,
  "model": "/repository/tokenizer",
  "system_fingerprint": "2.2.1-dev0-native",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " because"
      },
      "logprobs": null,
      "finish_reason": "eos_token"
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 1,
    "total_tokens": 1
  }
}
mfuntowicz commented 2 months ago

Sorry for the delay @michaelthreet, I got sidetracked by something else.

I'm going to take a look tomorrow - thanks a ton for the additional inputs. I'll report back here shortly 🤗