triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Prebuilt Triton Server 24.05-trtllm-python-py3 does not have correct TensorRT version #7374

Open CarterYancey opened 4 months ago

CarterYancey commented 4 months ago

Description According to the framework support matrix (https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html#framework-matrix-2024), 24.05 is supposed to ship TensorRT 10.0.1.6. The 24.05-py3 image does, but the prebuilt image with tensorrtllm_backend support does not.

Triton Information What version of Triton are you using? 24.05-trtllm-python-py3

Are you using the Triton container or did you build it yourself? Prebuilt.

To Reproduce:

```
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 bash
```

Within the container:

```
dpkg -la | grep -i tensorrt
ii  libnvinfer-bin                  8.6.3.1-1+cuda12.0                          amd64        TensorRT binaries
ii  libnvinfer-dev                  8.6.3.1-1+cuda12.0                          amd64        TensorRT development libraries
ii  libnvinfer-dispatch-dev         8.6.3.1-1+cuda12.0                          amd64        TensorRT development dispatch runtime libraries
ii  libnvinfer-dispatch8            8.6.3.1-1+cuda12.0                          amd64        TensorRT dispatch runtime library
ii  libnvinfer-headers-dev          8.6.3.1-1+cuda12.0                          amd64        TensorRT development headers
ii  libnvinfer-headers-plugin-dev   8.6.3.1-1+cuda12.0                          amd64        TensorRT plugin headers
ii  libnvinfer-lean-dev             8.6.3.1-1+cuda12.0                          amd64        TensorRT lean runtime libraries
ii  libnvinfer-lean8                8.6.3.1-1+cuda12.0                          amd64        TensorRT lean runtime library
ii  libnvinfer-plugin-dev           8.6.3.1-1+cuda12.0                          amd64        TensorRT plugin libraries
ii  libnvinfer-plugin8              8.6.3.1-1+cuda12.0                          amd64        TensorRT plugin libraries
ii  libnvinfer-vc-plugin-dev        8.6.3.1-1+cuda12.0                          amd64        TensorRT vc-plugin library
ii  libnvinfer-vc-plugin8           8.6.3.1-1+cuda12.0                          amd64        TensorRT vc-plugin library
ii  libnvinfer8                     8.6.3.1-1+cuda12.0                          amd64        TensorRT runtime libraries
ii  libnvonnxparsers-dev            8.6.3.1-1+cuda12.0                          amd64        TensorRT ONNX libraries
ii  libnvonnxparsers8               8.6.3.1-1+cuda12.0                          amd64        TensorRT ONNX libraries
ii  libnvparsers-dev                8.6.3.1-1+cuda12.0                          amd64        TensorRT parsers libraries
ii  libnvparsers8                   8.6.3.1-1+cuda12.0                          amd64        TensorRT parsers libraries
```

Compare this to 24.05-py3:

```
sudo docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /tensorrtllm_backend/:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.05-py3 bash
```

Within the container:

```
dpkg -la | grep -i tensorrt
ii  libnvinfer-bin                  10.0.1.6-1+cuda12.4                     amd64        TensorRT binaries
ii  libnvinfer-dev                  10.0.1.6-1+cuda12.4                     amd64        TensorRT development libraries
ii  libnvinfer-dispatch-dev         10.0.1.6-1+cuda12.4                     amd64        TensorRT development dispatch runtime libraries
ii  libnvinfer-dispatch10           10.0.1.6-1+cuda12.4                     amd64        TensorRT dispatch runtime library
ii  libnvinfer-headers-dev          10.0.1.6-1+cuda12.4                     amd64        TensorRT development headers
ii  libnvinfer-headers-plugin-dev   10.0.1.6-1+cuda12.4                     amd64        TensorRT plugin headers
ii  libnvinfer-lean-dev             10.0.1.6-1+cuda12.4                     amd64        TensorRT lean runtime libraries
ii  libnvinfer-lean10               10.0.1.6-1+cuda12.4                     amd64        TensorRT lean runtime library
ii  libnvinfer-plugin-dev           10.0.1.6-1+cuda12.4                     amd64        TensorRT plugin libraries
ii  libnvinfer-plugin10             10.0.1.6-1+cuda12.4                     amd64        TensorRT plugin libraries
ii  libnvinfer-vc-plugin-dev        10.0.1.6-1+cuda12.4                     amd64        TensorRT vc-plugin library
ii  libnvinfer-vc-plugin10          10.0.1.6-1+cuda12.4                     amd64        TensorRT vc-plugin library
ii  libnvinfer10                    10.0.1.6-1+cuda12.4                     amd64        TensorRT runtime libraries
ii  libnvonnxparsers-dev            10.0.1.6-1+cuda12.4                     amd64        TensorRT ONNX libraries
ii  libnvonnxparsers10              10.0.1.6-1+cuda12.4                     amd64        TensorRT ONNX libraries
ii  tensorrt-dev                    10.0.1.6-1+cuda12.4                     amd64        Meta package for TensorRT development libraries
```

Expected behavior 24.05-trtllm-python-py3 should have the correct version of TensorRT.

forwardmeasure commented 4 months ago

I've run into this myself. I'm attempting to deploy Llama 3 and Gemma, and keep running into these issues when generating the engines.

Will there be an update/fix any time soon?

Thanks!

geraldstanje commented 4 months ago

Hi, I have the same issue. NVIDIA team, can you reply?

@forwardmeasure did you check if it works with nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 ?

forwardmeasure commented 4 months ago

@geraldstanje - I JUST got Gemma to deploy (yet to test), after much trial and error. Getting to the correct version combination for Triton, TensorRT, TensorRT-LLM, and tensorrtllm_backend involved way too much trial and error for my liking, but I am glad I'm over this hump. If the process so far is any indicator, I expect there will be a ton more before I am running inference properly ;-).

For what it's worth, I am attaching a tar file containing Dockerfiles and build scripts that worked for me. You need to plug in your own Hugging Face token (for downloading weights) and Docker registry details. I hope it helps.

tensorrt-llm-models-build.tar.gz

CarterYancey commented 4 months ago

> Getting to the correct version combination for Triton, TensorRT, TensorRT-LLM, and tensorrtllm_backend involved way too much trial and error for my liking, but I am glad I'm over this hump. If the process so far is any indicator, I expect there will be a ton more before I am running inference properly ;-).

If it makes you feel better, I also spent several hours getting to inference, mostly thanks to version mismatches. This is especially complicated when the documentation claims to support versions that are not actually present in the containers. I was able to host a compiled Mistral model on a Triton server by using the v0.8.0 branch of tensorrtllm_backend (whose tensorrt-llm submodule is pinned to the v0.8.0 tag) together with nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3.

A few recommendations (see the sketch below):

- Do all of your model compiling and deploying in the same container, to ensure versions stay consistent.
- Edit requirements.txt files to install exact versions whenever they don't already.
- Use both dpkg -la and pip list to verify that versions are correct, rather than assuming the proper versions are installed just because you are using prebuilt containers or running pip install -r blindly.
- Whenever someone on the forums links to documentation on GitHub, make sure you are looking at the version of that documentation dated to the referenced comment (and that you are using all the same versions mentioned in the comment), otherwise you are likely to follow the wrong procedures.
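For illustration, a minimal sketch of those checks run inside whichever prebuilt container you plan to serve from; the pinned version shown is only an example, not a prescription:

```bash
# What dpkg thinks is installed (system-level TensorRT packages).
dpkg -la | grep -i tensorrt

# What the Python side will actually import; this is what the TRT-LLM
# tooling and backend really use.
pip list | grep -i tensorrt

# In any requirements.txt you install from, pin exact versions, e.g.:
#   tensorrt_llm==0.8.0
```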

forwardmeasure commented 4 months ago

Thanks, @CarterYancey - good to know I wasn't the only one suffering. I did make sure the versions were the same. As you pointed out, the documentation is not accurate, and that caused me a ton of pain. I am happy to report that I am able to deploy Gemma and LLama3-70b correctly. I am able to download the weights from HuggingFace, convert them (using the convert scripts provided with each example), generate the engines, and finally serve them.

The combination that works for me is nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 + tensorrtllm_backend v0.10.0. The image was initially incorrect: contrary to what the documentation said, it was pegged to TensorRT-LLM v0.9.0. That seems to have been fixed as of June 30, and the image now actually uses v0.10.0, as stated in the documentation.

As a side note, for Llama3-70b, I needed to use 4 A100-40GB GPUs, since there's a scarcity of A100-80GB GPUs on GCP. To run it, I had to correctly set the world size to 4.
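For context, a hedged sketch of what building and serving with tensor parallelism (world size) 4 can look like with TensorRT-LLM v0.10.0; the model paths and mount points are illustrative assumptions, and exact flags may differ between releases:

```bash
# Shard the checkpoint for 4-way tensor parallelism, then build the engines.
python3 examples/llama/convert_checkpoint.py \
    --model_dir /models/llama3-70b-hf \
    --output_dir /models/llama3-70b-ckpt \
    --dtype float16 \
    --tp_size 4
trtllm-build \
    --checkpoint_dir /models/llama3-70b-ckpt \
    --output_dir /engines/llama3-70b \
    --gemm_plugin float16

# Launch Triton with one MPI rank per GPU; the world size must match the
# tp_size used when the engines were built.
python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size 4 \
    --model_repo /tensorrtllm_backend/triton_model_repo
```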

Next up - rebuild on cheaper L4/T4 GPUs to better understand the price/performance trade-offs.

Cheers.

bhavin192 commented 3 months ago

After spending almost 5 to 6 hours, the following combination worked for me: TensorRT-LLM 0.9.0 and nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3

Though nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 has TensorRT-LLM 0.10.0 and TensorRT 10.0.1.6, it was still failing with this when used with TensorRT-LLM 0.10.0:

```
[TensorRT-LLM][ERROR] 1: [stdArchiveReader.cpp::stdArchiveReaderInitCommon::47] Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 237, Serialized Engine Version: 238)
```
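That serialization assertion is the usual symptom of engines built with one TensorRT/TensorRT-LLM version being loaded by a runtime with another. A hedged way to compare the two sides (the submodule path is an assumption about where the backend repo is mounted, not taken from this thread):

```bash
# TensorRT-LLM and TensorRT versions inside the serving container.
python3 -c "import tensorrt_llm; print('tensorrt_llm', tensorrt_llm.__version__)"
python3 -c "import tensorrt; print('tensorrt', tensorrt.__version__)"

# The tag of the TensorRT-LLM checkout the engines were built from,
# assuming the backend repo (with its tensorrt_llm submodule) is mounted here.
git -C /tensorrtllm_backend/tensorrt_llm describe --tags

# If the two disagree, rebuild the engines with the matching checkout.
```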

Though the known issues section in the release notes of 24.06 says it has TensorRT-LLM 0.9.0, which is very confusing :smile: https://github.com/triton-inference-server/server/releases/tag/v2.47.0

I created a discussion with more details on what I was trying to do, but haven't gotten a response yet: https://github.com/triton-inference-server/server/discussions/7439

rmccorm4 commented 3 months ago

Hi everyone, thanks for raising the issue and sharing all of the details and debugging so far. @krishung5, can you help explain the subtle differences between TRT-LLM and TensorRT version requirements in each release, and sanity-check that our documented versions are correct for the mentioned releases?

krishung5 commented 3 months ago

Thanks everyone for sharing all the details. We are doing our best to improve the Triton TRT-LLM container, so any feedback is appreciated!

Regarding

> Though the known issues section in the release notes of 24.06 says it has TensorRT-LLM 0.9.0, which is very confusing 😄

We have fixed the version in the release notes. For 24.06, it should be v0.10.0, not 0.9.0. Thanks for the catch!

Regarding the original question about the incorrect TRT version in the 24.05 TRT-LLM container: it seems those leftover TRT 8.x files were not cleaned up properly. The container does have the correct version installed, which should be TRT 9.3.0 for the 24.05 TRT-LLM container:

```
root@0911136980f1:/usr/local/tensorrt# ls -l lib/
total 1676324
lrwxrwxrwx 1 root root         19 Jan 29 22:45 libnvinfer.so -> libnvinfer.so.9.3.0
lrwxrwxrwx 1 root root         19 Jan 29 22:45 libnvinfer.so.9 -> libnvinfer.so.9.3.0
-rwxr-xr-x 1 root root  252847768 Jan 29 22:45 libnvinfer.so.9.3.0
-rwxr-xr-x 1 root root 1382065736 Jan 29 22:46 libnvinfer_builder_resource.so.9.3.0
lrwxrwxrwx 1 root root         28 Jan 29 22:43 libnvinfer_dispatch.so -> libnvinfer_dispatch.so.9.3.0
lrwxrwxrwx 1 root root         28 Jan 29 22:43 libnvinfer_dispatch.so.9 -> libnvinfer_dispatch.so.9.3.0
```

The TRT installed under /usr/local/tensorrt is the version that Triton and the TRT-LLM backend actually use. We have fixed this and cleaned up the leftover files in a later release.
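One way to confirm which TensorRT the backend actually resolves at runtime, rather than which dpkg entries happen to exist, is to ask the dynamic linker; the backend library path below is an assumption based on the usual backend layout, not from this thread:

```bash
# The libnvinfer the TRT-LLM backend links against at runtime.
ldd /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | grep -i nvinfer

# The TensorRT actually shipped for Triton's use.
ls -l /usr/local/tensorrt/lib/ | grep libnvinfer.so
```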

The Triton TRT-LLM container is built from a different base image than the other Triton containers. This is to align with the dependencies that TensorRT-LLM supports. For example, the 24.07 TRT-LLM container actually uses the dependency stack of 24.05. TensorRT-LLM also requires a specific TRT version, so even when checking the support matrix for 24.05, there might still be some discrepancies.

In the latest release (24.07), we started including a Triton TRT-LLM Container Support Matrix section, and we hope that can help with the confusion regarding versioning.

The TRT-LLM engines and the TRT-LLM backend are tightly coupled. The TRT-LLM version used to build the engines must be the same as the one in the container. We started to include the TRT-LLM package as part of the TRT-LLM container to ease that pain point, and we recommend users build the TRT-LLM engines and run the models directly within the container to avoid any dependency issues.
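A hedged sketch of that single-container workflow, so the TensorRT-LLM version used to build the engines cannot drift from the one that serves them; the image tag, mount points, and paths are illustrative assumptions:

```bash
# Start the serving container once and do everything inside it.
docker run --rm -it --gpus all --shm-size=2g \
    -v /data/models:/models \
    nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 bash

# Inside the container: the preinstalled tensorrt_llm wheel both builds the
# engines and backs the runtime, so the versions match by construction.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
trtllm-build --checkpoint_dir /models/ckpt --output_dir /models/engines
tritonserver --model-repository=/models/triton_repo
```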

bhavin192 commented 3 months ago

Thank you, Kris, for the explanation! I noticed the change in the 24.07 changelog a few days back; this makes it prominent and is really helpful :100:

PS: I tagged you in a related discussion (https://github.com/triton-inference-server/server/discussions/7439#discussioncomment-10208410); if you could help with my questions there, that would also help others stumbling upon similar issues.