CarterYancey opened 3 weeks ago
I've run into this myself. I'm attempting to deploy Llama3 and Gemma, and keep running into these issues when generating the engines.
Will there be an update/fix any time soon?
Thanks!
Hi, I have the same issue. NVIDIA team, can you reply?
@forwardmeasure did you check if it works with nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 ?
@geraldstanje - I JUST got Gemma to deploy (yet to test), after much trial and error. Getting to the correct version-number combination for Triton, TensorRT, TensorRT-LLM, and tensorrt_llm_backend involved far too much trial and error for my liking, but I am glad I'm over this hump. If the process so far is any indicator, I expect there will be plenty more before I am running inference properly ;-).
For what it's worth, I am attaching a tar file containing Dockerfiles and build scripts that worked for me. You need to plug in your own Huggingface token (for downloading weights), and Docker registry details. I hope it helps.
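For reference, the general shape of such a build script looks like the sketch below. This is illustrative only (the names, tags, and registry are my placeholders, not the contents of the attached tar file); the point is that the Hugging Face token and registry details are injected from the environment rather than hard-coded.

```shell
# Hedged sketch only -- the real Dockerfiles/scripts are in the attached tar.
# Token and registry come from the environment; placeholders otherwise.
HF_TOKEN="${HF_TOKEN:-REPLACE_WITH_YOUR_HF_TOKEN}"
REGISTRY="${REGISTRY:-registry.example.com/your-project}"
IMAGE="$REGISTRY/gemma-trtllm:latest"

# Print the commands rather than running them, so the sketch is side-effect free.
echo docker build --build-arg HF_TOKEN="$HF_TOKEN" -t "$IMAGE" .
echo docker push "$IMAGE"
```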
If it makes you feel better, I also spent several hours trying to reach inference, mostly thanks to version mismatches. This is especially complicated when the documentation claims to support versions that are not actually present in the containers. I was able to host a compiled Mistral model on a Triton server by using the v0.8.0 branch of tensorrtllm_backend (which has a tensorrt-llm sub-repo pinned to the v0.8.0 tag) together with nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3. I recommend doing all of your model compiling and deploying in the same container, to ensure versions stay consistent. I also recommend editing requirements.txt files to pin exact versions wherever they don't already. It also helps to verify installed versions with dpkg -l and pip list, rather than assuming the right versions are present just because you are using prebuilt containers or ran pip install -r blindly. Lastly, whenever you are searching the forums for help and someone links to documentation on Git, make sure you are looking at the version of that documentation dated to the referenced comment (and that you are using all the same versions mentioned in the comment); otherwise you are likely to follow the wrong procedures.
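The version-verification advice above can be wrapped in a small helper. This is my own sketch, not from the thread; the package and version in the example are illustrative.

```shell
# Hedged helper (my own sketch): fail fast when an installed pip package does
# not match the version you expect, instead of discovering the mismatch at
# engine-build time.
check_pip_version() {
  pkg="$1"; expected="$2"
  # Pull the "Version:" field from pip's metadata for the package.
  actual="$(pip show "$pkg" 2>/dev/null | awk '/^Version:/ {print $2}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $pkg == $expected"
  else
    echo "MISMATCH: $pkg is '${actual:-not installed}', expected '$expected'" >&2
    return 1
  fi
}

# Example (package/version are illustrative):
# check_pip_version tensorrt_llm 0.10.0
```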
Thanks, @CarterYancey - good to know I wasn't the only one suffering. I did make sure the versions were the same. As you pointed out, the documentation is not accurate, and that caused me a ton of pain. I am happy to report that I can now deploy Gemma and Llama3-70b correctly: download the weights from HuggingFace, convert them (using the convert scripts provided with each example), generate the engines, and finally serve them.
The combination that works for me is nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 + tensorrtllm_backend v0.10.0. The 24.06 image was initially incorrect: it shipped the wrong TensorRT-LLM version, pegged to v0.9.0, contrary to what the documentation said. That seems to have been fixed as of June 30, and the image now actually matches v0.10.0 of tensorrtllm_backend, as stated in the documentation.
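To avoid re-discovering these pairings, the image-to-backend mapping can be encoded in a tiny helper before cloning. The mapping below contains only the two combinations reported in this thread; it is a sketch, not an official compatibility table.

```shell
# Hedged sketch: pick the tensorrtllm_backend tag that matches a given Triton
# image tag before cloning, so the backend checkout and the container's
# TensorRT-LLM stay in lockstep. Mapping covers only combinations from this thread.
backend_tag_for_image() {
  case "$1" in
    24.02-trtllm-python-py3) echo "v0.8.0" ;;
    24.06-trtllm-python-py3) echo "v0.10.0" ;;
    *) echo "no known-good backend tag for image '$1'" >&2; return 1 ;;
  esac
}

backend_tag_for_image 24.06-trtllm-python-py3
# then:
# git clone -b "$(backend_tag_for_image 24.06-trtllm-python-py3)" \
#   https://github.com/triton-inference-server/tensorrtllm_backend.git
```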
As a side note, for Llama3-70b I needed to use 4 A100-40GB GPUs, since A100-80GB GPUs are scarce on GCP. To run it, I had to set the world size to 4.
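A quick back-of-envelope check shows why four 40 GB cards are needed. The arithmetic below is my own, not from the comment above, and ignores KV cache and activation memory, which add further headroom requirements.

```shell
# Hedged back-of-envelope check: fp16 weights take ~2 bytes per parameter, so a
# 70B-parameter model needs ~140 GB for weights alone -- more than one A100-40GB,
# hence sharding across 4 GPUs (tensor parallelism / world size 4).
params_billion=70
bytes_per_param=2     # fp16
tp=4                  # world size
gpu_mem_gb=40         # A100-40GB

weights_gb=$(( params_billion * bytes_per_param ))   # ~140 GB total
per_gpu_gb=$(( weights_gb / tp ))                    # ~35 GB per GPU, before KV cache
echo "weights: ${weights_gb} GB total, ${per_gpu_gb} GB per GPU (limit ${gpu_mem_gb} GB)"

# With the engine built for tp=4, the server launch must use a matching world
# size, e.g. via the script shipped in tensorrtllm_backend (model repo path is
# a placeholder):
# python3 scripts/launch_triton_server.py --world_size 4 --model_repo <your_repo>
```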
Next up - rebuild on cheaper L4/T4 GPUs to better understand the price/performance trade-offs.
Cheers.
Description
According to the Framework matrix (https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html#framework-matrix-2024), 24.05 is supposed to support TensorRT 10.0.6.1. The 24.05-py3 image does, but the prebuilt image with tensorrtllm_backend support does not.
Triton Information
What version of Triton are you using? 24.05-trtllm-python-py3
Are you using the Triton container or did you build it yourself? Prebuilt.
To Reproduce
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 bash
Within the container:
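(The exact check did not survive in this copy of the report. A plausible reconstruction, not the reporter's verbatim commands, is to print the TensorRT version the image actually ships and compare it to the support matrix:)

```shell
# Hedged reconstruction: report the TensorRT version inside the container and
# compare it to the support-matrix expectation for 24.05.
check_trt() {
  expected="$1"
  actual="$(python3 -c 'import tensorrt; print(tensorrt.__version__)' 2>/dev/null || true)"
  if [ "$actual" = "$expected" ]; then
    echo "TensorRT matches the support matrix ($expected)"
  else
    echo "TensorRT is '${actual:-not importable}', expected $expected"
    return 1
  fi
}
check_trt 10.0.6.1 || true
```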
Compare this to 24.05-py3:
sudo docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /tensorrtllm_backend/:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.05-py3 bash
Within the container:
Expected behavior
24.05-trtllm-python-py3 should have the correct version of TensorRT.