Goal: Add automatic TRT LLM engine building for the hf:gpt2 source.
Steps:
1. docker pull nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3
2. Get triton_cli into the container (clone/copy it in, or mount it from the host).
3. pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com/ tensorrt-llm==0.7.0
4. cd triton_cli && pip install .
5. triton repo add -m gpt --source hf:gpt2 --backend tensorrtllm
6. triton server start
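
For convenience, the steps above can be consolidated roughly as below. This is a minimal sketch, assuming the triton_cli checkout sits in the current host directory and is mounted into the container; the mount path, --net host, and the other docker flags are illustrative assumptions, not requirements.

```bash
# Start the TRT LLM Triton container with GPU access and mount the CLI checkout.
# (Assumes ./triton_cli exists on the host; adjust paths/flags as needed.)
docker run -it --gpus all --net host \
  -v "$(pwd)/triton_cli:/workspace/triton_cli" \
  nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3

# Inside the container: install TRT LLM, then the CLI itself.
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com/ tensorrt-llm==0.7.0
cd /workspace/triton_cli && pip install .

# Generate a model repository (building the TRT LLM engine) for GPT-2, then serve it.
triton repo add -m gpt --source hf:gpt2 --backend tensorrtllm
triton server start
```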
Notes:
- The engine builder was created by merging all of the TRT LLM modules used to build the GPT engine. The file could likely be trimmed significantly since we use hard-coded values, but since we will probably move to optimum in the near future and delete the builder module anyway, trimming probably isn't worth the time.
Current status:
- (IFB models only) The server should launch successfully, but attempts to query it currently fail due to this issue, which has been reported on GitHub and in our Slack channels. I'll investigate this further over the coming days.
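
To reproduce the query failure, a request like the one below should surface the error. This assumes the server is on the default HTTP port 8000 and that the model accepts text_input/max_tokens via Triton's generate extension; both are my assumptions, not confirmed details of the failing setup.

```bash
# Hypothetical repro: hit the generate endpoint of the "gpt" model added above.
curl -X POST localhost:8000/v2/models/gpt/generate \
  -d '{"text_input": "machine learning is", "max_tokens": 16}'
```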