Goal: Add automatic TRT LLM engine building for the hf:gpt2 source.
Steps:
1. docker pull nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3
2. Get triton_cli into the container (clone/copy it in, or mount it from the host).
3. pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com/ tensorrt-llm==0.7.0
4. cd triton_cli && pip install .
5. triton repo add -m gpt --source hf:gpt2 --backend tensorrtllm
6. triton server start
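
For convenience, the steps above can be consolidated roughly as below. This is a minimal sketch, assuming the triton_cli checkout sits in the current host directory and is mounted into the container; the mount path, --net host, and the other docker flags are illustrative assumptions, not requirements.

```bash
# Start the TRT LLM Triton container with GPU access and mount the CLI checkout.
# (Assumes ./triton_cli exists on the host; adjust paths/flags as needed.)
docker run -it --gpus all --net host \
  -v "$(pwd)/triton_cli:/workspace/triton_cli" \
  nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3

# Inside the container: install TRT LLM, then the CLI itself.
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com/ tensorrt-llm==0.7.0
cd /workspace/triton_cli && pip install .

# Generate a model repository (building the TRT LLM engine) for GPT-2, then serve it.
triton repo add -m gpt --source hf:gpt2 --backend tensorrtllm
triton server start
```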
Notes:
- The engine builder was created by merging all of the TRT LLM modules used to build the GPT engine. The file could likely be trimmed significantly since we use hard-coded values, but since we will probably move to optimum in the near future and delete the builder module anyway, trimming probably isn't worth the time.
Current status:
- (IFB models only) The server should launch successfully, but attempts to query it currently fail due to this issue, which has been reported on GitHub and in our Slack channels. I'll investigate this further over the coming days.
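
To reproduce the query failure, a request like the one below should surface the error. This assumes the server is on the default HTTP port 8000 and that the model accepts text_input/max_tokens via Triton's generate extension; both are my assumptions, not confirmed details of the failing setup.

```bash
# Hypothetical repro: hit the generate endpoint of the "gpt" model added above.
curl -X POST localhost:8000/v2/models/gpt/generate \
  -d '{"text_input": "machine learning is", "max_tokens": 16}'
```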