[Triton 24.04] Bump TRT-LLM version to 0.9.0, add llama-3-8b, improve ergonomics of TRT-LLM engine building

rmccorm4 commented 5 months ago

Changelog

TRT-LLM 0.9.0 changes

Bump TRT-LLM version to 0.9.0
Remove gpt2 special exception
Update TRTLLM convert_checkpoint scripts
Update TRTLLM template models
Update custom Dockerfile for new TRTLLM and vLLM versions
Exclude checkpoint scripts from linters

Model Support:

Add llama-3-8b, llama-3-8b-instruct, and llama-2-7b-chat support for both vLLM and TRT-LLM

General improvements/ergonomics:

Skip convert_checkpoint.py step if converted weights already found
Check trtllm-build return code and raise on failure
Log the trtllm-build command used for debug/reproducibility
Use float16 dtype in trtllm-build by default to avoid degraded default accuracy. Will re-expose this through user-facing configs in the future.
Attempt to clean up failed trtllm models in the model repository if engine building fails, rather than leaving models in a broken state with their template fields left as templates
Call convert_checkpoint.py via subprocess to make sure weights loaded in GPU memory get cleaned up. This fixes OOM issues I was seeing locally when running the trtllm-build step.
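
For reviewers, the flow described in the list above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `run_logged` and `build_engine` are hypothetical names, and the convert/build commands are passed in as plain argument lists rather than reconstructed from the real scripts.

```python
import logging
import shlex
import shutil
import subprocess
from pathlib import Path

logger = logging.getLogger(__name__)


def run_logged(cmd):
    """Log a command, run it in a child process, and raise on non-zero exit.

    Hypothetical helper. Because the command runs in a separate process, any
    GPU memory it allocated is released when that process exits.
    """
    logger.info("Running: %s", shlex.join(cmd))
    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError(
            f"Command failed with exit code {result.returncode}: {shlex.join(cmd)}"
        )


def build_engine(convert_cmd, build_cmd, weights_dir, model_dir):
    """Convert weights (unless already present), then build the engine.

    Hypothetical function: the arguments stand in for whatever the CLI
    actually passes to convert_checkpoint.py and trtllm-build.
    """
    weights_dir = Path(weights_dir)
    model_dir = Path(model_dir)
    try:
        # Skip conversion when converted weights are already on disk.
        if weights_dir.is_dir() and any(weights_dir.iterdir()):
            logger.info("Converted weights found in %s, skipping conversion", weights_dir)
        else:
            run_logged(convert_cmd)
        run_logged(build_cmd)
    except Exception:
        # Remove the half-built model directory instead of leaving it broken.
        shutil.rmtree(model_dir, ignore_errors=True)
        raise
```

Re-raising after the cleanup keeps the original failure visible to the caller, while the logged command line can be copied and rerun by hand when debugging.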

Misc:

Bump CLI version to 0.0.7dev

Known Issues:

triton import -m {llama-3-8b,llama-3-8b-instruct} --backend tensorrtllm seems to build engines fine, but the corresponding 24.03 trtllm server image has issues loading the tokenizers. These issues are fixed by upgrading to the 24.04 trtllm image and v0.9.0:
On 24.03, the error persists even after following the recommendations to pip install sentencepiece and upgrade transformers:
```
Error: Couldn't instantiate the backend tokenizer from one of: ...
```
This works fine in 24.04

Examples

vLLM example:

triton import -m llama-3-8b-instruct --backend vllm
triton start --image nvcr.io/nvidia/tritonserver:24.04-vllm-python-py3
curl -X POST -s localhost:8000/v2/models/llama-3-8b-instruct/generate -d '{"text_input": "What is Computer Science?", "max_tokens": 256}' | jq

Output:

rmccormick@ced35d0-lcedt:~$ curl -X POST -s localhost:8000/v2/models/llama-3-8b-instruct/generate -d '{"text_input": "What is Computer Science?", "max_tokens": 256}' | jq
{
  "model_name": "llama-3-8b-instruct",
  "model_version": "1",
  "text_output": "What is Computer Science? - 30 Terms to Get You Started!\nComputer Science is a vast and diverse field that encompasses a wide range of topics. Here are 30 key terms to help you get started:\n1. **Algorithm**: A set of instructions written to solve a specific problem.\n2. **Programming**: Writing code to instruct a computer to perform a task.\n3. **Computer**: An electronic device that can process, store, and communicate data.\n4. **Data**: Unprocessed facts and figures.\n5. **Software**: Programs that run on a computer.\n6. **Hardware**: The physical components of a computer.\n7. **Network**: A group of interconnected devices that communicate with each other.\n8. **Database**: A collection of organized data.\n9. **Database Management System (DBMS)**: A software system that manages a database.\n10. **Programming Language**: A set of rules and instructions used to write code.\n11. **Variables**: Containers that store values.\n12. **Constants**: Unchanging values.\n13. **Control Flow**: The order in which a program executes statements.\n14. **Loops**: Repeating a sequence of statements.\n15. **Conditional Statements**: If-else statements that execute based on conditions.\n16. **Functions**: Reusable blocks"
}

TRT-LLM example:

# Use TRT-LLM container with all engine building and runtime dependencies in 24.04
docker run -ti --gpus all --network=host \
  --shm-size=1g --ulimit memlock=-1 \
  -v /tmp:/tmp \
  -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
  -v ${HOME}/models:/root/models \
  nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3

# Install Triton CLI
GIT_REF="rmccormick-trtllm-0.9"
pip install git+https://github.com/triton-inference-server/triton_cli.git@${GIT_REF}

# Download weights, convert checkpoint, build engine
triton import -m llama-3-8b-instruct --backend tensorrtllm

# Serve - inside TRTLLM container already
triton start

# Infer
curl -X POST -s localhost:8000/v2/models/llama-3-8b-instruct/generate -d '{"text_input": "What is Computer Science?", "max_tokens": 256}' | jq

Output:

rmccormick@ced35d0-lcedt:~$ curl -X POST -s localhost:8000/v2/models/llama-3-8b-instruct/generate -d '{"text_input": "What is Computer Science?", "max_tokens": 256}' | jq
{
  "context_logits": 0,
  "cum_log_probs": 0,
  "generation_logits": 0,
  "model_name": "llama-3-8b-instruct",
  "model_version": "1",
  "output_log_probs": [
    0,
    ...
    0
  ],
  "text_output": "What is Computer Science? Computer Science is the study of the theory, design, and implementation of computer systems and algorithms. It is a broad field that encompasses a wide range of topics, including computer hardware, software, and programming languages. Computer Science is a rapidly evolving field that has a significant impact on many aspects of modern life, including business, education, healthcare, and entertainment.\n\nWhat are the main areas of Computer Science? The main areas of Computer Science include:\n\n1. Algorithms: The study of algorithms is a fundamental part of Computer Science. Algorithms are step-by-step procedures for solving problems or performing tasks.\n2. Computer Architecture: This area of Computer Science deals with the design and organization of computer systems, including the hardware and software components.\n3. Computer Networks: This area of Computer Science focuses on the design and implementation of computer networks, including the protocols and algorithms used to communicate between devices.\n4. Database Systems: This area of Computer Science deals with the design and implementation of database systems, including the storage, retrieval, and manipulation of data.\n5. Human-Computer Interaction: This area of Computer Science focuses on the design and implementation of user interfaces and user experiences for computer systems and applications.\n6. Machine Learning: This area of Computer Science deals with the development of algorithms and models that enable"
}
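
For anyone who prefers to script the check instead of using curl, a minimal Python equivalent of the request above might look like the sketch below. It uses only the standard library. The function names `build_generate_request` and `generate` are hypothetical and not part of the CLI; the endpoint path, payload fields, and `text_output` key are taken from the examples above, and the server is assumed to be listening on localhost:8000.

```python
import json
import urllib.request


def build_generate_request(model, prompt, max_tokens=256,
                           base_url="http://localhost:8000"):
    """Build the POST request that the curl examples send (hypothetical helper)."""
    url = f"{base_url}/v2/models/{model}/generate"
    payload = {"text_input": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, max_tokens=256,
             base_url="http://localhost:8000", timeout=60):
    """Send the request and return the "text_output" field of the JSON response."""
    request = build_generate_request(model, prompt, max_tokens, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["text_output"]
```

With a server started as in either example, `generate("llama-3-8b-instruct", "What is Computer Science?")` should return the same text that appears under `text_output` in the curl response.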

Tests

TRTLLM locally

IMAGE_KIND=TRTLLM TRTLLM_MODEL=gpt2 pytest -vvv tests/
...
==== 46 passed, 4 skipped, 1 xfailed in 50.52s ====

IMAGE_KIND=TRTLLM TRTLLM_MODEL=opt125m pytest -vvv tests/
...
==== 46 passed, 4 skipped, 1 xfailed in 79.49s (0:01:19) ====

vLLM locally


IMAGE_KIND=VLLM VLLM_MODEL=gpt2 pytest -vvv tests/
...
==== 45 passed, 5 skipped, 1 xfailed in 123.73s (0:02:03) =====

rmccorm4 commented 5 months ago

This should be ~90% of the changes. Going to do some local testing and CI runs to see what falls out of it.

fpetrini15 commented 5 months ago

From what I've read of the ergonomics PR, it seems like this PR needs to be merged first. How did you want to do this? Did you want to back out the overlapping changes between the PRs or merge them and deal with testing and merge collisions in the other PR?

rmccorm4 commented 5 months ago

@fpetrini15 I ended up pulling that PR's changes into this one over the weekend, so I closed the other one and will just use this PR.

rmccorm4 commented 5 months ago

Pipelines looking good, 5/5 passes across all CLI jobs :+1:
