triton-inference-server / triton_cli

Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inference Server.

[Triton 24.04] Bump TRT-LLM version to 0.9.0, add llama-3-8b, improve ergonomics of TRT-LLM engine building #54

Closed by rmccorm4 5 months ago

rmccorm4 commented 5 months ago

Changelog

TRT-LLM 0.9.0 changes

Model Support:

General improvements/ergonomics:

Misc:

Known Issues:

Examples

vLLM example:

triton import -m llama-3-8b-instruct --backend vllm
triton start --image nvcr.io/nvidia/tritonserver:24.04-vllm-python-py3
curl -X POST -s localhost:8000/v2/models/llama-3-8b-instruct/generate -d '{"text_input": "What is Computer Science?", "max_tokens": 256}' | jq

Output:

rmccormick@ced35d0-lcedt:~$ curl -X POST -s localhost:8000/v2/models/llama-3-8b-instruct/generate -d '{"text_input": "What is Computer Science?", "max_tokens": 256}' | jq
{
  "model_name": "llama-3-8b-instruct",
  "model_version": "1",
  "text_output": "What is Computer Science? - 30 Terms to Get You Started!\nComputer Science is a vast and diverse field that encompasses a wide range of topics. Here are 30 key terms to help you get started:\n1. **Algorithm**: A set of instructions written to solve a specific problem.\n2. **Programming**: Writing code to instruct a computer to perform a task.\n3. **Computer**: An electronic device that can process, store, and communicate data.\n4. **Data**: Unprocessed facts and figures.\n5. **Software**: Programs that run on a computer.\n6. **Hardware**: The physical components of a computer.\n7. **Network**: A group of interconnected devices that communicate with each other.\n8. **Database**: A collection of organized data.\n9. **Database Management System (DBMS)**: A software system that manages a database.\n10. **Programming Language**: A set of rules and instructions used to write code.\n11. **Variables**: Containers that store values.\n12. **Constants**: Unchanging values.\n13. **Control Flow**: The order in which a program executes statements.\n14. **Loops**: Repeating a sequence of statements.\n15. **Conditional Statements**: If-else statements that execute based on conditions.\n16. **Functions**: Reusable blocks"
}
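
For scripted use, the same JSON body can be sent from Python instead of curl. This is a minimal sketch, assuming the server started above is listening on localhost:8000; the helper names `build_generate_request` and `generate` are illustrative and not part of Triton CLI:

```python
import json
import urllib.request


def build_generate_request(model, prompt, max_tokens=256,
                           base_url="http://localhost:8000"):
    """Build a POST request carrying the same JSON body as the curl command."""
    url = f"{base_url}/v2/models/{model}/generate"
    body = json.dumps({"text_input": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, max_tokens=256, base_url="http://localhost:8000"):
    """Send the request and return only the "text_output" field."""
    request = build_generate_request(model, prompt, max_tokens, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text_output"]
```

With the server running, `generate("llama-3-8b-instruct", "What is Computer Science?")` returns the generated text as a string.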

TRT-LLM example:

# Use TRT-LLM container with all engine building and runtime dependencies in 24.04
docker run -ti --gpus all --network=host \
  --shm-size=1g --ulimit memlock=-1 \
  -v /tmp:/tmp \
  -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
  -v ${HOME}/models:/root/models \
  nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3

# Install Triton CLI
GIT_REF="rmccormick-trtllm-0.9"
pip install git+https://github.com/triton-inference-server/triton_cli.git@${GIT_REF}

# Download weights, convert checkpoint, build engine
triton import -m llama-3-8b-instruct --backend tensorrtllm

# Serve; already inside the TRT-LLM container, so start the server directly
triton start

# Infer
curl -X POST -s localhost:8000/v2/models/llama-3-8b-instruct/generate -d '{"text_input": "What is Computer Science?", "max_tokens": 256}' | jq

Output:

rmccormick@ced35d0-lcedt:~$ curl -X POST -s localhost:8000/v2/models/llama-3-8b-instruct/generate -d '{"text_input": "What is Computer Science?", "max_tokens": 256}' | jq
{
  "context_logits": 0,
  "cum_log_probs": 0,
  "generation_logits": 0,
  "model_name": "llama-3-8b-instruct",
  "model_version": "1",
  "output_log_probs": [
    0,
    ...
    0
  ],
  "text_output": "What is Computer Science? Computer Science is the study of the theory, design, and implementation of computer systems and algorithms. It is a broad field that encompasses a wide range of topics, including computer hardware, software, and programming languages. Computer Science is a rapidly evolving field that has a significant impact on many aspects of modern life, including business, education, healthcare, and entertainment.\n\nWhat are the main areas of Computer Science? The main areas of Computer Science include:\n\n1. Algorithms: The study of algorithms is a fundamental part of Computer Science. Algorithms are step-by-step procedures for solving problems or performing tasks.\n2. Computer Architecture: This area of Computer Science deals with the design and organization of computer systems, including the hardware and software components.\n3. Computer Networks: This area of Computer Science focuses on the design and implementation of computer networks, including the protocols and algorithms used to communicate between devices.\n4. Database Systems: This area of Computer Science deals with the design and implementation of database systems, including the storage, retrieval, and manipulation of data.\n5. Human-Computer Interaction: This area of Computer Science focuses on the design and implementation of user interfaces and user experiences for computer systems and applications.\n6. Machine Learning: This area of Computer Science deals with the development of algorithms and models that enable"
}
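
In both outputs shown here, `text_output` begins with the prompt itself, and the TRT-LLM response also carries extra fields such as `output_log_probs`. A client that only wants the completion could post-process the parsed response along these lines. This is a rough sketch; `completion_only` is a hypothetical helper, not something Triton CLI provides:

```python
def completion_only(response, prompt):
    """Return the generated text without the echoed prompt.

    `response` is the parsed JSON dict from the generate endpoint;
    all fields other than "text_output" are ignored.
    """
    text = response["text_output"]
    # Strip the prompt only when the output actually starts with it.
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.lstrip()
```

If the output does not start with the prompt, the text is returned unchanged apart from leading whitespace.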

Tests

TRTLLM locally

IMAGE_KIND=TRTLLM TRTLLM_MODEL=gpt2 pytest -vvv tests/
...
==== 46 passed, 4 skipped, 1 xfailed in 50.52s ====
IMAGE_KIND=TRTLLM TRTLLM_MODEL=opt125m pytest -vvv tests/
...
==== 46 passed, 4 skipped, 1 xfailed in 79.49s (0:01:19) ====

vLLM locally

IMAGE_KIND=VLLM VLLM_MODEL=gpt2 pytest -vvv tests/
...
==== 45 passed, 5 skipped, 1 xfailed in 123.73s (0:02:03) =====

rmccorm4 commented 5 months ago

This should be ~90% of the changes. Going to do some local testing and CI runs to see what falls out of it.

fpetrini15 commented 5 months ago

From what I've read of the ergonomics PR, it seems like this PR needs to be merged first. How did you want to do this? Did you want to back out the overlapping changes between the PRs or merge them and deal with testing and merge collisions in the other PR?

rmccorm4 commented 5 months ago

@fpetrini15 I ended up pulling that PR's changes into this one over the weekend, so I closed the other one and will just use this PR.

rmccorm4 commented 5 months ago

Pipelines looking good, 5/5 passes across all CLI jobs :+1: