triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Memory available for KV cache using the Triton TRT-LLM backend is lower than using TRT-LLM directly #366

Open UnyieldingOrca opened 3 months ago

UnyieldingOrca commented 3 months ago

System Info

EC2 instance: g5.12xlarge; AMI: ami-0d8667b0f72471655

Who can help?

Hi, I'm writing to ask about a discrepancy I'm seeing when trying to run Mistral-7B on multiple GPUs using Triton with the TRT-LLM backend. I can successfully compile and run the model with TRT-LLM directly using https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/run.py, but the model fails to load when using the provided scripts/launch_triton_server.py script, with the following error:

[TensorRT-LLM][INFO] Allocate 956301312 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 29184 total tokens in paged KV cache, and 272 blocks per sequence
E0306 16:40:07.708802 117 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: maxTokensInPagedKvCache (29184) must be large enough to process at least 1 sequence to completion (i.e. must be larger than beam_width (1) * tokensPerBlock (128) * maxBlocksPerSeq (272))
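For reference, the check fails because running a single sequence to completion needs beam_width × tokensPerBlock × maxBlocksPerSeq = 1 × 128 × 272 = 34,816 tokens, while only 29,184 tokens fit in the cache that was allocated.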

Here I am using the default values for the KV cache size.

The model runs fine when using https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/run.py and, interestingly, reports the following KV cache size:

[TensorRT-LLM][INFO] Allocate 7767851008 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 237056 tokens in paged KV cache.

This cache size is about 8x larger than what is reported by Triton.

When monitoring nvidia-smi I noticed 16 tritonserver processes listed. I modified scripts/launch_triton_server.py to set CUDA_VISIBLE_DEVICES={RANK} for each rank; the number of listed processes dropped to 4, the model loaded, and I was able to call the endpoint with an example query.
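(nvidia-smi lists a process once per GPU it holds a CUDA context on, so 4 ranks with contexts on all 4 GPUs show up as 16 entries, and each extra context consumes memory that the KV cache manager can then no longer use.) Roughly, the edit looks like the sketch below; this is a simplified illustration of how the script builds the mpirun command, with an illustrative helper name and arguments, not the script's exact code:

def get_mpirun_cmd(world_size, model_repo):
    # Simplified sketch of how launch_triton_server.py assembles the mpirun
    # command; names and arguments are illustrative, not the exact upstream code.
    cmd = ['mpirun', '--allow-run-as-root']
    for i in range(world_size):
        if i != 0:
            cmd += [':']  # mpirun separates per-rank app contexts with ':'
        cmd += ['-n', '1',
                # the added flag: pin rank i to GPU i so each rank only opens
                # a CUDA context (and consumes memory) on its own device
                '-x', f'CUDA_VISIBLE_DEVICES={i}',
                'tritonserver', f'--model-repository={model_repo}']
    return cmd

# e.g. subprocess.Popen(get_mpirun_cmd(4, '/model-repo'))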

With this fix, the following KV cache size was reported:

[TensorRT-LLM][INFO] Allocate 1862270976 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 56832 total tokens in paged KV cache, and 272 blocks per sequence

This cache size is about 2x larger than without my edit to the launch server script, but still about 4x smaller than when running TRT-LLM directly.

I got this model to work with the values provided below, but I wanted to post to ask whether this discrepancy is expected and whether my change to launch_triton_server.py is valid and should perhaps be upstreamed in the repo.

@kaiyux @juney-nvidia

Information

Tasks

Reproduction

Setup:

git clone https://github.com/NVIDIA/TensorRT-LLM.git && cd TensorRT-LLM
git submodule update --init --recursive
git lfs install && git lfs pull
make -C docker release_build CUDA_ARCHS="86-real"
make -C docker release_run

Compile model:

cd examples/llama/
pip install -r requirements.txt
huggingface-cli download mistralai/Mistral-7B-v0.1 --cache-dir ./mistral-7b-cache --local-dir ./mistral-7b-hf --local-dir-use-symlinks False
python convert_checkpoint.py --model_dir ./mistral-7b-hf/ --output_dir ./tllm_checkpoints/mistral-7b/tp4 --dtype float16 --tp_size 4
trtllm-build --checkpoint_dir ./tllm_checkpoints/mistral-7b/tp4 --output_dir ./trt_engines/mistral-7b/tp4 --gemm_plugin float16 --workers 4 --use_custom_all_reduce disable --max_input_len 32768 --max_output_len 2000 --max_batch_size 16
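(Note that the 272 blocks per sequence reported in the error above appears to follow from these build settings: max_input_len 32768 + max_output_len 2000 = 34,768 tokens per sequence, which at 128 tokens per block rounds up to 272 blocks, i.e. 34,816 tokens needed to fit one full sequence in the paged KV cache.)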

Command for running TRT-LLM directly

mpirun -n 4 python3 ../run.py --engine_dir ./trt_engines/mistral-7b/tp4 --tokenizer_dir ./mistral-7b-hf/ --max_output_len 8 --input_text "I love french quiche"

Commands for running triton

docker run -v {model-repo-dir}:/model-repo --gpus all --rm --shm-size 32G -p8000:8000 -it nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 /bin/bash

## Manually copy in launch_triton_server.py script
python3 ./launch_triton_server.py --world_size 4 --model_repo /model-repo

## Edit server script to include `-x f'CUDA_VISIBLE_DEVICES={i}'` in the mpirun args
python3 ./launch_triton_server.py --world_size 4 --model_repo /model-repo

Expected behavior

I would expect the available memory for the KV cache to be the same between running TRT-LLM directly and running Triton with the TRT-LLM backend.

Actual behavior

Using the official script the KV cache size is about 8x smaller; with my modification it is still about 4x smaller.

Additional notes

byshiue commented 3 months ago

The KV cache size is controlled by max_tokens_in_paged_kv_cache and kv_cache_free_gpu_mem_fraction, described in the backend documentation. Please try setting them to appropriate values.
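For reference, these are set as parameters on the tensorrt_llm model; a sketch of what the relevant entries in its config.pbtxt typically look like is below (the values are just examples; check the config.pbtxt shipped with the backend for the exact keys):

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "65536"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}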

UnyieldingOrca commented 3 months ago

Hi, for all tests kv_cache_free_gpu_mem_fraction was set to 0.9 and GPU memory utilization was near 100%.

byshiue commented 3 months ago

GPU memory utilization is near 100% because the KV cache manager allocates 90% of the free memory for the KV cache. If you don't want to use that much memory for the KV cache, you should adjust that value.