UnyieldingOrca opened this issue 3 months ago
The KV cache size is controlled by max_tokens_in_paged_kv_cache
and kv_cache_free_gpu_mem_fraction,
described in the documentation. Please try setting them to proper values.
Hi, for all tests kv_cache_free_gpu_mem_fraction
was set to 0.9 and the GPU memory utilization was near 100%.
The GPU memory utilization is near 100% because the KV cache manager allocates 90% of free memory for the KV cache. If you don't want that much memory used for the KV cache, you should lower kv_cache_free_gpu_mem_fraction.
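For reference, both knobs are set as parameters in the tensorrtllm model's config.pbtxt inside the Triton model repository. A minimal sketch (the values here are only examples, not recommendations):

```
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "4096"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.5"
  }
}
```

If max_tokens_in_paged_kv_cache is set, it caps the cache directly; otherwise the fraction of free GPU memory governs the allocation.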
System Info
EC2 instance: g5.12xlarge
AMI: ami-0d8667b0f72471655
Who can help?
Hi, I'm writing to ask about a discrepancy I'm seeing when trying to run Mistral-7B on multiple GPUs using Triton with the TRT-LLM backend. I can successfully compile and run the model with TRT-LLM directly using https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/run.py. But the model fails to load when using the provided
scripts/launch_triton_server.py
script with the following error. Here I am using the default values for the KV cache size.
The model runs fine when using https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/run.py and, interestingly, reports the following KV cache size:
This cache size is about 8x larger than what is reported in Triton.
When monitoring nvidia-smi I noticed 16 tritonserver processes being listed. I modified
scripts/launch_triton_server.py
to set CUDA_VISIBLE_DEVICES={RANK} for each rank, and the number of listed processes dropped to 4; the model was then able to load and I was able to call the endpoint with an example query. With the fix the following KV cache size was reported:
This cache size is about 2x larger than without my edit to the launch server script, but still about 4x smaller than running with TRT-LLM directly.
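To make the change concrete, here is a hypothetical sketch of the per-rank pinning idea; the function name, argument names, and command layout are illustrative, not the actual contents of launch_triton_server.py:

```python
def build_mpirun_cmd(world_size, server_args):
    """Build an mpirun command that launches one tritonserver per rank,
    pinning each rank to a single GPU via CUDA_VISIBLE_DEVICES."""
    segments = []
    for rank in range(world_size):
        # -x exports the variable into that rank's environment only,
        # so each tritonserver process sees exactly one GPU.
        segments.append(
            f"-n 1 -x CUDA_VISIBLE_DEVICES={rank} tritonserver {server_args}"
        )
    # mpirun separates per-rank program specifications with ':'
    return "mpirun --allow-run-as-root " + " : ".join(segments)

print(build_mpirun_cmd(4, "--model-repository=/models"))
```

Without the pinning, every rank sees all 4 GPUs and each rank appears to initialize context on each device, which would explain the 16 processes in nvidia-smi and the extra memory consumed before the KV cache is sized.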
I got the model to work with the values provided below, but I wanted to post to see whether this discrepancy is expected and whether my change to launch_triton_server.py is valid and should perhaps be upstreamed.
@kaiyux @juney-nvidia
Information

Tasks

examples folder (such as GLUE/SQuAD, ...)

Reproduction
Setup:
Compile model:
Command for running TRT directly
Commands for running Triton
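For concreteness, the two launch paths looked roughly like the following; the engine, tokenizer, and model-repository paths are placeholders, not the exact values used:

```
# Run TRT-LLM directly: one mpirun over 4 ranks (placeholder paths)
mpirun -n 4 --allow-run-as-root \
    python3 examples/run.py --engine_dir <engine_dir> --tokenizer_dir <tokenizer_dir> --max_output_len 64

# Run Triton with the TRT-LLM backend; world_size matches the TP degree
python3 scripts/launch_triton_server.py --world_size 4 --model_repo <model_repo>
```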
Expected behavior
I would expect the available memory for the KV cache to be the same between running TRT-LLM directly and using Triton with the TRT-LLM backend.
Actual behavior
Using the official script the KV cache size is 16x smaller; with my modification it is still 4x smaller.
Additional notes