triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

[Question]What does the service parameter max_tokens_in_paged_kv_cache mean? #67

Open wjj19950828 opened 8 months ago

wjj19950828 commented 8 months ago

I have run through the entire LLaMA 2 pipeline and now want to stress test it and look at the benchmark metrics.

I'm not sure I fully understand the max_tokens_in_paged_kv_cache parameter.

Is it similar to the max_num_batched_tokens parameter of vllm?

Thanks~

idealover commented 8 months ago

Hi, which GPU did you use?

wjj19950828 commented 8 months ago

> Hi, which GPU did you use?

A100

byshiue commented 7 months ago

It means the maximum number of tokens we can store in our paged KV cache.
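To make the answer above concrete, here is a back-of-the-envelope sketch of what that cap corresponds to in memory. Everything in it is an illustrative assumption (a LLaMA-2-7B-like model shape in FP16, a 64-token block size, a 0.9 memory fraction), not the backend's actual accounting; it only shows why `max_tokens_in_paged_kv_cache` is a global cache-capacity limit rather than a per-batch token limit like vLLM's `max_num_batched_tokens`.

```python
# Hedged sketch (not the backend's real code): estimate how many tokens a
# paged KV cache can hold within a given memory budget. All model numbers
# are assumptions for a LLaMA-2-7B-like model in FP16.

def max_kv_cache_tokens(
    free_gpu_bytes: int,
    mem_fraction: float = 0.9,   # assumed fraction of free memory for the cache
    num_layers: int = 32,        # assumed decoder layers
    num_kv_heads: int = 32,      # assumed KV heads (no GQA)
    head_dim: int = 128,         # assumed head dimension
    dtype_bytes: int = 2,        # FP16
    tokens_per_block: int = 64,  # assumed paged-attention block size
) -> int:
    # Each cached token stores one key and one value vector per layer.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    budget = int(free_gpu_bytes * mem_fraction)
    # The cache is allocated in whole blocks, so capacity is block-granular.
    num_blocks = budget // (bytes_per_token * tokens_per_block)
    return num_blocks * tokens_per_block

# e.g. with ~60 GiB free on an A100-80GB:
print(max_kv_cache_tokens(60 * 1024**3))  # → 110592
```

Under these assumptions each token costs 0.5 MiB of KV state, so roughly 110k tokens fit; setting `max_tokens_in_paged_kv_cache` below that value simply reserves less memory for the cache and limits how many in-flight sequences' tokens can be resident at once.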