Open hahmad2008 opened 1 month ago
Could you give a comparison of the differences you are seeing? The KV cache is always used on supported models. Note that TTFT is not influenced by the KV cache; only decoding is.
Or did you mean prefix caching? (That is supported in TGI >= 2.3.0.)
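To illustrate the point about TTFT, here is a toy cost model, not TGI code. It counts how many token positions need their keys and values computed at each generation step. The function name `kv_computations` and the numbers are made up for illustration, and the model ignores batching, attention cost, and prefix caching.

```python
def kv_computations(prompt_len: int, new_tokens: int, use_cache: bool) -> list[int]:
    """Count token positions whose K/V must be computed at each step.

    Step 0 is the prefill that produces the first token; every later
    step is a decode step. Hypothetical helper for illustration only.
    """
    # Prefill always has to process the whole prompt, cache or not.
    counts = [prompt_len]
    for step in range(1, new_tokens):
        seq_len = prompt_len + step
        # With a cache only the newest token is processed;
        # without one the whole sequence is recomputed.
        counts.append(1 if use_cache else seq_len)
    return counts


with_cache = kv_computations(prompt_len=1000, new_tokens=4, use_cache=True)
without_cache = kv_computations(prompt_len=1000, new_tokens=4, use_cache=False)

print("with cache:   ", with_cache)
print("without cache:", without_cache)
```

In this model the first entry (the work done before the first token) is the same in both cases, so the cache only changes the cost of the decode steps that follow. Prefix caching is different: it can skip prefill work, but only for a prompt prefix that an earlier request has already processed.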
System Info
TGI 2.3.0
Reproduction
The TTFT is much slower than with vLLM. Can it be improved? If so, how do I turn on the KV cache when launching a model?
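For a concrete comparison, one option is to time the same prompt against both servers and report TTFT separately from per-token decode latency. The sketch below assumes only that the client exposes the response as an iterable of streamed tokens. `time_stream` is a hypothetical helper, not part of TGI or vLLM, and the demo uses a fake clock in place of a real request.

```python
import time
from typing import Callable, Iterable, Tuple


def time_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (ttft, mean_inter_token_latency) in seconds for a token stream."""
    start = clock()
    arrivals = []
    for _ in tokens:
        # Record the arrival time of every streamed token.
        arrivals.append(clock())
    if not arrivals:
        raise ValueError("stream produced no tokens")
    ttft = arrivals[0] - start
    # Gaps between consecutive tokens approximate the decode latency.
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_gap


# Demo with a fake clock so the example runs without a server.
fake_times = iter([0.0, 0.5, 0.6, 0.7])
ttft, mean_gap = time_stream(["a", "b", "c"], clock=lambda: next(fake_times))
print(f"TTFT: {ttft:.3f}s, mean inter-token latency: {mean_gap:.3f}s")
```

To use it for real, pass the token iterator returned by the streaming client and keep the default clock. The iterator should be created inside the call or immediately before it, so that the request time is included in TTFT.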
Expected behavior
Improve the TTFT and latency