huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

How to turn on the KV cache when serving a model? #2583

Open hahmad2008 opened 1 month ago

hahmad2008 commented 1 month ago

System Info

TGI 2.3.0

Reproduction

The TTFT is much slower than vLLM. Can it be improved? If so, how do I turn on the KV cache when launching a model?

model=HuggingFaceH4/zephyr-7b-beta
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id $model
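
For context, here is a minimal sketch of one way to measure TTFT against the running container (not the exact benchmark used). It assumes the server is reachable at localhost:8080 and uses curl's time_starttransfer (time to first byte of the /generate_stream response) as a rough proxy for time to first token:

# Rough TTFT probe: time to first byte of the streaming response
curl -s -o /dev/null \
    -w "TTFT (time to first byte): %{time_starttransfer}s\n" \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 64}}' \
    http://localhost:8080/generate_stream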

Expected behavior

TTFT and overall latency comparable to vLLM.

danieldk commented 4 days ago

Could you give a comparison of the differences you are seeing? The KV cache is always used on supported models. Note that TTFT is not influenced by the KV cache; only decoding is.

Or did you mean prefix caching? (Which is supported in TGI >= 2.3.0.)
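
If prefix caching is what you are after, a sketch of how it could be requested at launch, assuming the USE_PREFIX_CACHING environment variable (please verify the variable name and its default for your exact TGI version against the docs), reusing the launch command from the issue:

# Hypothetical launch with prefix caching requested via an env var
# (USE_PREFIX_CACHING is an assumption here; check the TGI 2.3.0 docs)
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    -e USE_PREFIX_CACHING=1 \
    ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id $model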