huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

FP8 KV-Cache support #2027

Closed: philschmid closed this issue 2 months ago

philschmid commented 4 months ago

Feature request

With more FP8-capable instances becoming available on all major platforms, it would be nice if TGI could take advantage of them and start adding FP8-specific features, e.g. an FP8 E4M3 KV cache. I found a WIP branch which seems to be on hold.

vLLM integration: https://docs.vllm.ai/en/v0.4.2/quantization/fp8_e4m3_kvcache.html
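For reference, the linked vLLM v0.4.2 docs expose this as a single constructor argument; a minimal usage sketch (the model name is just a placeholder):

```python
# Minimal sketch of the vLLM v0.4.2 FP8 E4M3 KV-cache option from the link
# above: kv_cache_dtype="fp8" stores the KV cache in FP8 E4M3.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", kv_cache_dtype="fp8")
params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate("San Francisco is a", params)
print(outputs[0].outputs[0].text)
```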

Motivation

Quantizing the KV cache to FP8 halves its memory footprint relative to FP16 (1 byte per element instead of 2). This roughly doubles the number of tokens that can be stored in the cache, improving throughput.
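To illustrate the idea, here is a minimal sketch of per-tensor FP8 E4M3 KV quantization, not TGI's implementation. It assumes PyTorch >= 2.1 for `torch.float8_e4m3fn`; the helper names and the per-tensor scaling scheme are illustrative assumptions.

```python
# Sketch only: per-tensor FP8 E4M3 quantization of a KV tensor.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def quantize_kv(kv: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Scale so the tensor's max magnitude maps to the FP8 E4M3 range.
    scale = kv.abs().amax().float().clamp(min=1e-6) / FP8_E4M3_MAX
    kv_fp8 = (kv.float() / scale).to(torch.float8_e4m3fn)
    return kv_fp8, scale

def dequantize_kv(kv_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an FP16 approximation for use in attention.
    return (kv_fp8.float() * scale).to(torch.float16)

kv = torch.randn(2, 16, 128, dtype=torch.float16)  # (heads, tokens, head_dim)
kv_fp8, scale = quantize_kv(kv)
print(kv_fp8.element_size(), kv.element_size())  # 1 vs 2 bytes/element
print((dequantize_kv(kv_fp8, scale) - kv).abs().max())  # small quantization error
```

In practice a per-layer or per-channel scale calibrated offline is usually preferred over a single per-tensor scale computed on the fly, which is the approach the linked vLLM docs describe.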

Your contribution

.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.