Feature request
With more FP8-capable instances becoming available on all major platforms, it would be nice if TGI could take advantage of this and start adding FP8-specific features, e.g. an FP8 E4M3 KV cache. I found a WIP branch which seems to be on hold.

vLLM integration: https://docs.vllm.ai/en/v0.4.2/quantization/fp8_e4m3_kvcache.html
Motivation
Quantizing the KV cache to FP8 reduces its memory footprint. This increases the number of tokens that can be stored in the cache, improving throughput.
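As an illustration of the precision/memory trade-off (not TGI code), here is a minimal scalar sketch of round-to-nearest FP8 E4M3 quantization, assuming the "fn" E4M3 variant (bias 7, 3 mantissa bits, largest finite value 448, no infinities). Each cached K/V element would shrink from 2 bytes (FP16) to 1 byte, roughly doubling cache capacity:

```python
import math

E4M3_MAX = 448.0  # largest finite value in the E4M3 "fn" variant

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest representable FP8 E4M3 value (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)   # saturate instead of overflowing to NaN
    e = math.floor(math.log2(a))
    e = max(e, -6)              # min normal exponent; below this -> subnormal spacing
    e = min(e, 8)               # max exponent (bias 7, exponent field <= 15)
    step = 2.0 ** (e - 3)       # 3 mantissa bits -> spacing 2^(e-3) within the binade
    q = round(a / step) * step
    q = min(q, E4M3_MAX)        # rounding up past 448 would hit the NaN encoding
    return sign * q

# Values snap to a coarse grid: only ~4 significant bits survive.
print(quantize_e4m3(0.3))     # -> 0.3125
print(quantize_e4m3(1000.0))  # -> 448.0 (saturated)
```

In practice this is done per-tensor or per-head with a scale factor so activations fit E4M3's narrow range; the scalar version above just shows the rounding behavior.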
Your contribution
.