Feature request
With more FP8-capable instances becoming available on all major platforms, it would be nice if TGI could take advantage of this and start adding FP8-specific features, e.g. an FP8 E4M3 KV cache. I found a WIP branch which seems to be on hold.

vLLM integration: https://docs.vllm.ai/en/v0.4.2/quantization/fp8_e4m3_kvcache.html
Motivation
Quantizing the KV cache to FP8 reduces its memory footprint. This increases the number of tokens that can be stored in the cache, improving throughput.
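As an illustration of the precision/memory trade-off (not TGI code), here is a minimal scalar sketch of round-to-nearest FP8 E4M3 quantization, assuming the "fn" E4M3 variant (bias 7, 3 mantissa bits, largest finite value 448, no infinities). Each cached K/V element would shrink from 2 bytes (FP16) to 1 byte, roughly doubling cache capacity:

```python
import math

E4M3_MAX = 448.0  # largest finite value in the E4M3 "fn" variant

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest representable FP8 E4M3 value (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)   # saturate instead of overflowing to NaN
    e = math.floor(math.log2(a))
    e = max(e, -6)              # min normal exponent; below this -> subnormal spacing
    e = min(e, 8)               # max exponent (bias 7, exponent field <= 15)
    step = 2.0 ** (e - 3)       # 3 mantissa bits -> spacing 2^(e-3) within the binade
    q = round(a / step) * step
    q = min(q, E4M3_MAX)        # rounding up past 448 would hit the NaN encoding
    return sign * q

# Values snap to a coarse grid: only ~4 significant bits survive.
print(quantize_e4m3(0.3))     # -> 0.3125
print(quantize_e4m3(1000.0))  # -> 448.0 (saturated)
```

In practice this is done per-tensor or per-head with a scale factor so activations fit E4M3's narrow range; the scalar version above just shows the rounding behavior.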
Your contribution
.