microsoft / TransformerCompression

For releasing code related to compression methods for transformers, accompanying our publications

QuaRot: KV cache quantization #148

Closed · nailimixaM closed this 3 months ago

nailimixaM commented 3 months ago

Reproduces Tables 10 and 11 to within 0.01 PPL, i.e. KV cache quantization (with no other quantization in the model) is working fully as expected.
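For context, here is a minimal sketch of round-to-nearest (RTN) fake quantization applied to a K/V tensor. The asymmetric per-row scheme and 4-bit setting are illustrative assumptions, not necessarily the exact grouping this repo uses:

```python
import torch

def rtn_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Round-to-nearest fake quantization along the last dim (illustrative only)."""
    qmax = 2 ** bits - 1
    # Asymmetric per-row scale and zero-point over the last dimension.
    xmin = x.amin(dim=-1, keepdim=True)
    xmax = x.amax(dim=-1, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    zero = torch.round(-xmin / scale)
    q = torch.clamp(torch.round(x / scale) + zero, 0, qmax)
    # Dequantize back to float so the rest of the model runs unchanged.
    return (q - zero) * scale

# Example: fake-quantize a [batch, heads, seq, head_dim] key tensor to 4 bits.
keys = torch.randn(1, 32, 128, 128)
keys_q = rtn_quantize(keys, bits=4)
print((keys - keys_q).abs().mean())
```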

Also tested weight, activation, and KV cache quantization together, i.e. end-to-end RTN: this reproduces the full 6- and 8-bit PPL results. For full 4-bit we get:

| Model | This PR (PPL) | Paper (PPL) | Difference |
| --- | --- | --- | --- |
| Llama-2 7B | 8.60 | 8.37 | 0.23 worse |
| Llama-2 13B | 6.34 | 6.09 | 0.25 worse |
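For reference, the PPL numbers above are obtained roughly as follows; this is a minimal sketch assuming a Hugging Face causal LM and non-overlapping 2048-token windows, which may differ from the repo's actual evaluation scripts:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model name, dtype, and sequence length below are illustrative assumptions;
# the repo's own eval code defines the exact settings behind the table above.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
).eval()

@torch.no_grad()
def perplexity(text: str, seqlen: int = 2048) -> float:
    """Mean next-token NLL over non-overlapping windows, exponentiated."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    nlls = []
    for start in range(0, ids.shape[1] - seqlen + 1, seqlen):
        chunk = ids[:, start : start + seqlen]
        # Hugging Face returns the mean cross-entropy over the shifted labels.
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float())
    return torch.exp(torch.stack(nlls).mean()).item()
```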