microsoft / TransformerCompression

For releasing code related to compression methods for transformers, accompanying our publications
MIT License

QuaRot: Add activation and KV cache quantization, GPTQ, Phi3, Groupsizes #149

Closed: nailimixaM closed this 2 months ago

nailimixaM commented 3 months ago

KV cache quantization: reproduces Tables 10 and 11 to within 0.01 PPL, i.e. working fully as expected.

Tested weight, activation, and KV cache quantization (i.e. end-to-end RTN): reproduces the full 6- and 8-bit PPL results. For full 4-bit we get:

- Llama-2 7B: 8.60 vs 8.37 in the paper (0.23 worse)
- Llama-2 13B: 6.34 vs 6.09 in the paper (0.25 worse)

Given that A16W4 was only ~0.1 PPL worse on the 7B and 13B models, I think there must be a minor bug somewhere in symmetric RTN (KV cache quantization uses asymmetric RTN). I have some ideas and will investigate.
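
For context, here is a minimal sketch of the two RTN variants being compared (this is not the repository's implementation; the per-tensor granularity, clipping choices, and function names are assumptions for illustration only):

```python
import torch

def rtn_symmetric(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Symmetric round-to-nearest: zero-point fixed at 0, scale set by max |x|.
    # Illustrative per-tensor sketch, not the repo's code.
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale                                  # dequantized ("fake-quant") output

def rtn_asymmetric(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Asymmetric round-to-nearest: scale and zero-point from the min/max range,
    # the variant used for KV cache quantization above.
    qmax = 2 ** bits - 1                              # e.g. 15 for 4-bit
    xmin, xmax = x.min(), x.max()
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    zero = torch.round(-xmin / scale)
    q = torch.clamp(torch.round(x / scale) + zero, 0, qmax)
    return (q - zero) * scale
```

A gap that shows up only in the symmetric path would typically point at the clipping range or grid placement in the symmetric variant rather than at the shared rounding logic.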