turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

EXL2 format spec? #494

Closed · polarathene closed this issue 3 months ago

polarathene commented 3 months ago

Is the EXL2 format documented anywhere like GGUF is? Or is it only intended to be used with this project?

turboderp commented 3 months ago

The file format is safetensors, and the overall structure of an EXL2 model is the same as the source HF model it's converted from. A few "special" architectures use a little bit of key remapping to massage everything into a similar shape, so I can share as much code as possible between them. A couple of architectures that use tied embeddings also have their embedding layer copied to produce an output layer that can be quantized, so the FP16 embeddings can remain in system RAM.
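A minimal sketch of the tied-embeddings step described above: copy the embedding table into a separate output-layer tensor so the output layer can be quantized on its own while the FP16 embeddings stay in system RAM. The tensor names (`model.embed_tokens.weight`, `lm_head.weight`) follow common HF conventions and are assumptions here, not the exact keys exllamav2 uses:

```python
import numpy as np

def untie_embeddings(state_dict):
    """If the model ties its embeddings to the output layer, materialize a
    separate copy for the output head so it can be quantized independently.
    Tensor names are assumed HF-style, not necessarily exllamav2's exact keys."""
    if "lm_head.weight" not in state_dict:
        state_dict["lm_head.weight"] = state_dict["model.embed_tokens.weight"].copy()
    return state_dict

# Hypothetical tiny model: vocab of 4, hidden size of 3
sd = {"model.embed_tokens.weight": np.random.rand(4, 3).astype(np.float16)}
sd = untie_embeddings(sd)
```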

Aside from that, the difference between a HF model and an EXL2 model is more or less just that most of the .weight tensors have been replaced by a set of quantized tensors that together represent a [k, n] row-major matrix with g groups.
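To illustrate the general idea of splitting a [k, n] matrix into g row groups with per-group scales, here is a generic group-wise quantization sketch. This shows the concept only, not the exact EXL2 tensor layout or bit packing; the function name and parameters are made up for the example:

```python
import numpy as np

def quantize_groupwise(w, group_size=2, bits=4):
    """Quantize a [k, n] row-major matrix in groups of rows: each group
    shares one scale per column. A generic sketch of group-wise
    quantization, not the actual EXL2 storage format."""
    k, n = w.shape
    qmax = 2 ** bits - 1
    q = np.empty((k, n), dtype=np.uint8)
    scales = []
    for g0 in range(0, k, group_size):
        grp = w[g0:g0 + group_size]
        scale = np.abs(grp).max(axis=0) / qmax + 1e-12  # per-column scale for this group
        q[g0:g0 + group_size] = np.clip(np.round(grp / scale), 0, qmax)
        scales.append(scale)
    return q, np.stack(scales)  # q: [k, n] ints, scales: [g, n] floats

# Toy non-negative matrix: k=6 rows, n=4 columns, g=3 groups of 2 rows
w = np.abs(np.random.rand(6, 4)).astype(np.float32)
q, scales = quantize_groupwise(w)
w_hat = np.repeat(scales, 2, axis=0) * q  # dequantize: broadcast each group's scale
```

The matmul kernel then only needs the packed integers plus the small per-group scale tensor to reconstruct (an approximation of) the original FP16 weights on the fly.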

I wrote a bit about it here, specifically how it relates to the matmul kernel, but it covers most of the details: how the scale is quantized, what the group map is, and so on.

turboderp commented 3 months ago

I should probably clarify that the shuffle is only applied at load time.
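The point above, that the shuffle happens once at load time rather than on every forward pass, can be sketched like this. The stored permutation tensor's name (`invperm` here) is a placeholder for the example, not necessarily the key used in the files:

```python
import numpy as np

def apply_load_time_shuffle(q_weight, invperm):
    """Apply the stored permutation once when the model is loaded, so the
    rows are already in the order the kernel expects; nothing is
    reshuffled during inference. (`invperm` is a hypothetical name.)"""
    return q_weight[invperm]

# Toy quantized weight: 4 rows of 3 values, with a stored permutation
q = np.arange(12).reshape(4, 3)
invperm = np.array([2, 0, 3, 1])
shuffled = apply_load_time_shuffle(q, invperm)  # row 0 is now original row 2
```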

Also need to close some of these issues. If there's anything else, lmk.

polarathene commented 3 months ago

No worries! Thanks for sharing those insights, I passed them on to someone who better understands the technical details for implementing EXL2 loading support 👍