turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

AQLM compression/quantization #336

Open Tedy50 opened 4 months ago

Tedy50 commented 4 months ago

Looks like there's a recent idea about compressing the model further during quantization, with pretty good results: https://twitter.com/rohanpaul_ai/status/1755521957058257033

exllama is currently the best thing we have for local LLM inference in terms of performance, and this kind of model compression could take it to the next level by letting bigger models fit into VRAM.
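
For context, here is a minimal sketch of the additive-quantization idea that AQLM builds on: each small group of weights is stored as a few indices into shared codebooks, and dequantization is just summing the selected codebook rows. This is only an illustration with made-up shapes and a greedy residual search, not exllamav2's or the paper's actual implementation, but it shows why the per-weight footprint can drop to around 2 bits.

```python
# Hypothetical sketch of additive quantization (not exllamav2/AQLM code).
import numpy as np

group_size    = 8   # weights packed per code group
num_codebooks = 2   # each group is reconstructed as a sum of entries from M codebooks
codebook_bits = 8   # 2**8 entries per codebook (illustrative; AQLM can use larger codebooks)

rng = np.random.default_rng(0)
codebooks = rng.standard_normal(
    (num_codebooks, 2**codebook_bits, group_size)
).astype(np.float32)

def quantize_group(w):
    # Greedy residual search: pick the nearest entry from each codebook in turn.
    residual, codes = w.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        residual -= cb[idx]
    return np.array(codes, dtype=np.uint16)

def dequantize_group(codes):
    # Reconstruction is just a sum of codebook rows.
    return sum(cb[i] for cb, i in zip(codebooks, codes))

w = rng.standard_normal(group_size).astype(np.float32)
codes = quantize_group(w)
print("reconstruction error:", float(np.abs(w - dequantize_group(codes)).mean()))

# Storage cost: num_codebooks * codebook_bits index bits per group_size weights,
# i.e. 2 * 8 / 8 = 2 bits per weight, plus the (shared, amortized) codebooks.
print("bits per weight:", num_codebooks * codebook_bits / group_size)
```

At roughly 2 bits per weight instead of 4, the same VRAM budget holds a model about twice as large, which is the appeal for exllamav2.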