turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Quick, Non-Data-Driven Quantization #482

Closed alexbrowngh closed 3 months ago

alexbrowngh commented 3 months ago

ExLlamaV2 is undoubtedly one of the most popular and efficient backends for running LLMs locally. However, the current data-driven quantization method, while producing highly accurate models, can be quite time-consuming due to the numerous forward passes required. This can make it difficult to quickly assess new models through quantization.

I'd like to propose adding a quick, non-data-driven quantization method to ExLlamaV2. This would allow users to gain a rapid understanding of a new model's capabilities before potentially opting for a more comprehensive, data-driven quantization process.

This initial, faster method could serve as an excellent starting point for evaluating models, while the existing data-driven approach would remain available for those seeking the highest levels of accuracy.

Thank you for considering this feature request!

DocShotgun commented 3 months ago

There is a load_in_q4 param that can be passed in ExLlamaV2Config, which performs an on-the-fly Q4 quantization of an fp16 model, although it is slow and of lower quality than the regular quants.
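
As a rough illustration, a minimal sketch of that loading path might look like the following. This assumes the usual ExLlamaV2 loading flow (config, prepare, load) and that load_in_q4 is a plain attribute on the config; the model path is hypothetical and the exact API may differ between library versions.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/fp16-model"  # hypothetical path to an unquantized model
config.prepare()                          # read config.json and locate weight files
config.load_in_q4 = True                  # quantize weights to Q4 on the fly while loading

model = ExLlamaV2(config)
model.load()
```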

turboderp commented 3 months ago

There is load_in_q4, yes, but that's mostly for testing purposes: it lets me verify that a new architecture works, for models I can't load in full precision, before I move on to fixing any issues with quantizing it.

A strictly RTN quantization isn't a terrible idea, but I'm not sure it's worth the added complexity, or the confusion it would cause when people, failing to understand that RTN models aren't just a faster way to make EXL2 models, start distributing them. Probably the way to go would be to add support for AWQ or some other simple format.
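
For context on what RTN (round-to-nearest) quantization means here, a minimal sketch is below. It is not ExLlamaV2 code, just an illustration of the technique: each weight group gets a single scale, and weights are rounded to the nearest representable integer with no calibration data or forward passes. The function name, group size, and symmetric-scale choice are assumptions for the example.

```python
import torch

def rtn_quantize(weight: torch.Tensor, bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Round-to-nearest quantization sketch: quantize then dequantize one weight matrix."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"

    # Split the input dimension into groups so each group gets its own scale
    w = weight.reshape(out_features, in_features // group_size, group_size)

    # Symmetric per-group scale: map the largest absolute value onto the top of the integer range
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax

    # Round to the nearest integer, clamp to the representable range, then dequantize
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return (q * scale).reshape(out_features, in_features)
```

The quality gap versus data-driven methods comes from exactly this simplicity: rounding decisions are made per weight in isolation, with no measurement of how the resulting error propagates through actual activations.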

Not really where my priorities are at the moment though.

turboderp commented 3 months ago

I'm going to close this for now. The idea is fine, and I'll keep it somewhere on the todo list, or maybe I'll improve the load_in_q4 option with dedicated kernels. For now there's too much going on.