Closed alexbrowngh closed 3 months ago
There is a load_in_q4 param that can be set on ExLlamaV2Config that allows on-the-fly Q4 quantization of an FP16 model, although it is slow and poor quality compared to the regular quants.
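For reference, a minimal sketch of how that option might be set (the model path is a placeholder, and the exact attribute name/placement may differ between exllamav2 versions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/fp16-model"  # placeholder path
config.prepare()
config.load_in_q4 = True  # on-the-fly Q4 quant of the FP16 weights at load time

model = ExLlamaV2(config)
model.load()
```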
There is load_in_q4, yes, but that's mostly for testing purposes, so I can verify that a new architecture works before I move on to fixing any issues with quantizing it, for models that I can't load in full precision.
A strictly RTN (round-to-nearest) quantization isn't a terrible idea, but I'm not sure it's worth the added complexity, or the confusion it would cause when people fail to understand that RTN models aren't just a faster way to make EXL2 models and start distributing them. Probably the way to go would be to add support for AWQ or some other simple format.
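To illustrate why RTN is fast but less accurate: it simply scales each group of weights and rounds to the nearest integer level, with no calibration data or forward passes. A minimal NumPy sketch of symmetric 4-bit RTN (illustrative only, not ExLlamaV2's code):

```python
import numpy as np

def rtn_quantize(w, bits=4, group_size=32):
    # Symmetric round-to-nearest per group: pick a scale so the largest
    # magnitude in each group maps to the top integer level, then round.
    # No data-driven error minimization is involved.
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit symmetric
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                       # avoid division by zero
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def rtn_dequantize(q, scales):
    return (q * scales).astype(np.float32)

w = np.random.randn(8, 32).astype(np.float32)
q, s = rtn_quantize(w.ravel())
w_hat = rtn_dequantize(q, s).reshape(w.shape)
```

The per-element error is bounded by half a scale step, but unlike the data-driven EXL2 quantizer, nothing compensates for how rounding one weight affects the layer's actual output.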
Not really where my priorities are at the moment though.
I'm going to close this for now. The idea is fine, and I'll keep it somewhere on the todo list, or maybe I'll improve the load_in_q4 option with dedicated kernels. For now there's too much going on.
ExLlamaV2 is undoubtedly one of the most popular and efficient backends for running LLMs locally. However, the current data-driven quantization method, while producing highly accurate models, can be quite time-consuming due to the many calibration forward passes it requires. This can make it difficult to quickly assess new models through quantization.
I'd like to propose adding a quick, non-data-driven quantization method to ExLlamaV2. This would allow users to gain a rapid understanding of a new model's capabilities before potentially opting for a more comprehensive, data-driven quantization process.
This initial, faster method could serve as an excellent starting point for evaluating models, while the existing data-driven approach would remain available for those seeking the highest levels of accuracy.
Thank you for considering this feature request!