A real 2-for-1 in my question, but they are related... I suppose! I was looking to integrate OmniQuant myself, because it performs about 5% better than MLC quants (I forget which method(s) were used in the user tests I looked at). 5% isn't a lot, but considering MLC is the best platform out there (in my humble opinion), it performs a whole lot better than vanilla llama.cpp, AWQ, and the like.
https://github.com/OpenGVLab/OmniQuant/tree/main
It has been around for a while and was spotlighted at the ICLR 2024 conference (the presentation is actually being given in a week or so, so maybe it'll get more recognition then, heh).
Of course, OmniQuant applies to all models -- no per-model specialization -- and while I was looking at how to fit its output into MLC's quantization framework, I noticed the new per-tensor method that was added. From its script it looked to me like it is geared toward Mixtral. Just wondering if that's the case, and whether anyone at MLC has experience or results that could point me in the right direction. Thanks!
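To make concrete what I mean by "fitting OQ in": as far as I understand it, OmniQuant's learnable weight clipping still boils down to ordinary group-wise uniform quantization -- a scale and zero point per group, which is the same layout MLC's group-quant modes store. A rough sketch of that mapping (my own simplification: real LWC learns separate upper/lower clipping factors, and none of these names come from either codebase):

```python
# Illustrative sketch only -- not OmniQuant's or MLC's actual code.
# Shows how a learned clipping factor still reduces to plain group-wise
# uniform quantization: one scale and zero point per group of weights.
import torch

def quantize_group_int4(w: torch.Tensor, clip_gamma, group_size: int = 128):
    """w: (out_features, in_features) weight.
    clip_gamma: learned clipping factor in (0, 1], scalar or per-group
    with shape (out_features, in_features // group_size, 1)."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    wg = w.reshape(out_f, in_f // group_size, group_size)

    # Shrink the per-group range by the learned factor before building the
    # uniform grid -- this is the only place the learned clipping enters.
    w_max = wg.amax(dim=-1, keepdim=True) * clip_gamma
    w_min = wg.amin(dim=-1, keepdim=True) * clip_gamma

    scale = (w_max - w_min).clamp(min=1e-8) / 15.0            # 4-bit: 16 levels
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(wg / scale) + zero, 0, 15)    # uint4 codes

    # (q, scale, zero) is all that would need packing into a weight-only
    # format; dequantization is w_hat ≈ (q - zero) * scale.
    return q.to(torch.uint8), scale, zero
```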
The per-tensor quantization that was added recently is for FP8. So far we have tested it on Mixtral and Llama, and more work, such as calibration scale, is in progress.
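For reference, this is roughly what per-tensor FP8 (E4M3) with a calibration scale looks like -- a minimal sketch only, not MLC's actual implementation; the function name and the fallback-to-tensor-max behaviour are assumptions on my part:

```python
# Minimal sketch of per-tensor FP8 quantization with a calibration scale.
# One scale covers the whole tensor, unlike the per-group int4/int3 modes.
import torch

def per_tensor_fp8(x: torch.Tensor, calib_amax: float | None = None):
    FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn
    # Use a calibration-derived amax if available (e.g. collected from
    # activation statistics), otherwise fall back to the tensor's own max.
    amax = calib_amax if calib_amax is not None else x.abs().max().item()
    scale = max(amax, 1e-8) / FP8_MAX

    # Clamp before casting so values beyond the FP8 range saturate cleanly.
    q = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale  # dequantize with q.to(x.dtype) * scale

# Weights can usually just use their own max; activations are where the
# calibration pass mentioned above comes in.
```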