mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] OmniQuant (AFAIK) scores best among quantization methods; why no adoption? In any case, is per-tensor quantization best for Mixtral/MoE models? #2247

Open BuildBackBuehler opened 6 months ago

BuildBackBuehler commented 6 months ago

A real two-for-one question, but the two parts are related... I suppose! I was looking to integrate OmniQuant myself, because it performs about 5% better than MLC's quantization methods (I forget which method(s) were used in the user tests I looked at). 5% isn't a lot, but considering MLC is the best platform out there (in my humble opinion), it already performs a whole lot better than vanilla llama.cpp AWQ and the like.

https://github.com/OpenGVLab/OmniQuant/tree/main

It has been around for a while and was spotlighted at the ICLR 2024 conference (the presentation is actually being given in a week or so, so maybe it'll get more recognition then, heh).

Of course, OQ is applied uniformly to all models -- no per-architecture specialization. While I was looking at how to adapt OQ to fit into MLC's quantization framework, I noticed the new per-tensor method that was added. From its script, it looked to me like it is geared towards Mixtral. Just wondering if that's the case, and whether anyone at MLC has experience or results that could point me in the right direction. Thanks!
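For readers unfamiliar with the distinction being asked about, here is a minimal sketch of group-wise versus per-tensor quantization granularity. This is a hypothetical int4 illustration, not MLC's actual code; the group size, symmetric range, and function names are assumptions.

```python
import numpy as np

def group_quantize_int4(w: np.ndarray, group_size: int = 32):
    """One scale per contiguous group of weights (group-wise granularity)."""
    groups = w.reshape(-1, group_size)
    # Symmetric int4 maps to [-8, 7]; pick one scale per group, guarded
    # against all-zero groups.
    scales = np.maximum(np.max(np.abs(groups), axis=1, keepdims=True), 1e-8) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def per_tensor_quantize_int4(w: np.ndarray):
    """A single scale shared by every weight in the tensor (per-tensor)."""
    scale = max(float(np.max(np.abs(w))), 1e-8) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

w = np.random.randn(8, 64).astype(np.float32)
q_g, s_g = group_quantize_int4(w)
q_t, s_t = per_tensor_quantize_int4(w)
# Group-wise stores many small scales; per-tensor stores exactly one.
print(s_g.shape, np.shape(s_t))
```

The trade-off: per-tensor needs far less scale metadata and simpler kernels, but a single outlier weight stretches the scale for the whole tensor, while group-wise localizes that damage to one group.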

vinx13 commented 6 months ago

The per-tensor quantization that was added recently is for FP8. So far we have tested it on Mixtral and Llama; more work, such as calibration scale, is in progress.
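To make the reply concrete, here is a minimal sketch of per-tensor FP8 quantization with a max-abs calibration scale. This is an illustration, not MLC's implementation; it assumes the `ml_dtypes` package for an e4m3 FP8 dtype, and the function names are hypothetical.

```python
import numpy as np
import ml_dtypes  # assumed dependency: provides the float8_e4m3fn dtype

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def per_tensor_fp8_quantize(w: np.ndarray):
    # Max-abs calibration: one scale for the whole tensor, chosen so the
    # largest-magnitude weight lands at the edge of the FP8 range.
    scale = float(np.max(np.abs(w))) / FP8_E4M3_MAX
    q = (w / scale).astype(ml_dtypes.float8_e4m3fn)
    return q, scale

def per_tensor_fp8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, s = per_tensor_fp8_quantize(w)
w_hat = per_tensor_fp8_dequantize(q, s)
print("mean abs error:", float(np.mean(np.abs(w - w_hat))))
```

The "calibration scale" work mentioned above would replace the naive max-abs choice with a scale derived from representative activations or weights, which typically reduces quantization error for outlier-heavy tensors.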