mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] OmniQuant (AFAIK) scores best among quantization methods; why no adoption? In any case, is per-tensor quantization best for Mixtral/MoE models? #2247

Open BuildBackBuehler opened 6 months ago

BuildBackBuehler commented 6 months ago

A real two-for-one question, but the two parts are related... I suppose! I was looking to integrate OmniQuant myself, because it performs about 5% better than MLC's quantization methods (I forget which method(s) were used in the user tests I looked at). 5% isn't a lot, but considering MLC is the best platform out there (in my humble opinion), it already performs a whole lot better than vanilla llama.cpp AWQ and the like.

https://github.com/OpenGVLab/OmniQuant/tree/main

It has been around for a while and was spotlighted at the ICLR 2024 conference (the presentation is actually being given in a week or so, so maybe it'll get more recognition then, heh).

Of course, OQ is applied uniformly to all models -- no per-architecture specialization. While I was looking at how to adapt OQ to fit into MLC's quantization framework, I noticed the new per-tensor method that was added. From its script, it looked to me like it is geared towards Mixtral. Just wondering if that's the case, and whether anyone at MLC has experience or results that could point me in the right direction. Thanks!
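For readers unfamiliar with the distinction being asked about, here is a minimal sketch of group-wise versus per-tensor quantization granularity. This is a hypothetical int4 illustration, not MLC's actual code; the group size, symmetric range, and function names are assumptions.

```python
import numpy as np

def group_quantize_int4(w: np.ndarray, group_size: int = 32):
    """One scale per contiguous group of weights (group-wise granularity)."""
    groups = w.reshape(-1, group_size)
    # Symmetric int4 maps to [-8, 7]; pick one scale per group, guarded
    # against all-zero groups.
    scales = np.maximum(np.max(np.abs(groups), axis=1, keepdims=True), 1e-8) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def per_tensor_quantize_int4(w: np.ndarray):
    """A single scale shared by every weight in the tensor (per-tensor)."""
    scale = max(float(np.max(np.abs(w))), 1e-8) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

w = np.random.randn(8, 64).astype(np.float32)
q_g, s_g = group_quantize_int4(w)
q_t, s_t = per_tensor_quantize_int4(w)
# Group-wise stores many small scales; per-tensor stores exactly one.
print(s_g.shape, np.shape(s_t))
```

The trade-off: per-tensor needs far less scale metadata and simpler kernels, but a single outlier weight stretches the scale for the whole tensor, while group-wise localizes that damage to one group.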

vinx13 commented 6 months ago

The per-tensor quantization that was added recently is for FP8. So far we have tested it on Mixtral and Llama; more work, such as calibration scale, is in progress.
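To make the reply concrete, here is a minimal sketch of per-tensor FP8 quantization with a max-abs calibration scale. This is an illustration, not MLC's implementation; it assumes the `ml_dtypes` package for an e4m3 FP8 dtype, and the function names are hypothetical.

```python
import numpy as np
import ml_dtypes  # assumed dependency: provides the float8_e4m3fn dtype

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def per_tensor_fp8_quantize(w: np.ndarray):
    # Max-abs calibration: one scale for the whole tensor, chosen so the
    # largest-magnitude weight lands at the edge of the FP8 range.
    scale = float(np.max(np.abs(w))) / FP8_E4M3_MAX
    q = (w / scale).astype(ml_dtypes.float8_e4m3fn)
    return q, scale

def per_tensor_fp8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, s = per_tensor_fp8_quantize(w)
w_hat = per_tensor_fp8_dequantize(q, s)
print("mean abs error:", float(np.mean(np.abs(w - w_hat))))
```

The "calibration scale" work mentioned above would replace the naive max-abs choice with a scale derived from representative activations or weights, which typically reduces quantization error for outlier-heavy tensors.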