mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

Phi-3 mini 4k instruct with MICROSOFT's quantization #2273

Open federicoparra opened 6 months ago

federicoparra commented 6 months ago

⚙️ Request New Models

Additional context

I know others have made this request already (https://github.com/mlc-ai/mlc-llm/issues/2246, https://github.com/mlc-ai/mlc-llm/pull/2222, https://github.com/mlc-ai/mlc-llm/issues/2238, https://github.com/mlc-ai/mlc-llm/issues/2205).

But I am requesting something different: rather than quantizing or modifying the model's weights yourselves, I am suggesting that you use Microsoft's already 4-bit quantized weights.

The reason is that I suspect (although it is not explicit in their repo) they used quantization-aware training to build these GGUF files. I have tested the regular 32-bit model against the GGUF 4-bit one and the performance is almost equivalent, which is not what I've seen so far with MLC's quantized models (they tend to be less accurate than their 32-bit counterparts).

Is there a way to use Microsoft's own quantized weights?

Thank you! Federico

tqchen commented 6 months ago

Thanks for the suggestion. We are still focused on a major refactoring push to stabilize the universal deployment use case, so we cannot quickly add new format support at the moment.

This is something that I think would be good to explore as a community effort. The main things needed here are a customized loader that loads the weights, and a quantization scheme (which maps the loaded weights into the target weights).
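
A minimal sketch of what those two pieces might look like, assuming the `gguf` Python package that ships with llama.cpp for reading the file; the tensor-name mapping table, function names, and file name below are hypothetical placeholders for illustration, not MLC's actual loader interface:

```python
# Sketch of the two pieces described above: a loader that reads Microsoft's
# GGUF file, and a mapping step that renames (and would repack) tensors into
# the layout the target runtime expects. Assumes `pip install gguf`.
import numpy as np
from gguf import GGUFReader

# Hypothetical mapping from GGUF tensor names to target parameter names.
GGUF_TO_TARGET = {
    "token_embd.weight": "model.embed_tokens.weight",
    "output_norm.weight": "model.norm.weight",
    # ... per-layer entries such as "blk.{i}.attn_qkv.weight" would go here
}

def load_gguf_weights(path: str) -> dict[str, np.ndarray]:
    """Loader: read every tensor in the GGUF file into a name -> array dict.
    Quantized tensors (e.g. Q4_0 / Q4_K) come back as packed uint8 blocks."""
    reader = GGUFReader(path)
    return {t.name: np.asarray(t.data) for t in reader.tensors}

def map_to_target(raw: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Quantization scheme: rename each loaded tensor to the target weight
    name. A real converter would also repack or dequantize the 4-bit blocks
    into whatever packed layout the target kernels consume."""
    out = {}
    for gguf_name, array in raw.items():
        target_name = GGUF_TO_TARGET.get(gguf_name)
        if target_name is None:
            continue  # skip tensors the target model does not use
        out[target_name] = array
    return out

if __name__ == "__main__":
    # Example path to a locally downloaded GGUF file.
    weights = map_to_target(load_gguf_weights("Phi-3-mini-4k-instruct-q4.gguf"))
    print(f"mapped {len(weights)} tensors")
```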

federicoparra commented 6 months ago

Perhaps a converter? So far, contributors generally produce GGUF-quantized versions of models using post-training quantization, but if other large vendors, like Microsoft, begin providing quantization-aware-trained weights in GGUF format, it would be great to be able to import them.

tqchen commented 6 months ago

Right, the loader and quantization scheme combined would effectively be the converter you mentioned.