Open federicoparra opened 6 months ago
Thanks for the suggestion. We are still focused on a major refactoring push to stabilize the universal deployment use case, so we cannot quickly add new format support at the moment.
This is something that I think would be good to explore as a community effort. The main things needed here are a customized loader that loads the weights, and a quantization scheme (which maps the loaded weights into the target weights). A rough sketch of that split is below.
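A minimal sketch of what that split could look like, assuming llama.cpp's `gguf` Python package (`pip install gguf`) for reading. Everything on the MLC side (the target tensor names, the mapping) is a hypothetical placeholder, not an existing API:

```python
# Sketch only: GGUFReader is real (from the gguf pip package); the target
# tensor names and mapping below are hypothetical placeholders.
import numpy as np
from gguf import GGUFReader

def load_gguf_tensors(path: str) -> dict[str, np.ndarray]:
    """Customized loader: read every tensor in a GGUF file into memory."""
    reader = GGUFReader(path)
    return {t.name: np.asarray(t.data) for t in reader.tensors}

def map_to_target(tensors: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Quantization scheme: map loaded GGUF tensors onto the target layout.

    For already-quantized tensors this would repack scales and quantized
    values rather than requantize; the name mapping here is illustrative.
    """
    name_map = {"token_embd.weight": "model.embed_tokens.weight"}  # hypothetical
    return {name_map.get(name, name): data for name, data in tensors.items()}
```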
Perhaps a converter? So far, contributors have generally produced GGUF-quantized versions of models via post-training quantization, but if other large vendors follow Microsoft in providing quantization-aware-trained weights in GGUF format, it would be great to be able to import them.
Right, the loader and quantization scheme combined would effectively be a converter like you mentioned.
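Composed, such a converter would reduce to something like the following (the writer is a stand-in; MLC's actual persistence path is not shown here):

```python
# Hypothetical converter composing the loader and mapping sketched above.
def convert(gguf_path: str, out_dir: str) -> None:
    tensors = load_gguf_tensors(gguf_path)
    mapped = map_to_target(tensors)
    save_target(mapped, out_dir)  # placeholder writer, not a real MLC API
```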
⚙️ Request New Models
Additional context
I know others have made this request already (https://github.com/mlc-ai/mlc-llm/issues/2246, https://github.com/mlc-ai/mlc-llm/pull/2222, https://github.com/mlc-ai/mlc-llm/issues/2238, https://github.com/mlc-ai/mlc-llm/issues/2205).
But I am requesting something different: I am suggesting that you not quantize or modify the model's weights, but instead use Microsoft's already-quantized 4-bit weights.
The reason is that I suspect (although it is not explicit in their repo) they used quantization-aware training to build these GGUF files. I have tested the regular 32-bit model against the 4-bit GGUF one, and the performance is almost equivalent, which is not what I've seen so far with MLC's quantized models (they tend to be less accurate than their 32-bit counterparts).
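For reference, if these files use llama.cpp's Q4_0 format, each block of 32 weights stores one fp16 scale plus 16 bytes of packed 4-bit values, so the quantized values could in principle be unpacked or remapped directly rather than requantized. A sketch of the standard Q4_0 dequantization (other GGUF quant types lay blocks out differently):

```python
import numpy as np

def dequantize_q4_0(block: bytes) -> np.ndarray:
    """Dequantize one llama.cpp Q4_0 block: a 2-byte fp16 scale followed by
    16 bytes packing 32 4-bit values, with w[i] = scale * (q[i] - 8)."""
    scale = np.frombuffer(block[:2], dtype=np.float16)[0].astype(np.float32)
    packed = np.frombuffer(block[2:18], dtype=np.uint8)
    low = (packed & 0x0F).astype(np.int8) - 8   # elements 0..15
    high = (packed >> 4).astype(np.int8) - 8    # elements 16..31
    return scale * np.concatenate([low, high]).astype(np.float32)
```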
Is there a way to use Microsoft's own quantized weights?
Thank you! Federico