turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[REQUEST] Is it possible to load a model as NF4 and convert it to Exl2? #652

Closed · charleswg closed this issue 1 month ago

turboderp commented 1 month ago

You can convert it to FP16 in PyTorch, then convert to EXL2 after, but the EXL2 conversion script can't read NF4 models directly, no. I would expect significant degradation either way, though.
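For reference (not from the thread), a rough sketch of that round trip, assuming the NF4 checkpoint is a bitsandbytes-serialized model that transformers can load, and a transformers version new enough to have `PreTrainedModel.dequantize()` (4.42+). The model ID and output directory are placeholders:

```python
# Rough sketch (untested): dequantize a bitsandbytes NF4 checkpoint back to
# FP16 so the EXL2 conversion script can read it. Assumes transformers >= 4.42
# for dequantize(); model ID and output path below are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/Qwen2.5-72B-nf4"  # hypothetical NF4 checkpoint
out_dir = "qwen2.5-72b-fp16"

# from_pretrained reads the quantization_config stored in the checkpoint
# and loads the weights as 4-bit NF4.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Undo the 4-bit quantization, then cast everything to FP16.
# Note: a dequantized 72B model needs ~144 GB of memory/offload space.
model = model.dequantize()
model = model.to(torch.float16)

model.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)
```

The resulting FP16 directory can then go through the usual EXL2 conversion, e.g. `python convert.py -i qwen2.5-72b-fp16 -o work_dir -cf qwen2.5-72b-exl2 -b 4.0`. As noted above, quality loss from the original NF4 quantization is baked in and won't be recovered by this round trip.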

charleswg commented 1 month ago

Thanks for the reply. I deleted the original question because I realized this too.

My original intent in asking was that I found the Qwen 2.5 72B NF4 model performs exceptionally well compared to 72B 8bpw, 72B 4bpw, 72B Q8_0, and 72B Q4_K_M, and even 4o, so I wondered if there was a way to accelerate the NF4 model with EXL2. Granted, the NF4 model hasn't scored highest in perplexity comparisons, but I think it could win at following instructions strictly, with little loss compared to the original model.