charleswg closed this issue 1 month ago
Thanks for the reply. I deleted the original question because I realized this too.
My original intent in asking was that I found the Qwen 2.5 72B NF4 model performs exceptionally well compared to 72B 8bpw, 72B 4bpw, 72B Q8_0, and 72B Q4_K_M, and even 4o, so I wondered if there was a way to accelerate the NF4 model with EXL2. Granted, the NF4 model hasn't scored highest in perplexity comparisons, but I think it could win at following instructions strictly, without much loss compared to the original model.
You can convert it to FP16 in PyTorch and then to EXL2 afterwards, but no, the EXL2 conversion script can't read NF4 models directly. I would expect significant degradation either way, though.
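For reference, a minimal sketch of that FP16 round-trip, assuming the NF4 checkpoint was produced through bitsandbytes/transformers and that your transformers version has the `dequantize()` method for bitsandbytes models; the paths are placeholders:

```python
# Rough sketch, not a tested recipe: dequantize an NF4 checkpoint back to FP16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

nf4_path = "path/to/qwen2.5-72b-nf4"    # hypothetical local NF4 checkpoint
fp16_path = "path/to/qwen2.5-72b-fp16"  # hypothetical output directory

# from_pretrained picks up the quantization_config stored in the checkpoint
# and loads the weights through bitsandbytes.
model = AutoModelForCausalLM.from_pretrained(nf4_path, device_map="auto")

# Recent transformers releases expose .dequantize() for bitsandbytes models;
# it swaps the Linear4bit layers back to plain nn.Linear. Note that a 72B
# model needs on the order of 150 GB of memory once expanded to FP16.
model = model.dequantize()
model = model.to(torch.float16)

model.save_pretrained(fp16_path)
AutoTokenizer.from_pretrained(nf4_path).save_pretrained(fp16_path)
```

From there the usual EXL2 quantization should apply, something like `python convert.py -i <fp16_dir> -o <work_dir> -cf <final_dir> -b 4.0` with the convert script from the exllamav2 repo (check the repo docs for the current flags). Keep in mind the FP16 weights here are reconstructed from NF4, so the EXL2 output inherits the NF4 rounding error on top of its own.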