irthomasthomas / undecidability


GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga #304

Open irthomasthomas opened 4 months ago

irthomasthomas commented 4 months ago

GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes

| Size (MB) | Model |
|----------:|-------|
| 16560 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.000b |
| 17053 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.125b |
| 17463 | Phind-CodeLlama-34B-v2-AWQ-4bit-128g |
| 17480 | Phind-CodeLlama-34B-v2-GPTQ-4bit-128g-actorder |
| 17548 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.250b |
| 18143 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.400b |
| 19133 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.650b |
| 19284 | phind-codellama-34b-v2.Q4_K_M.gguf |
| 19320 | Phind-CodeLlama-34B-v2-AWQ-4bit-32g |
| 19337 | Phind-CodeLlama-34B-v2-GPTQ-4bit-32g-actorder |

I created all these EXL2 quants to compare them to GPTQ and AWQ. The preliminary result is that EXL2 4.400b seems to outperform GPTQ-4bit-32g, and EXL2 4.125b seems to outperform GPTQ-4bit-128g, while using less VRAM in both cases.
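For reference, a size column like the one above can be reproduced by summing the files in each quant's output directory (or taking the size of a single GGUF file). A minimal Python sketch, with placeholder paths standing in for wherever the quants actually live:

```python
from pathlib import Path

# Placeholder paths to quantized model folders / a GGUF file; these are
# assumptions about local layout, not the actual directories used above.
MODEL_PATHS = [
    "models/Phind_Phind-CodeLlama-34B-v2-EXL2-4.125b",
    "models/Phind-CodeLlama-34B-v2-GPTQ-4bit-128g-actorder",
    "models/phind-codellama-34b-v2.Q4_K_M.gguf",
]


def size_mb(path: Path) -> float:
    """Size of a single file, or of every file under a directory, in MB."""
    if path.is_file():
        return path.stat().st_size / 1024**2
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1024**2


# Print the quants smallest-to-largest, like the table above.
for p in sorted((Path(m) for m in MODEL_PATHS), key=size_mb):
    print(f"{size_mb(p):>8.0f} MB  {p.name}")
```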

I couldn't test AWQ yet because my quantization ended up broken, possibly because this particular model uses NTK scaling. I'll probably have to go through the fun of burning my GPU for another 16 hours to quantize and evaluate a different model before a conclusion can be reached.
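One low-cost sanity check before committing a GPU to another multi-hour quantization run is to inspect the model's `config.json` for RoPE/NTK-related settings. A minimal sketch, assuming a locally downloaded copy of the model (the path is a placeholder):

```python
import json
from pathlib import Path

# Placeholder path to a locally downloaded Hugging Face model directory.
config_path = Path("models/Phind-CodeLlama-34B-v2/config.json")

with config_path.open() as f:
    config = json.load(f)

# CodeLlama-derived models typically raise rope_theta (e.g. 1000000) rather
# than setting an explicit rope_scaling block; either field signals
# non-default positional scaling that a quantization pipeline may mishandle.
print("rope_theta:  ", config.get("rope_theta", "not set"))
print("rope_scaling:", config.get("rope_scaling", "not set"))
```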

I also have no idea whether Phind-CodeLlama is actually good; WizardCoder-Python might be better.

Suggested labels

"LLM-Quantization"