turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

llama3 quant error #414

Closed. bdambrosio closed this issue 2 months ago.

bdambrosio commented 2 months ago

cd ../../../exllamav2
export CUDA_VISIBLE_DEVICES=2
python3 convert.py -i ../models/llama3-70B-Instruct -o llama3-70B-Instruct-exl2 -cf llama3-70B-Instruct-exl2 -l 2048 -b 8.0 -hb 8 -ss 8192
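
For readers following along, here is the same command annotated with what each flag means, per exllamav2's conversion documentation (the explanations are not part of the original report):

# -i   directory containing the unquantized HF model
# -o   working directory for the conversion job
# -cf  directory to compile the finished quantized model into
# -l   length of calibration rows, in tokens
# -b   target average bits per weight
# -hb  bits for the output (head) layer
# -ss  output shard size in MB
python3 convert.py -i ../models/llama3-70B-Instruct -o llama3-70B-Instruct-exl2 -cf llama3-70B-Instruct-exl2 -l 2048 -b 8.0 -hb 8 -ss 8192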

turboderp commented 2 months ago

This should be fixed in the dev branch. Once I'm done quantizing (and testing) all the 70B versions I'll release v0.0.19 with the fixes.
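
Until v0.0.19 is tagged, a minimal sketch of picking up the fix from the dev branch, assuming a source checkout of the repo (the path and install step are illustrative, not from this thread):

cd exllamav2
git fetch origin
git checkout dev
git pull origin dev
# reinstall so the package picks up the dev sources
pip install -e .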

bdambrosio commented 2 months ago

Thanks! I should have figured you had already spotted it!