mit-han-lab / llm-awq

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Vicuna-1.5 Quantized Weights #73

Open mmaaz60 opened 1 year ago

mmaaz60 commented 1 year ago

Hi Authors,

Any plans to release Vicuna-1.5 quantized weights? Thanks

casper-hansen commented 1 year ago

> Hi Authors,
>
> Any plans to release Vicuna-1.5 quantized weights? Thanks

Hi @mmaaz60, do you have access to a GPU? If so, I believe it should be easy enough to quantize the model yourself, since it uses the LLaMA-2 architecture.

See https://github.com/mit-han-lab/llm-awq#usage

mmaaz60 commented 1 year ago

Hi, I did the quantization; however, the weights are not working with FastChat.

A CUDA illegal memory access error is raised. However, the model works in this repo.

casper-hansen commented 1 year ago

> not working with FastChat.

I see. This may be the fault of FastChat and not AWQ. Did you try TinyChat?

mmaaz60 commented 1 year ago

Hi @casperbh96,

Unfortunately, it is not working with TinyChat either. I got the following error:

Missing key(s) in state_dict: "model.embed_tokens.weight", "model.layers.0.self_attn.rotary_emb.inv_freq", "model.layers.0.self_attn.k_proj.qweight", "model.layers.0.self_attn.k_proj.qzeros", "model.layers.0.self_attn.k_proj.scales", "model.layers.0.self_attn.o_proj.qweight", "model.layers.0.self_attn.o_proj.qzeros", "model.layers.0.self_attn.o_proj.scales", "model.layers.0.self_attn.q_proj.qweight", "model.layers.0.self_attn.q_proj.qzeros", "model.layers.0.self_attn.q_proj.scales", "model.layers.0.self_attn.v_proj.qweight", "model.layers.0.self_attn.v_proj.qzeros", "model.layers.0.self_attn.v_proj.scales", "model.layers.0.mlp.down_proj.qweight", "model.layers.0.mlp.down_proj.qzeros", "model.layers.0.mlp.down_proj.scales", "model.layers.0.mlp.gate_proj.qweight", "model.layers.0.mlp.gate_proj.qzeros", "model.layers.0.mlp.gate_proj.scales", "model.layers.0.mlp.up_proj.qweight", "model.layers.0.mlp.up_proj.qzeros", "model.layers.0.mlp.up_proj.scales", "model.layers.0.input_layernorm.weight", "model.layers.0.post_attention_layernorm.weight", [... the same set of keys repeats for model.layers.1 through model.layers.39 ...], "model.norm.weight", "lm_head.weight".
tonylins commented 1 year ago

Hi @mmaaz60, may I know where the error arises from?

mmaaz60 commented 1 year ago

Hi @tonylins,

I see the model is working now; I was mistakenly loading the checkpoint from awq_cache instead of quant_cache. However, I am now facing another error, which seems to be related to context length.
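
For context, my understanding of the two dump locations, inferred from the entry-point flags in my commands below:

awq_cache/vicune-1-5-13b.pt             # AWQ search results (scales/clips); written by --dump_awq, read back with --load_awq
quant_cache/vicuna-13b-1-5-awq-4bit.pt  # real packed INT4 weights; written by --dump_quant, and what TinyChat should load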

Whenever I feed the model a longer prompt, it gives me a CUDA illegal memory access error.

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
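
Following the hint in the message itself, the faulting kernel can be pinned down by forcing synchronous launches for one run (the script name below is a placeholder for the actual entry point):

CUDA_LAUNCH_BLOCKING=1 python your_script.py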

I suspect one of two things is going wrong. First, the context length may have changed during quantization (the original Vicuna-1.5 model works perfectly fine with the same prompts). Second, the quantized model may not handle longer contexts gracefully, becoming corrupted when given a bigger prompt.

I am sharing the detailed steps I used to quantize the model, along with my generated quantized model, for your reference.

Steps to Quantize Vicuna-13b-1.5

  1. Install AWQ following the instructions at https://github.com/mit-han-lab/llm-awq/tree/main#install. Note that I am using CUDA 11.7 with PyTorch 2.0.1+cu117.
  2. Perform AWQ search and save search results.
python -m awq.entry --model_path lmsys/vicuna-13b-v1.5 \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/vicune-1-5-13b.pt

Note that this will automatically download vicuna-13b-v1.5. If it does not, you can clone https://huggingface.co/lmsys/vicuna-13b-v1.5 and set the path accordingly.

  3. Evaluate the AWQ-quantized (fake-quant) model on WikiText-2 (see the command sketch after this list).
  4. Generate real quantized weights (INT4).
mkdir quant_cache
python -m awq.entry --model_path lmsys/vicuna-13b-v1.5 \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/vicune-1-5-13b.pt \
    --q_backend real --dump_quant quant_cache/vicuna-13b-1-5-awq-4bit.pt
  5. Load and evaluate the real quantized model.
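
For reference, steps 3 and 5 use the same entry point in evaluation mode. Based on the repo README (exact flags may differ between versions), the commands are roughly:

# Step 3: evaluate the fake-quantized model on WikiText-2 using the AWQ search results
python -m awq.entry --model_path lmsys/vicuna-13b-v1.5 \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/vicune-1-5-13b.pt \
    --q_backend fake

# Step 5: load and evaluate the real INT4 quantized model
python -m awq.entry --model_path lmsys/vicuna-13b-v1.5 \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/vicuna-13b-1-5-awq-4bit.pt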

The generated 4-bit AWQ quantized model can be downloaded from this link.


Looking forward to hearing back from you soon. Thank you.

casper-hansen commented 1 year ago

> RuntimeError: CUDA error: an illegal memory access was encountered

Looks like you might be running out of memory. Which GPU are you using to load the model?

EDIT: If you can, try upgrading to CUDA 11.8
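
To double-check which CUDA build your PyTorch uses and which GPU it sees:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"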

mmaaz60 commented 1 year ago

Hi @casperbh96,

I am using an A100-40GB GPU, and there is plenty of memory left; the model uses only about 13 GB of the 40 GB. So it's not an out-of-memory issue.

I can give CUDA 11.8 a try, but I am not sure how it would help. Looking forward to your reply. Thanks

mmaaz60 commented 1 year ago

> RuntimeError: CUDA error: an illegal memory access was encountered
>
> Looks like you might be running out of memory. Which GPU are you using to load the model?
>
> EDIT: If you can, try upgrading to CUDA 11.8

Hi @casperbh96,

I tried switching to CUDA 11.8 but am facing the same error. Any insights would be really appreciated. Thanks

casper-hansen commented 1 year ago

> I tried switching to CUDA 11.8 but am facing the same error. Any insights would be really appreciated. Thanks

I am not sure what your specific issue is. Can you please show me the command you use to run TinyChat?

For reference, I just tried to quantize Vicuna 7B 1.5 and run it using TinyChat, and everything worked as expected.

Edit: Try this one maybe: https://huggingface.co/casperhansen/vicuna-7b-v1.5-awq
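
For reference, I run the TinyChat demo along these lines (paths are placeholders, and flags may differ slightly by version):

python demo.py --model_type llama \
    --model_path /path/to/vicuna-7b-v1.5-awq \
    --q_group_size 128 \
    --load_quant /path/to/vicuna-7b-v1.5-awq/awq_model_w4_g128.pt \
    --precision W4A16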

mmaaz60 commented 1 year ago

Hi @casperbh96,

Thank you for taking the time to prepare an AWQ version of the Vicuna-1.5-7B model. Unfortunately, I am facing the same issue with the model you provided. The steps to reproduce the error are listed below.

1. The script I am using for testing

import argparse
import time
import numpy as np
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, modeling_utils
from attributedict.collections import AttributeDict
from tinychat.stream_generators import StreamGenerator, FalconStreamGenerator
from tinychat.utils.load_quant import load_awq_model, load_awq_llama_fast
from tinychat.utils.prompt_templates import get_prompter, get_stop_token_ids

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# opt_params in TinyLLMEngine
gen_params = AttributeDict([
    ("seed", -1),  # RNG seed
    ("n_threads", 1),  # TODO: fix this
    ("n_predict", 512),  # new tokens to predict
    ("n_parts", -1),  # amount of model parts (-1: determine from model dimensions)
    ("n_ctx", 512),  # context size
    ("n_batch", 512),  # batch size for prompt processing (must be >=32 to use BLAS)
    ("n_keep", 0),  # number of tokens to keep from initial prompt
    ("n_vocab", 50272),  # vocabulary size

    # sampling parameters
    ("logit_bias", dict()),  # logit bias for specific tokens: <int, float>
    ("top_k", 40),  # <= 0 to use vocab size
    ("top_p", 0.95),  # 1.0 = disabled
    ("tfs_z", 1.00),  # 1.0 = disabled
    ("typical_p", 1.00),  # 1.0 = disabled
    ("temp", 0.70),  # 1.0 = disabled
    ("repeat_penalty", 1.10),  # 1.0 = disabled
    ("repeat_last_n", 64),  # last n tokens to penalize (0 = disable penalty, -1 = context size)
    ("frequency_penalty", 0.00),  # 0.0 = disabled
    ("presence_penalty", 0.00),  # 0.0 = disabled
    ("mirostat", 0),  # 0 = disabled, 1 = mirostat, 2 = mirostat 2.0
    ("mirostat_tau", 5.00),  # target entropy
    ("mirostat_eta", 0.10),  # learning rate
])

def stream_output(output_stream):
    print("ASSISTANT: ", end="", flush=True)
    pre = 0
    for outputs in output_stream:
        output_text = outputs["text"]
        output_text = output_text.strip().split(" ")
        now = len(output_text) - 1
        if now > pre:
            print(" ".join(output_text[pre:now]), end=" ", flush=True)
            pre = now
    print(" ".join(output_text[pre:]), flush=True)
    if "timing" in outputs and outputs["timing"] is not None:
        timing = outputs["timing"]
        context_tokens = timing["context_tokens"]
        context_time = timing["context_time"]
        total_tokens = timing["total_tokens"]
        generation_time_list = timing["generation_time_list"]
        generation_tokens = len(generation_time_list)
        average_speed = (context_time + np.sum(generation_time_list)) / (context_tokens + generation_tokens)
        print("=" * 50)
        print("Speed of Inference")
        print("-" * 50)
        # print(f"Context Stage    : {context_time/context_tokens * 1000:.2f} ms/token")
        print(f"Generation Stage : {np.average(generation_time_list) * 1000:.2f} ms/token")
        # print(f"Average Speed    : {average_speed * 1000:.2f} ms/token")
        print("=" * 50)
        # print("token num:", total_tokens)
        # print("Model total Time = ", (context_time + np.sum(generation_time_list))*1000, "ms" )
    return " ".join(output_text)

def device_warmup(device: str):
    warm_up = torch.randn((4096, 4096)).to(device)
    torch.mm(warm_up, warm_up)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_type', type=str, default='LLaMa', help='type of the model')
    parser.add_argument('--model_path', type=str, default='vicuna-7b-v1.5-awq')
    parser.add_argument('--precision', type=str, default='W4A16', help='compute precision')
    parser.add_argument('--device', type=str, default='cuda')
    parser.add_argument('--q_group_size', type=int, default=128)
    parser.add_argument('--load_quant', type=str, default='vicuna-7b-v1.5-awq/awq_model_w4_g128.pt')

    args = parser.parse_args()
    assert args.model_type.lower() in ["llama", "falcon", "mpt"], "We only support llama & falcon & mpt now"
    assert args.precision in ["W4A16", "W16A16"], "We only support W4A16/W16A16 now"

    gen_params.n_predict = 512
    gen_params.n_vocab = 32000

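    # Replace PyTorch's weight-init functions with no-ops: the randomly
    # initialized weights are immediately overwritten by the quantized
    # checkpoint, so skipping init saves load time.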
    def skip(*args, **kwargs):
        pass

    torch.nn.init.kaiming_uniform_ = skip
    torch.nn.init.kaiming_normal_ = skip
    torch.nn.init.uniform_ = skip
    torch.nn.init.normal_ = skip

    config = AutoConfig.from_pretrained(args.model_path, trust_remote_code=True)
    if "mpt" in config.__class__.__name__.lower():
        # config.init_device="meta"
        tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_name, trust_remote_code=True)
    else:
        tokenizer = AutoTokenizer.from_pretrained(args.model_path, use_fast=False, trust_remote_code=True)
    modeling_utils._init_weights = False
    torch.set_default_dtype(torch.half)
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

    if args.precision == "W4A16":
        if args.model_type.lower() == "llama":
            model = load_awq_llama_fast(model, args.load_quant, 4, args.q_group_size, args.device)
        else:
            model = load_awq_model(model, args.load_quant, 4, args.q_group_size, args.device)
    else:
        model = AutoModelForCausalLM.from_pretrained(args.model_path, config=config, torch_dtype=torch.float16,
                                                     trust_remote_code=True).to(args.device)

    # device warm up
    device_warmup(args.device)

    if args.model_type.lower() == 'falcon':
        stream_generator = FalconStreamGenerator
    else:
        stream_generator = StreamGenerator

    # Optimize AWQ quantized model
    if args.precision == "W4A16" and args.model_type.lower() == 'llama':
        from tinychat.modules import make_quant_norm, make_quant_attn, make_fused_mlp

        make_quant_attn(model, args.device)
        make_quant_norm(model)
        make_fused_mlp(model)

    model_prompter = get_prompter(args.model_type, args.model_path)
    stop_token_ids = get_stop_token_ids(args.model_type, args.model_path)
    count = 0
    # Get input from the user
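    # NOTE: this hard-coded prompt (three concatenated copies of the same text)
    # is far longer than the n_ctx = 512 set in gen_params above, which may be
    # relevant to the illegal-memory-access error on long prompts.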
    input_prompt = ("Vicuna, I've been reading extensively about emerging technologies and their impact on modern society, business, and our daily lives. I'm interested in understanding more about the technologies that are on the horizon and how they might change the world as we know it. Specifically, could you provide insights on the current state of Quantum Computing and how it compares to classical computers? How are companies like Google, IBM, and others progressing in their research and development efforts in this area? Another area that has caught my attention is the field of Biotechnology. I'm curious about CRISPR and its potential applications in the real world. How far are we from seeing CRISPR being used to treat genetic diseases in humans on a large scale? And speaking of biotech, what's the latest on lab-grown meat and its implications for global food security? Switching gears a bit, I've also been hearing a lot about the metaverse. Could you describe what it is and how companies are building and capitalizing on it? How will augmented reality (AR) and virtual reality (VR) technologies play into the development and adoption of the metaverse? I've seen some initial applications in gaming, but I'm interested in its broader societal implications. Additionally, what's the potential of decentralized finance (DeFi) and how it's disrupting traditional financial systems? I'm also interested in the development of new battery technologies, as there's a lot of talk about how they might revolutionize transportation and renewable energy storage. What are solid-state batteries and how do they differ from the lithium-ion batteries that we commonly use today? On the topic of energy, are there any groundbreaking innovations in nuclear fusion that could make it a viable and sustainable energy source in the near future? Another area I'd like to touch on is the field of Artificial Intelligence (AI). While I understand the basics, I'm curious about the advancements in neural network architectures beyond what we know as transformers. Are there new paradigms that researchers are exploring? How is AI being applied in unexpected or novel ways in different industries? Furthermore, there's been a lot of discussion about the ethical implications of these emerging technologies. How are companies, researchers, and policymakers navigating the challenges associated with ensuring that these technologies are developed and deployed responsibly? What are the major ethical considerations for technologies like facial recognition, deepfakes, and AI in general? Lastly, could you touch on the role of space exploration in the coming decades? I've been following the progress of companies like SpaceX and Blue Origin and am interested in understanding how space technologies might shape our future, both in terms of exploration and potential colonization efforts. With the above in mind, I'd appreciate a comprehensive overview of these topics, as well as any other emerging technologies you believe are poised to have a significant impact in the next decade."
                    "Vicuna, I've been reading extensively about emerging technologies and their impact on modern society, business, and our daily lives. I'm interested in understanding more about the technologies that are on the horizon and how they might change the world as we know it. Specifically, could you provide insights on the current state of Quantum Computing and how it compares to classical computers? How are companies like Google, IBM, and others progressing in their research and development efforts in this area? Another area that has caught my attention is the field of Biotechnology. I'm curious about CRISPR and its potential applications in the real world. How far are we from seeing CRISPR being used to treat genetic diseases in humans on a large scale? And speaking of biotech, what's the latest on lab-grown meat and its implications for global food security? Switching gears a bit, I've also been hearing a lot about the metaverse. Could you describe what it is and how companies are building and capitalizing on it? How will augmented reality (AR) and virtual reality (VR) technologies play into the development and adoption of the metaverse? I've seen some initial applications in gaming, but I'm interested in its broader societal implications. Additionally, what's the potential of decentralized finance (DeFi) and how it's disrupting traditional financial systems? I'm also interested in the development of new battery technologies, as there's a lot of talk about how they might revolutionize transportation and renewable energy storage. What are solid-state batteries and how do they differ from the lithium-ion batteries that we commonly use today? On the topic of energy, are there any groundbreaking innovations in nuclear fusion that could make it a viable and sustainable energy source in the near future? Another area I'd like to touch on is the field of Artificial Intelligence (AI). While I understand the basics, I'm curious about the advancements in neural network architectures beyond what we know as transformers. Are there new paradigms that researchers are exploring? How is AI being applied in unexpected or novel ways in different industries? Furthermore, there's been a lot of discussion about the ethical implications of these emerging technologies. How are companies, researchers, and policymakers navigating the challenges associated with ensuring that these technologies are developed and deployed responsibly? What are the major ethical considerations for technologies like facial recognition, deepfakes, and AI in general? Lastly, could you touch on the role of space exploration in the coming decades? I've been following the progress of companies like SpaceX and Blue Origin and am interested in understanding how space technologies might shape our future, both in terms of exploration and potential colonization efforts. With the above in mind, I'd appreciate a comprehensive overview of these topics, as well as any other emerging technologies you believe are poised to have a significant impact in the next decade."
                    "Vicuna, I've been reading extensively about emerging technologies and their impact on modern society, business, and our daily lives. I'm interested in understanding more about the technologies that are on the horizon and how they might change the world as we know it. Specifically, could you provide insights on the current state of Quantum Computing and how it compares to classical computers? How are companies like Google, IBM, and others progressing in their research and development efforts in this area? Another area that has caught my attention is the field of Biotechnology. I'm curious about CRISPR and its potential applications in the real world. How far are we from seeing CRISPR being used to treat genetic diseases in humans on a large scale? And speaking of biotech, what's the latest on lab-grown meat and its implications for global food security? Switching gears a bit, I've also been hearing a lot about the metaverse. Could you describe what it is and how companies are building and capitalizing on it? How will augmented reality (AR) and virtual reality (VR) technologies play into the development and adoption of the metaverse? I've seen some initial applications in gaming, but I'm interested in its broader societal implications. Additionally, what's the potential of decentralized finance (DeFi) and how it's disrupting traditional financial systems? I'm also interested in the development of new battery technologies, as there's a lot of talk about how they might revolutionize transportation and renewable energy storage. What are solid-state batteries and how do they differ from the lithium-ion batteries that we commonly use today? On the topic of energy, are there any groundbreaking innovations in nuclear fusion that could make it a viable and sustainable energy source in the near future? Another area I'd like to touch on is the field of Artificial Intelligence (AI). While I understand the basics, I'm curious about the advancements in neural network architectures beyond what we know as transformers. Are there new paradigms that researchers are exploring? How is AI being applied in unexpected or novel ways in different industries? Furthermore, there's been a lot of discussion about the ethical implications of these emerging technologies. How are companies, researchers, and policymakers navigating the challenges associated with ensuring that these technologies are developed and deployed responsibly? What are the major ethical considerations for technologies like facial recognition, deepfakes, and AI in general? Lastly, could you touch on the role of space exploration in the coming decades? I've been following the progress of companies like SpaceX and Blue Origin and am interested in understanding how space technologies might shape our future, both in terms of exploration and potential colonization efforts. With the above in mind, I'd appreciate a comprehensive overview of these topics, as well as any other emerging technologies you believe are poised to have a significant impact in the next decade.")
    model_prompter.insert_prompt(input_prompt)  # add the user turn to the chat template
    output_stream = stream_generator(model, tokenizer, model_prompter.model_input, gen_params, device=args.device,
                                     stop_token_ids=stop_token_ids)
    outputs = stream_output(output_stream)  # stream tokens to stdout and collect the text
    model_prompter.update_template(outputs)  # append the response to the conversation history
    count += 1

Note that I just removed the while loop and instead hard-coded the prompt.

2. I ran the above script on the standard Vicuna-7b-1.5 from lmsys/vicuna-7b-v1.5 using the following command. The program ran successfully without any error, and I could see the response printed on the terminal.

python demo_sample.py --model_type llama --model_path lmsys/vicuna-7b-v1.5 --precision W16A16

3. I then tried your provided AWQ 4-bit model with the same script, using the following command:

python demo_sample.py --model_type llama --model_path vicuna-7b-v1.5-awq --q_group_size 128 --load_quant vicuna-7b-v1.5-awq/awq_model_w4_g128.pt --precision W4A16

Here I got the error listed below:

ASSISTANT: Traceback (most recent call last):
  File "llm-awq/tinychat/demo.py", line 151, in <module>
    outputs = stream_output(output_stream)
  File "llm-awq/tinychat/demo.py", line 47, in stream_output
    for outputs in output_stream:
  File "awq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
  File "llm-awq/tinychat/stream_generators/stream_gen.py", line 70, in StreamGenerator
    out = model(
  File "awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "awq/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 824, in forward
    logits = self.lm_head(hidden_states)
  File "awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "awq/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

So, my understanding is that quantization is somehow reducing the model's original context size. Note that Vicuna-1.5 models have a 4K context size. Secondly, the quantized model produces a CUDA execution error instead of raising a context-size-related exception. This may be because the context length read from the config is still 4K, while the model effectively has a reduced context length.
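To test this hypothesis, here is a minimal sketch comparing the context length recorded in the two checkpoints' configs (the paths are taken from the commands above; this check is my own suggestion, not part of the repo's tooling):

from transformers import AutoConfig

# Compare the context window stored in the FP16 and AWQ checkpoints.
fp16_cfg = AutoConfig.from_pretrained("lmsys/vicuna-7b-v1.5")
awq_cfg = AutoConfig.from_pretrained("vicuna-7b-v1.5-awq")

print("FP16 max_position_embeddings:", fp16_cfg.max_position_embeddings)
print("AWQ  max_position_embeddings:", awq_cfg.max_position_embeddings)

# If both print 4096 but generation still crashes past ~2K tokens, the 2K
# limit is probably baked in at load time (e.g. a pre-allocated cache)
# rather than stored in the config itself.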

Please correct me if I am wrong. Looking forward to your reply and a potential solution to the issue. Thank you.

casper-hansen commented 1 year ago

It looks like you are trying to modify demo.py, so I can't be sure exactly what is going on.

I have been working on a refactoring of AWQ. Can you try testing against #72 instead of modifying anything?

git clone https://github.com/casperbh96/llm-awq.git
cd llm-awq
git checkout refactor-models
pip install -e .
cd awq/kernels
python setup.py install
cd ../../tinychat

Then you should simply be able to run this:

python demo.py --model_path casperhansen/vicuna-7b-v1.5-awq
mmaaz60 commented 1 year ago

Hi @casperbh96,

Thank you for your prompt reply, and I really appreciate the effort you are putting into refactoring the code. However, please note that the error can be reproduced using the ORIGINAL demo.py code as well. Please input the prompt listed below 2-3 times in the same run and you should be able to reproduce the error.

Vicuna, I have been reading extensively about emerging technologies and their impact on modern society, business, and our daily lives. I am interested in understanding more about the technologies that are on the horizon and how they might change the world as we know it. Specifically, could you provide insights on the current state of Quantum Computing and how it compares to classical computers? How are companies like Google, IBM, and others progressing in their research and development efforts in this area? Another area that has caught my attention is the field of Biotechnology. I am curious about CRISPR and its potential applications in the real world. How far are we from seeing CRISPR being used to treat genetic diseases in humans on a large scale? And speaking of biotech, what is the latest on lab-grown meat and its implications for global food security? Switching gears a bit, I have also been hearing a lot about the metaverse. Could you describe what it is and how companies are building and capitalizing on it? How will augmented reality (AR) and virtual reality (VR) technologies play into the development and adoption of the metaverse? I have seen some initial applications in gaming, but I am interested in its broader societal implications. Additionally, what is the potential of decentralized finance (DeFi) and how it is disrupting traditional financial systems? I am also interested in the development of new battery technologies, as there is a lot of talk about how they might revolutionize transportation and renewable energy storage. What are solid-state batteries and how do they differ from the lithium-ion batteries that we commonly use today? On the topic of energy, are there any groundbreaking innovations in nuclear fusion that could make it a viable and sustainable energy source in the near future? Another area I would like to touch on is the field of Artificial Intelligence (AI). While I understand the basics, I am curious about the advancements in neural network architectures beyond what we know as transformers. Are there new paradigms that researchers are exploring? How is AI being applied in unexpected or novel ways in different industries? Furthermore, there has been a lot of discussion about the ethical implications of these emerging technologies. How are companies, researchers, and policymakers navigating the challenges associated with ensuring that these technologies are developed and deployed responsibly? What are the major ethical considerations for technologies like facial recognition, deepfakes, and AI in general? Lastly, could you touch on the role of space exploration in the coming decades? I have been following the progress of companies like SpaceX and Blue Origin and am interested in understanding how space technologies might shape our future, both in terms of exploration and potential colonization efforts. With the above in mind, I would appreciate a comprehensive overview of these topics, as well as any other emerging technologies you believe are poised to have a significant impact in the next decade.

demo.py keeps the history, so every consecutive prompt adds to the context until it reaches the maximum length the model can support, and the error finally appears once the context length exceeds 2K. Please note that for some reason the input_prompt = input("USER: ") call cannot read a long prompt in one go, which is why the same prompt has to be entered a couple of times in the same run to reach the point where the error reproduces.
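As a side note, here is a minimal sketch of how one could track exactly when the accumulated history crosses the 2K boundary, instead of eyeballing it (this assumes the stock Vicuna tokenizer and is not part of demo.py; the prompt string is a truncated placeholder for the long prompt above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")

history = ""
prompt = "Vicuna, I have been reading extensively about emerging technologies ..."  # placeholder for the full prompt
for turn in range(3):
    history += prompt
    n_tokens = len(tokenizer(history).input_ids)
    print(f"turn {turn + 1}: {n_tokens} tokens")  # the crash appears once this exceeds 2048

Reading the prompt from a file, e.g. open("prompt.txt").read(), would also sidestep the input() truncation.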

Also please note that this error does not trigger with the FP16 model.

I would really appreciate it if you could try this and confirm whether you can reproduce the error. Thank you.

casper-hansen commented 1 year ago

demo.py keeps the history, so every consecutive prompt adds to the context until it reaches the maximum length the model can support, and the error finally appears once the context length exceeds 2K. Please note that for some reason the input_prompt = input("USER: ") call cannot read a long prompt in one go, which is why the same prompt has to be entered a couple of times in the same run to reach the point where the error reproduces.

Also please note that this error does not trigger with the FP16 model.

I would really appreciate it if you could try this and confirm whether you can reproduce the error. Thank you.

I am not sure I completely understand your problem here. Perhaps the problem is that you are trying to use the model in a way that it was not designed for, like going over the maximum context length?

mmaaz60 commented 1 year ago

demo.py keeps the history, so every consecutive prompt adds to the context until it reaches the maximum length the model can support, and the error finally appears once the context length exceeds 2K. Please note that for some reason the input_prompt = input("USER: ") call cannot read a long prompt in one go, which is why the same prompt has to be entered a couple of times in the same run to reach the point where the error reproduces. Also please note that this error does not trigger with the FP16 model. I would really appreciate it if you could try this and confirm whether you can reproduce the error. Thank you.

I am not sure I completely understand your problem here. Perhaps the problem is that you are trying to use the model in a way that it was not designed for, like going over the maximum context length?

Hi @casperbh96

I am not going over the context length. What I am trying to convey is that the quantized model has a lower context length than the original model.

Specifically, the quantized model starts erroring out once the accumulated context exceeds roughly 2K tokens, while the original FP16 model works up to its full 4K context.

And what I want is a way to quantize the original model while keeping the context length at 4096. Please let me know if you have any questions. Thanks

casper-hansen commented 1 year ago

I will investigate this in the future. You should be able to keep the same context length without problems; maybe it's just something to do with how the config is being saved.

casper-hansen commented 1 year ago

I have now investigated what is happening. Huggingface transformers/accelerate is not automatically loading the maximum sequence length into the model, which causes some problems. I will aim to solve this in the upcoming #72.
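For reference, a hypothetical sketch of the fix direction described here: read the true context window from the HF config and pass it explicitly to whatever allocates the attention cache, instead of relying on a 2048 default. resolve_max_seq_len and build_cache are illustrative names, not the repo's actual API:

from transformers import AutoConfig

def resolve_max_seq_len(model_path: str, default: int = 2048) -> int:
    # Llama-family configs store the context window in max_position_embeddings;
    # fall back to a default if the field is absent.
    cfg = AutoConfig.from_pretrained(model_path)
    return getattr(cfg, "max_position_embeddings", default)

max_seq_len = resolve_max_seq_len("lmsys/vicuna-7b-v1.5")  # 4096 for Vicuna-1.5
# cache = build_cache(model, max_seq_len=max_seq_len)  # illustrative only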