mit-han-lab / smoothquant

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
https://arxiv.org/abs/2211.10438
MIT License

Inference time decreases only by 7.5% on opt-6.7B #62

Open FurryMushroom opened 11 months ago

FurryMushroom commented 11 months ago

Is the speedup simply limited at this model size, and would higher acceleration be achieved on the opt-30B model?

leocnj commented 9 months ago

I am also testing the OPT-6.7B model, comparing the FP16 version with the smoothquant version provided by Han's lab.

First, I hit the following warning, and the model accuracy later came out as 0.0:

```
Some weights of the model checkpoint at mit-han-lab/opt-6.7b-smoothquant were not used when initializing Int8OPTForCausalLM:
['model.decoder.layers.5.fc2.a', 'model.decoder.layers.14.fc2.a',
 'model.decoder.layers.18.self_attn.out_proj.a', 'model.decoder.layers.31.self_attn.out_proj.a',
 'model.decoder.layers.9.self_attn.out_proj.a', 'model.decoder.layers.29.self_attn.out_proj.a',
 'model.decoder.layers.27.fc2.a', 'model.decoder.layers.4.fc2.a',
 'model.decoder.layers.1.fc2.a', 'model.decoder.layers.6.fc2.a',
 'model.decoder.layers.15.fc2.a', 'model.decoder.layers.3.self_attn.out_proj.a',
 'model.decoder.layers.22.self_attn.out_proj.a', 'model.decoder.layers.9.fc2.a', ...
```
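For what it's worth, the loading path that produces this warning looks roughly like the sketch below (assuming `Int8OPTForCausalLM` comes from this repo's `smoothquant/opt.py`, as in its examples; treat the exact arguments as an assumption on my part). The point is that checkpoint keys reported as "not used" (here the per-layer `.a` quantization scales) are simply dropped, so those scales keep whatever default they were constructed with, which could explain an accuracy of 0.0.

```python
import torch
from smoothquant.opt import Int8OPTForCausalLM  # class from this repo (assumed import path)

# Load the published INT8 checkpoint. If the installed smoothquant code is
# out of sync with the checkpoint, from_pretrained() reports the '*.a'
# scale tensors as unused and leaves them at their constructed defaults,
# silently breaking the quantized math.
model = Int8OPTForCausalLM.from_pretrained(
    "mit-han-lab/opt-6.7b-smoothquant",
    torch_dtype=torch.float16,   # non-INT8 buffers stay in FP16
    device_map="auto",           # requires `accelerate`
)
```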

Second, on a single A100 card, latency does not decrease: it goes from 45.581 ms (FP16) to 54.93 ms (smoothquant), i.e. the smoothquant model is actually slower.
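For reference, a minimal sketch of the kind of per-forward-pass timing loop behind numbers like these (`measure_latency` is a hypothetical helper, not from this repo; CUDA-event timing is used so the asynchronous kernel launches are actually counted):

```python
import torch

@torch.no_grad()
def measure_latency(model, input_ids, n_warmup=10, n_iters=50):
    # Warm up so CUDA kernels and the allocator are initialized before timing.
    for _ in range(n_warmup):
        model(input_ids)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(n_iters):
        model(input_ids)
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels before reading the timer
    return start.elapsed_time(end) / n_iters  # milliseconds per forward pass

# Run the same inputs through the FP16 and smoothquant models and compare.
```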

Is it possible that the first warning/error explains the higher latency we observed with the smoothquant model?