Open FurryMushroom opened 11 months ago
I am also testing the OPT-6.7B model, comparing the FP16 version against the SmoothQuant version provided by MIT Han Lab.
First, I get the following warning, and the model's accuracy later evaluates to 0.0:
Some weights of the model checkpoint at mit-han-lab/opt-6.7b-smoothquant were not used when initializing Int8OPTForCausalLM: ['model.decoder.layers.5.fc2.a', 'model.decoder.layers.14.fc2.a', 'model.decoder.layers.18.self_attn.out_proj.a', 'model.decoder.layers.31.self_attn.out_proj.a', 'model.decoder.layers.9.self_attn.out_proj.a', 'model.decoder.layers.29.self_attn.out_proj.a', 'model.decoder.layers.27.fc2.a', 'model.decoder.layers.4.fc2.a', 'model.decoder.layers.1.fc2.a', 'model.decoder.layers.6.fc2.a', 'model.decoder.layers.15.fc2.a', 'model.decoder.layers.3.self_attn.out_proj.a', 'model.decoder.layers.22.self_attn.out_proj.a', 'model.decoder.layers.9.fc2.a', ...
Second, on a single A100 card, latency does not decrease: 45.581 ms (FP16) vs. 54.93 ms (SmoothQuant).
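For reference, the latency numbers come from a harness roughly like the one below (a sketch, not my exact script; the `model`/`input_ids` usage in the comment is hypothetical). One thing worth double-checking on GPU: without explicit synchronization, a timer can measure only kernel launches rather than completed work.

```python
import time

def benchmark(fn, warmup=10, iters=50):
    """Return the mean latency of fn() in milliseconds over `iters` runs,
    after `warmup` untimed runs. For CUDA models, fn should call
    torch.cuda.synchronize() internally so completed work is measured,
    not just kernel launches."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3

# Hypothetical usage with a model forward pass:
#   benchmark(lambda: model(input_ids))
mean_ms = benchmark(lambda: sum(range(1000)))
print(f"{mean_ms:.4f} ms")
```

If the FP16 and SmoothQuant runs were timed differently (e.g. one with synchronization and one without, or different warmup), that alone could skew the comparison.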
Is it possible that the warning above explains the higher latency I observed with the SmoothQuant model?
Or is this simply expected at this scale, with larger speedups only appearing on bigger models such as OPT-30B?