mit-han-lab / smoothquant

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
https://arxiv.org/abs/2211.10438
MIT License

Inference time decreases only by 7.5% on opt-6.7B #62

Open FurryMushroom opened 11 months ago

FurryMushroom commented 11 months ago

Is the speedup simply limited at this model size, and would higher acceleration be achieved on the opt-30B model?

leocnj commented 9 months ago

I am also testing the OPT-6.7B model, comparing the FP16 version with the smoothquant version provided by Han's lab.

First, I hit the following warning, and the model accuracy later came out as 0.0:

```
Some weights of the model checkpoint at mit-han-lab/opt-6.7b-smoothquant were not used when initializing Int8OPTForCausalLM:
['model.decoder.layers.5.fc2.a', 'model.decoder.layers.14.fc2.a',
 'model.decoder.layers.18.self_attn.out_proj.a', 'model.decoder.layers.31.self_attn.out_proj.a',
 'model.decoder.layers.9.self_attn.out_proj.a', 'model.decoder.layers.29.self_attn.out_proj.a',
 'model.decoder.layers.27.fc2.a', 'model.decoder.layers.4.fc2.a',
 'model.decoder.layers.1.fc2.a', 'model.decoder.layers.6.fc2.a',
 'model.decoder.layers.15.fc2.a', 'model.decoder.layers.3.self_attn.out_proj.a',
 'model.decoder.layers.22.self_attn.out_proj.a', 'model.decoder.layers.9.fc2.a', ...
```
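For what it's worth, the loading path that produces this warning looks roughly like the sketch below (assuming `Int8OPTForCausalLM` comes from this repo's `smoothquant/opt.py`, as in its examples; treat the exact arguments as an assumption on my part). The point is that checkpoint keys reported as "not used" (here the per-layer `.a` quantization scales) are simply dropped, so those scales keep whatever default they were constructed with, which could explain an accuracy of 0.0.

```python
import torch
from smoothquant.opt import Int8OPTForCausalLM  # class from this repo (assumed import path)

# Load the published INT8 checkpoint. If the installed smoothquant code is
# out of sync with the checkpoint, from_pretrained() reports the '*.a'
# scale tensors as unused and leaves them at their constructed defaults,
# silently breaking the quantized math.
model = Int8OPTForCausalLM.from_pretrained(
    "mit-han-lab/opt-6.7b-smoothquant",
    torch_dtype=torch.float16,   # non-INT8 buffers stay in FP16
    device_map="auto",           # requires `accelerate`
)
```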

Second, on a single A100 card, latency does not decrease: it goes from 45.581 ms (FP16) to 54.93 ms (smoothquant), i.e. the smoothquant model is actually slower.
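For reference, a minimal sketch of the kind of per-forward-pass timing loop behind numbers like these (`measure_latency` is a hypothetical helper, not from this repo; CUDA-event timing is used so the asynchronous kernel launches are actually counted):

```python
import torch

@torch.no_grad()
def measure_latency(model, input_ids, n_warmup=10, n_iters=50):
    # Warm up so CUDA kernels and the allocator are initialized before timing.
    for _ in range(n_warmup):
        model(input_ids)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(n_iters):
        model(input_ids)
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels before reading the timer
    return start.elapsed_time(end) / n_iters  # milliseconds per forward pass

# Run the same inputs through the FP16 and smoothquant models and compare.
```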

Is it possible that the first warning/error explains the higher latency we observed with the smoothquant model?