microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Optimizing BART model leads to performance degradation #8601

Closed leoozy closed 3 years ago

leoozy commented 3 years ago

I exported the Hugging Face BART model with torch.onnx.export() and got an ONNX model.
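Roughly, the export looks like the following sketch (the facebook/bart-base checkpoint, input names, dynamic axes, and opset version here are assumptions, not the exact script used):

```python
import torch
from transformers import BartModel, BartTokenizer

# Assumed checkpoint; the actual model in the report is a reduced BART variant.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base").eval()

encoded = tokenizer("an example sentence", return_tensors="pt")
torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=12,  # assumed opset
)
```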

And then optimize it with:

onnxpath="/home/sysadmin/downlaod/onnx_models/superreducednmbart/model.onnx"
modelpath="/home/sysadmin/downlaod/onnx_models/optmbarts.onnx"
python -m onnxruntime.transformers.optimizer --input ${onnxpath} --output ${modelpath} --float16 --use_gpu --opt_level 99 --model_type gpt2 --input_int32
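The same fusion pass can also be driven from Python; a sketch under the assumption that the float16 and int32 conversions done by the --float16 / --input_int32 flags are handled separately (paths are placeholders):

```python
from onnxruntime.transformers.optimizer import optimize_model

# Graph fusions roughly equivalent to the CLI call above; the --float16 and
# --input_int32 post-processing steps are not reproduced in this sketch.
opt_model = optimize_model(
    "/path/to/model.onnx",   # placeholder for the exported BART model
    model_type="gpt2",       # same model_type passed to the CLI
    opt_level=99,
    use_gpu=True,
)
opt_model.save_model_to_file("/path/to/optimized.onnx")
```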

And run inference with:

import time
import onnxruntime

def evaluate_onnx(model_path):
    execution_providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
    sess_options = onnxruntime.SessionOptions()
    sess_options.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
    sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    session = onnxruntime.InferenceSession(model_path, sess_options, providers=execution_providers)
    assert 'CUDAExecutionProvider' in session.get_providers()
    latency1 = []
    for i in range(20):
        start = time.time()
        session.run(None, input_ort)  # input_ort: feed dict built beforehand
        latency1.append(time.time() - start)
    print(latency1)
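input_ort is built elsewhere; a sketch of how such a feed could look for the --input_int32 model (the input names and shapes are assumptions about the exported graph):

```python
import numpy as np

# Hypothetical int32 feed keyed by the graph's actual input names.
batch, seq_len = 1, 128
input_ort = {}
for graph_input in session.get_inputs():
    if graph_input.name == "attention_mask":
        input_ort[graph_input.name] = np.ones((batch, seq_len), dtype=np.int32)
    else:
        input_ort[graph_input.name] = np.random.randint(
            0, 1000, size=(batch, seq_len), dtype=np.int32
        )

# One untimed run first, so session initialization / CUDA warm-up does not
# inflate the first measured latency.
session.run(None, input_ort)
```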

The versions of onnxruntime and PyTorch are both the latest. The platform is an A100 GPU on Linux.

Latency for the Hugging Face model: 0.02 s; latency for the optimized ONNX model: 0.04 s.

leoozy commented 3 years ago

I noticed that when the sequence length is more than 200, the optimized ONNX model is faster; as the sequence gets longer, the Hugging Face model's latency grows larger.
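To pin down the crossover point, both models can be timed across several sequence lengths; a sketch, where hf_model is the PyTorch BART model on the GPU, session is the InferenceSession for the optimized graph, and the input names and vocabulary range are assumptions:

```python
import time
import numpy as np
import torch

def mean_latency(fn, runs=20):
    fn()                        # warm-up, not timed
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        fn()
    torch.cuda.synchronize()    # flush pending CUDA work before stopping the clock
    return (time.time() - start) / runs

for seq_len in (64, 128, 256, 512):
    ids = np.random.randint(0, 1000, size=(1, seq_len), dtype=np.int32)
    mask = np.ones((1, seq_len), dtype=np.int32)
    ort_feed = {"input_ids": ids, "attention_mask": mask}
    pt_ids = torch.from_numpy(ids).long().cuda()
    pt_mask = torch.from_numpy(mask).long().cuda()

    ort = mean_latency(lambda: session.run(None, ort_feed))
    with torch.no_grad():
        hf = mean_latency(lambda: hf_model(input_ids=pt_ids, attention_mask=pt_mask))
    print(seq_len, "hf:", hf, "ort:", ort)
```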