microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

onnx slower than original pytorch ner model? #8957

Open Hap-Zhang opened 3 years ago

Hap-Zhang commented 3 years ago

Hi, all

I want to accelerate inference for an NER model based on BERT. When I export the ONNX model with torch.onnx.export, I find that inference with the ONNX model (505 ms) is slower than with the original PyTorch model (337 ms). I'm not sure what went wrong.

The version of PyTorch is 1.9.0 and the version of transformers is 4.6.1.

The code is listed below:

import time
import torch
from onnxruntime import InferenceSession, SessionOptions

# `model` and `device` are assumed to be defined earlier (the fine-tuned BERT NER model).
inputs = {'input_ids': torch.randint(32, [2, 32], dtype=torch.long).to(device),
          'attention_mask': torch.ones([2, 32], dtype=torch.long).to(device),
          'token_type_ids': torch.ones([2, 32], dtype=torch.long).to(device)}
# Export to ONNX with dynamic batch and sequence dimensions.
onnx_model_path = "ner-model.onnx"
with open(onnx_model_path, 'wb') as outf:
    torch.onnx.export(model, args=(inputs['input_ids'], inputs['attention_mask'], inputs['token_type_ids']),
                      f=outf, input_names=['input_ids', 'attention_mask', 'token_type_ids'],
                      output_names=['output'], opset_version=11, do_constant_folding=True,
                      dynamic_axes={'input_ids': [0, 1], 'attention_mask': [0, 1], 'token_type_ids': [0, 1]})

# Run the exported model with ONNX Runtime on CPU.
options = SessionOptions()
session = InferenceSession(onnx_model_path, options, providers=['CPUExecutionProvider'])
inputs_onnx = {name: tensor.cpu().numpy() for name, tensor in inputs.items()}
########### Warm up ############
outputs = session.run(None, inputs_onnx)
################################
start_time2 = time.time()
outputs = session.run(None, inputs_onnx)
end_time2 = time.time()
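
A single timed run can be noisy; a more stable comparison could average over repeated runs (a minimal sketch reusing the session and inputs_onnx above; n_runs is an arbitrary choice):

n_runs = 100
start = time.time()
for _ in range(n_runs):
    outputs = session.run(None, inputs_onnx)
avg_ms = (time.time() - start) / n_runs * 1000
print(f"ONNX Runtime average latency: {avg_ms:.1f} ms")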
tianleiwu commented 3 years ago

@Hap-Zhang, could you try model optimization like

...
from onnxruntime.transformers.optimizer import optimize_model
onnx_model = optimize_model(onnx_model_path)
new_onnx_path = "ner_opt_model.onnx"
onnx_model.save_model_to_file(new_onnx_path)
options = SessionOptions()
session = InferenceSession(new_onnx_path, options, providers=['CPUExecutionProvider'])
...

Check whether there are "Attention" nodes in the new ONNX file. Performance might be impacted if "Attention" is not fused.
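
For example, one quick way to check is to count the node types in the optimized graph (a minimal sketch using the onnx Python package; the path matches the snippet above):

import onnx
from collections import Counter

op_counts = Counter(node.op_type for node in onnx.load("ner_opt_model.onnx").graph.node)
print(op_counts)  # expect fused ops such as "Attention" and "SkipLayerNormalization" when fusion succeeded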

For more information, please refer to the notebook: https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/notebooks/PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb

Hap-Zhang commented 3 years ago

@tianleiwu Thank you for your quick reply. I tried the model optimization as you suggested, and the inference time went from 505 ms to 359 ms. However, the inference time of the original PyTorch model drops to 337 ms when I set export MKL_CBWR=AVX2, while nothing changes for the ONNX model whether AVX2 is set or not. Is the backend of ONNX Runtime MKL or something else?

tianleiwu commented 3 years ago

The CPU execution provider does not use MKL; it uses MLAS. As far as I know, MLAS can leverage AVX2 in some situations (like Windows, x64, and quantized models): https://github.com/microsoft/onnxruntime/blob/0cc29095733cecc55efe0b8e0d8ff7cd2a9e427a/cmake/onnxruntime_mlas.cmake#L113-L114
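
If you want to try the quantized-model path (one of the cases where MLAS can use AVX2 kernels), ONNX Runtime ships a dynamic quantization helper; a minimal sketch, reusing the optimized model path from the earlier snippet (the int8 output filename is arbitrary, and quantization can affect accuracy, so verify quality afterwards):

from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize weights to int8; activations stay float and are quantized dynamically at runtime.
quantize_dynamic("ner_opt_model.onnx", "ner_opt_model_int8.onnx", weight_type=QuantType.QInt8)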

zakki commented 3 years ago

I'm using ResNet with the CPU execution provider in C++. With ONNX Runtime's default settings it was 6x slower than PyTorch; options.AddConfigEntry(kOrtSessionOptionsConfigSetDenormalAsZero, "1"); solved that.
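
For anyone on the Python API, the equivalent setting (assuming the config key string "session.set_denormal_as_zero", which is what kOrtSessionOptionsConfigSetDenormalAsZero maps to) would look like:

from onnxruntime import InferenceSession, SessionOptions

so = SessionOptions()
# Treat denormal floats as zero to avoid very slow denormal arithmetic on CPU.
so.add_session_config_entry("session.set_denormal_as_zero", "1")
session = InferenceSession("model.onnx", so, providers=['CPUExecutionProvider'])  # placeholder model path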

stale[bot] commented 2 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.