Closed fdlci closed 3 years ago
Hi,
I have been comparing inference speeds between PyTorch models and their ONNX versions. To convert a model from PyTorch to ONNX, I used the code you provided in `convert_graph_to_onnx.py`.
Since I am applying it to QA, I built my ONNX model as follows: `python transformers/src/transformers/convert_graph_to_onnx.py --framework pt --model Camembert-base-ccnet-fquad11 --quantize cam_onnx/camembert-base.onnx --pipeline 'question-answering'`
This command outputs three models: `camembert-base.onnx`, `camembert-base-optimized.onnx`, and `camembert-base-optimized-quantize.onnx`.
I ran inference with all three models, expecting the quantized version to be much faster than `camembert-base.onnx`, but I observed the complete opposite. Why doesn't quantization produce a speedup in this case?
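For reference, here is roughly how I measure latency for each model. This is a minimal sketch: the warmup/averaging harness is generic, and the commented-out `onnxruntime` usage (session paths and input names) is illustrative, assuming the files produced by the conversion command above.

```python
import time

def benchmark(run_fn, feed, n_warmup=5, n_runs=50):
    """Return mean latency in milliseconds for run_fn(feed).

    Warmup iterations are excluded so one-time costs (lazy
    initialization, memory allocation) don't skew the average.
    """
    for _ in range(n_warmup):
        run_fn(feed)
    start = time.perf_counter()
    for _ in range(n_runs):
        run_fn(feed)
    return (time.perf_counter() - start) / n_runs * 1000.0

# Hypothetical usage with onnxruntime (paths/input names depend on the export):
# import onnxruntime as ort
# sess = ort.InferenceSession("cam_onnx/camembert-base-optimized-quantize.onnx")
# feed = {"input_ids": input_ids, "attention_mask": attention_mask}
# print(benchmark(lambda f: sess.run(None, f), feed))
```

The same harness is applied to all three sessions so the comparison only varies the model file.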
Thank you for your answer!