Try int4 quantization with neural_compressor?
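For reference, a rough sketch of what weight-only INT4 quantization with Intel Neural Compressor could look like for an ONNX model. This assumes the Neural Compressor 2.x post-training API; the model filenames are placeholders, and the weight-only config keys (`bits`, `group_size`, the RTN algorithm) should be checked against the docs for the installed neural_compressor version:

```python
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig

# Weight-only quantization: pack MatMul weights to 4 bits while activations
# stay in higher precision. The op_type_dict keys are assumptions and may
# differ between neural_compressor releases.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        "MatMul": {
            "weight": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "RTN"}
        }
    },
)

# "bge-reranker-base.onnx" is a placeholder for the FP32 ONNX export;
# RTN rounds weights directly, so no calibration dataloader is passed here.
q_model = quantization.fit(model="bge-reranker-base.onnx", conf=conf)
q_model.save("bge-reranker-base-int4.onnx")
```

Whether INT4 actually helps on this CPU depends on kernel support in the runtime, so it is worth benchmarking against the existing INT8 model before switching.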
Describe the issue
I am using an INT8-quantized version of the BGE-reranker-base model exported to ONNX, and I process the inputs in batches. With the original PyTorch model I see a latency of 20-30 seconds. With the INT8-quantized, ONNX-optimized model the latency drops to 8-15 seconds, with all other settings (hardware, batch processing, etc.) kept identical to the PyTorch run. I serve the model through a Flask API on a quad-core machine. How can I reduce the latency of the ONNX model further? Please also suggest anything else I could do on the deployment side.
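For context, this is a minimal sketch of the kind of session tuning that usually helps on a small CPU box. The model path and the input names (`input_ids`, `attention_mask`) are assumptions taken from a typical BGE-reranker export; adjust them to the actual graph:

```python
import onnxruntime as ort

# Create the session once at startup and reuse it across Flask requests;
# rebuilding it per request adds significant overhead.
so = ort.SessionOptions()

# Match the intra-op thread count to the physical cores (quad-core here)
# and avoid oversubscription when several Flask workers run on the same box.
so.intra_op_num_threads = 4
so.inter_op_num_threads = 1
so.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# Apply all graph-level optimizations (operator fusion, constant folding, ...).
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# "bge-reranker-base-int8.onnx" is a placeholder for the quantized model path.
session = ort.InferenceSession(
    "bge-reranker-base-int8.onnx",
    sess_options=so,
    providers=["CPUExecutionProvider"],
)

def rerank(input_ids, attention_mask):
    # Inputs are numpy int64 arrays shaped [batch, seq_len]. Smaller batches
    # and a shorter max sequence length often cut latency more than any flag.
    return session.run(
        None, {"input_ids": input_ids, "attention_mask": attention_mask}
    )
```

Keeping the session global (one per worker process) and capping the tokenizer's max length to what the reranker actually needs are usually the cheapest wins on CPU.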
To reproduce
--
Urgency
Urgent
Platform
Linux
OS Version
22.04
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.15.0
ONNX Runtime API
Python
Architecture
X86
Execution Provider
Default CPU
Execution Provider Library Version
12+
Model File
No response
Is this a quantized model?
Yes