Try int4 quantization with neural_compressor?
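For reference, a rough sketch of what weight-only INT4 quantization with Intel Neural Compressor could look like for an ONNX model. This assumes the Neural Compressor 2.x post-training API; the model filenames are placeholders, and the weight-only config keys (`bits`, `group_size`, the RTN algorithm) should be checked against the docs for the installed neural_compressor version:

```python
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig

# Weight-only quantization: pack MatMul weights to 4 bits while activations
# stay in higher precision. The op_type_dict keys are assumptions and may
# differ between neural_compressor releases.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        "MatMul": {
            "weight": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "RTN"}
        }
    },
)

# "bge-reranker-base.onnx" is a placeholder for the FP32 ONNX export;
# RTN rounds weights directly, so no calibration dataloader is passed here.
q_model = quantization.fit(model="bge-reranker-base.onnx", conf=conf)
q_model.save("bge-reranker-base-int4.onnx")
```

Whether INT4 actually helps on this CPU depends on kernel support in the runtime, so it is worth benchmarking against the existing INT8 model before switching.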
Describe the issue
I am using an INT8-quantized version of the BGE-reranker-base model exported to ONNX, and I process the inputs in batches. With the original PyTorch model I see a latency of 20-30 seconds. With the INT8-quantized, ONNX-optimized model the latency drops to 8-15 seconds, with all other settings (hardware, batch processing, etc.) kept identical to the PyTorch run. I serve the model through a Flask API on a quad-core machine. How can I reduce the latency of the ONNX model further? Please also suggest anything else I could do on the deployment side.
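For context, this is a minimal sketch of the kind of session tuning that usually helps on a small CPU box. The model path and the input names (`input_ids`, `attention_mask`) are assumptions taken from a typical BGE-reranker export; adjust them to the actual graph:

```python
import onnxruntime as ort

# Create the session once at startup and reuse it across Flask requests;
# rebuilding it per request adds significant overhead.
so = ort.SessionOptions()

# Match the intra-op thread count to the physical cores (quad-core here)
# and avoid oversubscription when several Flask workers run on the same box.
so.intra_op_num_threads = 4
so.inter_op_num_threads = 1
so.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# Apply all graph-level optimizations (operator fusion, constant folding, ...).
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# "bge-reranker-base-int8.onnx" is a placeholder for the quantized model path.
session = ort.InferenceSession(
    "bge-reranker-base-int8.onnx",
    sess_options=so,
    providers=["CPUExecutionProvider"],
)

def rerank(input_ids, attention_mask):
    # Inputs are numpy int64 arrays shaped [batch, seq_len]. Smaller batches
    # and a shorter max sequence length often cut latency more than any flag.
    return session.run(
        None, {"input_ids": input_ids, "attention_mask": attention_mask}
    )
```

Keeping the session global (one per worker process) and capping the tokenizer's max length to what the reranker actually needs are usually the cheapest wins on CPU.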
To reproduce
--
Urgency
Urgent
Platform
Linux
OS Version
22.04
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.15.0
ONNX Runtime API
Python
Architecture
X86
Execution Provider
Default CPU
Execution Provider Library Version
12+
Model File
No response
Is this a quantized model?
Yes