ONNX models are cumbersome and hard to maintain, and the APIs for them via Hugging Face `optimum` and `transformers` keep changing. On top of that, ONNX hasn't supported Python 3.11 for a while now, so it doesn't make sense to rely on it long term. In most cases it becomes necessary anyway to run `sbert` on multiple GPUs; in a production scenario these scale up much more effectively via Ray, so it's better to focus efforts there than on ONNX and quantization for speedups.