For example, can you provide a quantized version of the model?
I am running bce-embedding-base_v1 and bce-reranker-base_v1 using Xinference, and it takes up more than 4GB of memory. This is roughly the entire memory space of a low-end server.
bce-embedding-base_v1 and bce-reranker-base_v1 are both bert-base sized models, which makes them very practical to deploy.
For efficiency, you can run the models in fp16 mode with onnxruntime-gpu, which may need around 2 GB per model. See the qanything project for more details. I am not sure whether an int8 model via TensorRT or fp16 via onnxruntime-gpu is more efficient, owing to batching and padding.
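A minimal sketch of the fp16 + onnxruntime-gpu approach is below. It assumes the embedding model has already been exported to ONNX and converted to fp16 (e.g. with onnxconverter-common); the file name `bce-embedding-base_v1-fp16.onnx`, the graph input/output names, and the mean-pooling step are illustrative assumptions, not an official recipe from this repo.

```python
# Sketch: running a bert-base sized embedding model in fp16 with onnxruntime-gpu.
# Assumes an ONNX export already converted to fp16; file name is a placeholder.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("maidalun1020/bce-embedding-base_v1")
session = ort.InferenceSession(
    "bce-embedding-base_v1-fp16.onnx",          # placeholder path to the fp16 export
    providers=["CUDAExecutionProvider"],        # falls back to CPU if CUDA is unavailable
)

sentences = ["example query", "example passage"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="np")

# Input names depend on how the model was exported; these are the common defaults.
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64),
    },
)

# Mean-pool the last hidden state over non-padding tokens, then L2-normalize,
# assuming the first output of the exported graph is the last hidden state.
last_hidden = outputs[0]                                   # (batch, seq_len, hidden)
mask = inputs["attention_mask"][..., None].astype(last_hidden.dtype)
embeddings = (last_hidden * mask).sum(axis=1) / mask.sum(axis=1)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
print(embeddings.shape)
```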