For example, can you provide a quantized version of the model?
I am running bce-embedding-base_v1 and bce-reranker-base_v1 using Xinference, and it takes up more than 4GB of memory. This is roughly the entire memory space of a low-end server.
bce-embedding-base_v1 and bce-reranker-base_v1 are both bert-base sized models, which makes them very practical to deploy.
For efficiency, you can run the models in fp16 mode with onnxruntime-gpu, which may need around 2 GB per model. See the qanything project for more details. I am not sure whether an int8 model via TensorRT or fp16 via onnxruntime-gpu is more efficient, owing to batching and padding.
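A minimal sketch of the fp16 + onnxruntime-gpu approach is below. It assumes the embedding model has already been exported to ONNX and converted to fp16 (e.g. with onnxconverter-common); the file name `bce-embedding-base_v1-fp16.onnx`, the graph input/output names, and the mean-pooling step are illustrative assumptions, not an official recipe from this repo.

```python
# Sketch: running a bert-base sized embedding model in fp16 with onnxruntime-gpu.
# Assumes an ONNX export already converted to fp16; file name is a placeholder.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("maidalun1020/bce-embedding-base_v1")
session = ort.InferenceSession(
    "bce-embedding-base_v1-fp16.onnx",          # placeholder path to the fp16 export
    providers=["CUDAExecutionProvider"],        # falls back to CPU if CUDA is unavailable
)

sentences = ["example query", "example passage"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="np")

# Input names depend on how the model was exported; these are the common defaults.
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64),
    },
)

# Mean-pool the last hidden state over non-padding tokens, then L2-normalize,
# assuming the first output of the exported graph is the last hidden state.
last_hidden = outputs[0]                                   # (batch, seq_len, hidden)
mask = inputs["attention_mask"][..., None].astype(last_hidden.dtype)
embeddings = (last_hidden * mask).sum(axis=1) / mask.sum(axis=1)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
print(embeddings.shape)
```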