michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
https://michaelfeil.github.io/infinity/
MIT License

Adding max token budget per batch #87

Open michaelfeil opened 5 months ago

michaelfeil commented 5 months ago

The default currently allows up to batch_size=64 sequences per batch. With long-context models this can lead to high memory usage, e.g. for a jina-8k BERT model a full batch is 64 x 8192 = 524,288 tokens. It would be better to size batches dynamically against a token budget, e.g. 64 * 512 = 32,768 tokens per forward pass.
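A minimal sketch of what such a token budget could look like, not Infinity's actual implementation. The function name `batch_by_token_budget` and the parameters `max_tokens` and `max_batch_size` are hypothetical, and the sketch assumes every sequence in a batch is padded to the longest one, so the cost of a batch is `batch_len * longest_sequence`.

```python
from typing import List, Sequence


def batch_by_token_budget(
    lengths: Sequence[int],
    max_tokens: int = 64 * 512,   # hypothetical budget: 32768 padded tokens
    max_batch_size: int = 64,     # keep the existing cap on sequences per batch
) -> List[List[int]]:
    """Group sequence indices into batches that stay under a padded-token budget."""
    # Sort indices by length so similar lengths share a batch and padding stays small.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        # Lengths are ascending, so the incoming sequence is the longest in the
        # batch and determines the padded width of every row.
        padded_cost = (len(current) + 1) * lengths[idx]
        if current and (padded_cost > max_tokens or len(current) >= max_batch_size):
            batches.append(current)
            current = []
        # A single sequence longer than the budget still gets its own batch.
        current.append(idx)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    long_batches = batch_by_token_budget([8192] * 64)
    short_batches = batch_by_token_budget([512] * 64)
    print(len(long_batches), len(short_batches))
```

Under these assumptions, 64 sequences of 8192 tokens are split into 16 batches of 4 sequences each, while 64 sequences of 512 tokens still fit into a single batch of 64.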