michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
https://michaelfeil.github.io/infinity/
MIT License
1.06k stars, 75 forks

Idea: add a parameter to configure number of decimals in JSON output #64

Open lasttero opened 6 months ago

lasttero commented 6 months ago

Please consider adding a parameter to set the number of decimals in the JSON output. This would reduce network bandwidth requirements and the time spent parsing the output. It is relevant for users who do not need or want full precision, e.g. if the embedding values are quantized and/or the application is latency-critical.
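To make the request concrete, here is a minimal sketch of what a server-side `decimals` option could do: round every value before serialization so the JSON text gets shorter. The `decimals` parameter, the `encode_embeddings` helper, and the simplified OpenAI-style response shape are all hypothetical, assumed for illustration; infinity does not expose such an option, and the stdlib `json` module stands in for whatever encoder the server actually uses.

```python
import json
from typing import Optional


def encode_embeddings(embeddings: list[list[float]], decimals: Optional[int] = None) -> str:
    """Serialize embeddings to compact JSON, optionally rounding each value.

    `decimals` is a hypothetical parameter (not part of infinity's API).
    The response shape below is a simplified, assumed OpenAI-style layout.
    """
    if decimals is not None:
        # Round each component; shorter decimal representations mean fewer bytes.
        embeddings = [[round(x, decimals) for x in vec] for vec in embeddings]
    payload = {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i, "embedding": vec}
            for i, vec in enumerate(embeddings)
        ],
    }
    # Compact separators avoid spending bandwidth on whitespace.
    return json.dumps(payload, separators=(",", ":"))
```

For example, `encode_embeddings([[0.123456789, -0.987654321, 0.5]], decimals=3)` emits the vector as `[0.123,-0.988,0.5]`, noticeably shorter than the full-precision output. The savings scale with vector length, since every component loses the same number of digits.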

michaelfeil commented 6 months ago

Good idea. I assume it would help, as the embeddings are stringified and sent as a JSON payload.

On the other hand, JSON encoding took around 20% of the CPU time and, in some cases, accounted for up to half of the latency. I solved this by switching to orjson. I do not think that https://github.com/ijl/orjson supports such a feature.
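When the encoder offers no precision option, one possible workaround is to format the numbers yourself and splice the resulting text into the response. The sketch below writes a flat float vector as a JSON array with a fixed number of decimals; `vector_to_json` is a hypothetical helper, not something orjson or infinity provides.

```python
import math


def vector_to_json(vec: list[float], decimals: int = 4) -> str:
    """Hand-format a flat float vector as a JSON array with fixed precision.

    Hypothetical helper: it bypasses the JSON encoder's float formatting.
    """
    if not all(math.isfinite(x) for x in vec):
        # JSON has no NaN/Infinity literals, so reject them explicitly.
        raise ValueError("vector contains non-finite values")
    spec = f".{decimals}f"  # fixed-point format, e.g. ".4f"
    return "[" + ",".join(format(x, spec) for x in vec) + "]"
```

For instance, `vector_to_json([0.123456789, -0.5, 1.0], 3)` returns `"[0.123,-0.500,1.000]"`. Note the trade-off raised above: in CPython this per-element loop will likely cost more CPU than a native encoder such as orjson, which is exactly the overhead the switch was meant to remove.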

So pro:

Con:

lasttero commented 6 months ago

Thank you for the quick response. Inspired by the comment above, I realized my JSON parsing implementation was suboptimal and replaced it with a hand-coded parser for the fastest possible processing. Having this feature would still be beneficial, but it is no longer critical. Background: we run several infinity processes locally on the same GPU, as this seems to interleave GPU usage stochastically, maximizing GPU utilization and total throughput. Again, thank you for this convenient application.
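The hand-coded parser itself is not shown in the thread. As a rough illustration of the idea only, a specialized parser can skip general JSON machinery when the input is known to be a flat array of numbers. The `parse_float_array` helper below is hypothetical and assumes no nesting, strings, or nulls; it is not lasttero's implementation.

```python
def parse_float_array(text: str) -> list[float]:
    """Parse a flat JSON array of numbers, e.g. "[0.1,-0.2,3e-4]".

    Hypothetical sketch: assumes no nesting, strings, or nulls. Python's
    float() is also more lenient than JSON (it accepts "nan", "inf").
    """
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a flat JSON array")
    body = body[1:-1].strip()
    if not body:
        return []
    # float() tolerates surrounding whitespace, so " 0.5" parses fine.
    return [float(tok) for tok in body.split(",")]
```

Because it assumes a known, fixed shape, a parser like this trades generality and strict validation for simplicity; it should only be applied to the embedding array after it has been located in a trusted response.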

michaelfeil commented 5 months ago

I slightly optimized the queueing. I don't think the number of decimals in the JSON would significantly influence throughput.