Open lasttero opened 6 months ago
Good idea, I assume as the payload is stringified and sent as payload.
On the other hand, json encoding took around 20% of the CPU, in some cases was responsible for up to half the share of latency time. I solved the issue by switching to orjson
. I do not think that https://github.com/ijl/orjson supports such a feature.
So pro:
Con:
orjson
afaikThank you for responding quickly. Inspired by the comment above I realized I had a sub-optimal implementation for JSON parsing, and replaced it with hand-coded parser for the fastest processing. It would be beneficial to have this, but not anymore critical. Backgrounder: we run a number of infinity processes locally on the same GPU (as that seem to stochastically interleave GPU usage to maximize GPU utilization and total throughput). Again, thank you for the convenient application.
I slightly optimized queueing - I don't think the decimals in the json would significantly influence the throughput.
Please consider adding a parameter to set the number of decimals in the Json output. This would be beneficial to reduce network bandwidth requirements and the time for parsing the output. This is relevant for users who do not need/want full accuracy e.g. is the embedding values are quantized and/or have a latency critical applications.