mdekstrand opened 3 weeks ago
I have done a quick benchmark, and serializing an item list to PyArrow IPC is not more efficient than pickling it (with `HIGHEST_PROTOCOL`): 61ms. That is with short lists; when I let lists get longer, the gap increases.
Right now, in the experiments I have been running, there is a significant bottleneck in retrieving and saving results in parallel batch inference. It hinders throughput: each worker is only able to use 30-40% of a CPU on my large data-crunching rig.
It is possible that item lists will speed this up, but if not, I would like to look at a more efficient way to collect batch-inference results for saving and/or measurement.
One potential solution is to save each worker's results in a separate Parquet file.
Another promising direction is Arrow Flight, an RPC protocol built on top of Arrow IPC.
An `ItemList` can be trivially converted to an Arrow `Table`, which can then be serialized into a flight. We could implement a Flight server, in either Python or Rust, that processes item lists and incorporates them into the results.

Some open questions:
- Does a call to `do_put` block other clients?