lenskit / lkpy

Python recommendation toolkit
https://lkpy.lenskit.org
MIT License

High-performance recommender output storage #495

Open mdekstrand opened 3 weeks ago

mdekstrand commented 3 weeks ago

In experiments I have been running, retrieving and saving results is a significant bottleneck in parallel batch inference. It is substantially hindering throughput: each worker only runs at 30-40% of a CPU on my large data-crunching rig.

It is possible that item lists will speed this up, but if not, I would like to look at a more efficient way to collect batch-inference results for saving and/or measurement.

One potential solution is to save each worker's results in a separate Parquet file.

Another promising direction is Arrow Flight, an RPC framework built on top of Arrow IPC. An ItemList can be trivially converted to an Arrow Table, which can then be sent over Flight. We could implement a Flight server, in either Python or Rust, that receives item lists and incorporates them into the results.

Some open questions:

mdekstrand commented 3 weeks ago

I have done a quick benchmark, and serializing an item list to PyArrow IPC is not more efficient than pickling it.

That was with short lists; as the lists get longer, the gap widens.