modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

Scaling *down* instead of up: Modin DataFrame running on top of Python dicts. #4337

Open ARDivekar opened 2 years ago

ARDivekar commented 2 years ago


Concisely: if you have data as a list of dicts or JSON objects, Pandas is quite slow. This is an extremely common problem when data is sent to HTTP servers for processing, where the input is often a JSON payload representing a serialized Java/Python data object. If you have <10 rows, it takes a significant number of milliseconds just to create a Pandas DataFrame, and further processing is slow. In comparison, it is much faster to perform operations on a list of Python dictionaries. A rough micro-benchmark to illustrate (numbers will vary by machine and Pandas version):
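
```python
import time
import pandas as pd

rows = [{"a": i, "b": str(i)} for i in range(10)]

# Constructing a DataFrame from 10 dicts costs a surprising amount per call...
start = time.perf_counter()
for _ in range(1_000):
    df = pd.DataFrame(rows)
    total = df["a"].sum()
elapsed_df = time.perf_counter() - start

# ...while the same reduction over the raw dicts is much cheaper.
start = time.perf_counter()
for _ in range(1_000):
    total = sum(row["a"] for row in rows)
elapsed_dicts = time.perf_counter() - start

print(f"pandas: {elapsed_df:.3f}s, dicts: {elapsed_dicts:.3f}s")
```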

At the same time, the Pandas DataFrame API is extremely popular due to its convenience. It would be excellent to have a DataFrame API which internally operates on a list of Python dicts (or a similarly fast format, such as NumPy structured arrays). This feature could later evolve to integrate tools that speed up low-latency processing on <10 rows, such as Numba for JIT compilation. As a sketch of what I mean, here is a toy dict-backed frame; the `DictFrame` name and its per-row `assign` semantics are hypothetical, not an existing Modin or Pandas API:
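
```python
from typing import Any, Callable, Dict, List

class DictFrame:
    """Toy DataFrame-like wrapper over a list of Python dicts."""

    def __init__(self, rows: List[Dict[str, Any]]):
        self._rows = rows

    def __getitem__(self, column: str) -> List[Any]:
        # Column access is a plain comprehension; no columnar layout to build.
        return [row[column] for row in self._rows]

    def assign(self, **fns: Callable[[Dict[str, Any]], Any]) -> "DictFrame":
        # Unlike pandas.DataFrame.assign (which passes the whole frame),
        # this toy version passes each row dict to the column function.
        return DictFrame(
            [dict(row, **{name: fn(row) for name, fn in fns.items()})
             for row in self._rows]
        )

df = DictFrame([{"a": 1}, {"a": 2}])
print(df.assign(b=lambda row: row["a"] * 10)["b"])  # [10, 20]
```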

Q: Why is Modin the right place for this?

The Modin project's primary aim is to let data scientists forget about how their data is laid out and just focus on writing Pandas code. So far, it has done an excellent job of scaling up, but there is no convenient library for low-latency data processing, which means it falls to the data scientist to pick up the slack. Low-latency data processing is critical for many large companies which host ML services (including my employer, Amazon, but also Google, Meta, Microsoft... any project where data scientists need to preprocess data in complicated ways and then host that preprocessing code behind a web server).

Q: Is this feature feasible?

I have personally worked on some code for this, and I know it is possible (I'm not sharing at the moment due to NDA, but I might share later if there is interest in this feature and I can get internal approvals). In cases where Pandas API functions are not implemented, it can fall back on the Pandas implementation by converting a list-of-dicts to Pandas and back (which, from my understanding, is how Modin currently operates for its backend integrations with Dask/Ray/etc.). A minimal sketch of that fallback path; `_default_to_pandas` is an illustrative name, not Modin's actual internals:
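
```python
import pandas as pd

def _default_to_pandas(rows, method_name, *args, **kwargs):
    # Hypothetical fallback: round-trip through Pandas for any method the
    # dict-backed frame does not implement itself.
    df = pd.DataFrame(rows)
    result = getattr(df, method_name)(*args, **kwargs)
    if isinstance(result, pd.DataFrame):
        return result.to_dict(orient="records")
    return result

rows = [{"a": 2, "b": 1}, {"a": 1, "b": 2}]
print(_default_to_pandas(rows, "sort_values", "a"))
# [{'a': 1, 'b': 2}, {'a': 2, 'b': 1}]
```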

TL;DR

To summarize, it makes sense to have a Pandas-like API which runs on dicts and have it integrated into Modin. I have so far not found a similar project which does this.

devin-petersohn commented 2 years ago

Hi @ARDivekar, thanks for posting!

This is a really interesting proposal. From my understanding, your proposal is for latency-sensitive web applications using small datasets, where every millisecond counts. You would like to use Modin in the context of tiny data and have an extremely optimized implementation that operates directly on the Python objects.

Some questions/comments to help the conversation:

ARDivekar commented 2 years ago

> Modin is not currently optimized for ultra-low latency. Some error checking may need to be bypassed in order to get the best performance.

It depends on the implementation. The DataFrame structure I mentioned will not call Pandas for any implemented methods; instead it will process the data itself using a list/NumPy array of Python dicts, so the same checks will not be needed.

> Why 10 rows? Why not 100 or 1000 rows?

I used 10 rows as an example, but you're right, the same logic extends to <1,000 rows (this is as far as we have tested internally; beyond that, Pandas starts to have the advantage).

> JIT compilation may not really be useful at 10 rows since the compile times are best amortized over lots of data, but it would still be interesting to look into.

From my understanding, Numba's JIT compilation runs once per interpreter session. When serving an ML model behind an endpoint, the interpreter is usually active for weeks or months at a time and serves millions of batches of data (with batch sizes typically between 1 and 1,000 rows). Only the very first batch would be slow due to JIT compilation lag. For illustration, a minimal sketch of that amortization, assuming Numba is installed (the function and data here are made up):
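
```python
import numba
import numpy as np

@numba.njit(cache=True)
def score(xs):
    # Trivial per-batch computation standing in for real preprocessing.
    total = 0.0
    for x in xs:
        total += x * x
    return total

batch = np.arange(10, dtype=np.float64)
score(batch)  # first call pays the compilation cost
score(batch)  # subsequent calls reuse the compiled machine code
```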

ARDivekar commented 2 years ago

Could the #performance label be added to this issue? @devin-petersohn

vnlitvinov commented 2 years ago

This could make a nice addition to the "small dataframe" query compiler idea, cc @mvashishtha @naren-ponder