modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.87k stars 651 forks source link

Why did Modin perform worse than pandas on a 16 GB dataset? #4798

Closed peterbolson closed 2 years ago

peterbolson commented 2 years ago

Rob Mulla, in this YouTube video, benchmarked pandas vs. Modin vs. Ray vs. Dask vs. Vaex on a 16 gigabyte Reddit Place dataset (which I believe we can pull and clean using three notebooks here: https://github.com/RobMulla/twitch-stream-projects/tree/main/026-reddit-place-data/notebooks, but I will check with Rob).

Two Modin-relevant things came out of this:

OS: Unknown (Rob will have to fill this in) Cores: It looks like there are 64 Memory: I'm not sure, but above 16 GB (I'll have to ask Rob) Modin version: 0.14.0 Pandas version: 1.4.2

Two questions: The first and most important: Why, on a dataset this large, isn't Modin performing better than pandas (except for on cumsum)? Does it have something to do with the fact that there are only five columns? Is parallelization really happening? (I couldn't tell, but from some of the brief screen shares of the Ray worker dashboard, I couldn't see whether more than one core was active.) Does Dask do better? It seems like this should be an instance where parallelization would lead to fairly substantial returns, particularly because Rob's machine appears to have 64 cores.

The second question is much more narrow -- Why did Rob get an error when he ran .head()?

I'll try to follow up on this by getting you the dataset Rob used, either by running the code in his personal repo, or by seeing if he can give us a link to the prepared data. For the exact commands Rob used, we can check out the Modin section of the YouTube video.

Thank you!

mvashishtha commented 2 years ago

@peterbolson thanks for reporting. We should try to make Modin do better here. I'll assign this issue to you for now while you try to find the data and reproduce the slowness.

peterbolson commented 2 years ago

That sounds good. Thanks, @mvashishtha

mvashishtha commented 2 years ago

@peterbolson and I discussed this offline. One potential problem we noticed with Rob's benchmark was that Modin went after pandas, so it may have performed worse because much of the memory available on the system was already occupied with pandas data. The slowness in the video may not represent Modin's performance if it were to go before pandas. Regarding the slow functions, we have different expectations for each one, and there are potential fixes for some of them:

For anyone interested in trying out the original data file, Rob's notebooks here show how to get the data: https://github.com/RobMulla/twitch-stream-projects/tree/main/026-reddit-place-data/notebooks

mvashishtha commented 2 years ago

I filed #4916 for unique. I don't see any other performance issue we can take concrete action on any time soon.

The ray bug on calling head() is hard to debug without the full stack trace and a reproducer. We can look at it in a separate issue.

mvashishtha commented 2 years ago

@vnlitvinov also filed #4918 for cumsum and other cumulative functions.