Closed peterbolson closed 2 years ago
@peterbolson thanks for reporting. We should try to make Modin do better here. I'll assign this issue to you for now while you try to find the data and reproduce the slowness.
That sounds good. Thanks, @mvashishtha
@peterbolson and I discussed this offline. One potential problem we noticed with Rob's benchmark was that Modin ran after pandas, so it may have performed worse because much of the memory on the system was already occupied by pandas data. The slowness in the video may not represent Modin's performance had it run before pandas. Regarding the slow functions, we have different expectations for each one, and there are potential fixes for some of them:
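One way to take run order out of the equation (a generic sketch, not Rob's actual setup; the `bench_in_fresh_process` helper is hypothetical): time each library in a brand-new interpreter process, so memory held by an earlier run can't skew a later one.

```python
import subprocess
import sys
import textwrap

def bench_in_fresh_process(snippet: str) -> float:
    """Time a one-line snippet in a fresh Python process, so memory
    held by a previously benchmarked library cannot skew the result."""
    program = textwrap.dedent(f"""\
        import time
        _start = time.perf_counter()
        {snippet}
        print(time.perf_counter() - _start)
    """)
    result = subprocess.run(
        [sys.executable, "-c", program],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())

# Each engine would get its own clean process, so order no longer matters:
elapsed = bench_in_fresh_process("sum(range(1_000_000))")
```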
- `read_parquet`: we wouldn't expect Modin 0.15.2 to do better than pandas, because Modin wouldn't parallelize over columns any more than [pandas already would](https://arrow.apache.org/docs/python/parquet.html#multithreaded-reads). Modin also has to pay extra costs, including the cost of putting the data in the object store. The latest Modin source version parallelizes over row groups, so that might help, but I don't know how many row groups this file had.
- `std` and `mean` on a column: usually when pandas takes something like 88 ms for an operation, we don't expect Modin to do much better. The overhead of serializing and deserializing the data can outweigh the benefits of parallel execution in such cases. Experimenting with a similar data notebook, @peterbolson found that Modin was not drastically slower than pandas when Modin ran before pandas, so I don't think we have any outstanding tasks here.
- `unique`: there's no way Modin's implementation for a series is better than pandas's. Modin's implementation for a single column gives no parallelism, because Modin executes `unique` column-wise. @vnlitvinov pointed out that we could look into parallelizing `unique` over the row axis.
- `cumsum`: this function actually executed asynchronously, and Rob didn't block on getting the results by printing them, as he did in the other tests. I'm not sure how Modin would have done here. Results from repeating this test on smaller data and letting Modin go first suggest that Modin wouldn't have done much better. `cumsum` is also applied on column partitions, like `std` and `mean`. I can't think of anything we can do right now to make it much faster.

For anyone interested in trying out the original data file, Rob's notebooks here show how to get the data: https://github.com/RobMulla/twitch-stream-projects/tree/main/026-reddit-place-data/notebooks
I filed #4916 for `unique`. I don't see any other performance issue we can take concrete action on any time soon.

The Ray bug on calling `head()` is hard to debug without the full stack trace and a reproducer. We can look at it in a separate issue.
@vnlitvinov also filed #4918 for `cumsum` and other cumulative functions.
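The async-timing pitfall mentioned for `cumsum` is easy to reproduce in plain Python (a generic illustration, not Modin's actual execution model): with a lazy or asynchronous engine, the timer stops before any work happens unless you force the result.

```python
import time

def lazy_cumsum(values):
    """A deliberately lazy cumulative sum: nothing runs until consumed."""
    total = 0
    for v in values:
        total += v
        yield total

start = time.perf_counter()
result = lazy_cumsum(range(1_000_000))  # returns instantly; no work done yet
creation_time = time.perf_counter() - start

start = time.perf_counter()
forced = list(result)  # consuming the generator does the real work
forced_time = time.perf_counter() - start
# Timing only the first step, as the benchmark may have done, under-reports cost.
```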
Rob Mulla, in this YouTube video, benchmarked pandas vs. Modin vs. Ray vs. Dask vs. Vaex on a 16 gigabyte Reddit Place dataset (which I believe we can pull and clean using three notebooks here: https://github.com/RobMulla/twitch-stream-projects/tree/main/026-reddit-place-data/notebooks, but I will check with Rob).
Two Modin-relevant things came out of this:
OS: Unknown (Rob will have to fill this in)
Cores: It looks like there are 64
Memory: I'm not sure, but above 16 GB (I'll have to ask Rob)
Modin version: 0.14.0
Pandas version: 1.4.2
Two questions:

1. The first and most important: why, on a dataset this large, isn't Modin performing better than pandas (except on `cumsum`)? Does it have something to do with the fact that there are only five columns? Is parallelization really happening? (I couldn't tell; from some of the brief screen shares of the Ray worker dashboard, I couldn't see whether more than one core was active.) Does Dask do better? It seems like this should be an instance where parallelization leads to fairly substantial returns, particularly because Rob's machine appears to have 64 cores.
2. The second question is much narrower: why did Rob get an error when he ran `.head()`?
I'll try to follow up on this by getting you the dataset Rob used, either by running the code in his personal repo, or by seeing if he can give us a link to the prepared data. For the exact commands Rob used, we can check out the Modin section of the YouTube video.
Thank you!