Closed peterbolson closed 2 years ago
@peterbolson thanks for reporting. We should try to make Modin do better here. I'll assign this issue to you for now while you try to find the data and reproduce the slowness.
That sounds good. Thanks, @mvashishtha
@peterbolson and I discussed this offline. One potential problem we noticed with Rob's benchmark was that Modin ran after pandas, so it may have performed worse because much of the memory on the system was already occupied by pandas data. The slowness in the video may not represent Modin's performance had it run before pandas. Regarding the slow functions, we have different expectations for each one, and there are potential fixes for some of them:
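One way to take run order out of the equation (a generic sketch, not Rob's actual setup; the `bench_in_fresh_process` helper is hypothetical): time each library in a brand-new interpreter process, so memory held by an earlier run can't skew a later one.

```python
import subprocess
import sys
import textwrap

def bench_in_fresh_process(snippet: str) -> float:
    """Time a one-line snippet in a fresh Python process, so memory
    held by a previously benchmarked library cannot skew the result."""
    program = textwrap.dedent(f"""\
        import time
        _start = time.perf_counter()
        {snippet}
        print(time.perf_counter() - _start)
    """)
    result = subprocess.run(
        [sys.executable, "-c", program],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())

# Each engine would get its own clean process, so order no longer matters:
elapsed = bench_in_fresh_process("sum(range(1_000_000))")
```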
- `read_parquet`: we wouldn't expect Modin 0.15.2 to do better than pandas, because Modin wouldn't parallelize over columns any more than [pandas already would](https://arrow.apache.org/docs/python/parquet.html#multithreaded-reads). Modin also has to pay extra costs, including the cost of putting the data in the object store. The latest Modin source version parallelizes over row groups, so that might help, but I don't know how many row groups this file had.
- `std` and `mean` on a column: usually when pandas takes something like 88 ms for an operation, we don't expect Modin to do much better. The overhead of serializing and deserializing the data can outweigh the benefits of parallel execution in such cases. Experimenting with a similar data notebook, @peterbolson found that Modin was not drastically slower than pandas when Modin ran before pandas, so I don't think we have any outstanding tasks here.
- `unique`: there's no way Modin's implementation for a series is better than pandas's. Modin's implementation for a single column gives no parallelism, because Modin executes `unique` column-wise. @vnlitvinov pointed out that we could look into parallelizing `unique` over the row axis.
- `cumsum`: this function actually executed asynchronously, and Rob didn't block on getting the results by printing them, as he did in the other tests. I'm not sure how Modin would have done here. Results from repeating this test on smaller data and letting Modin go first suggest that Modin wouldn't have done much better. `cumsum` is also applied on column partitions, like `std` and `mean`. I can't think of anything we can do right now to make it much faster.

For anyone interested in trying out the original data file, Rob's notebooks here show how to get the data: https://github.com/RobMulla/twitch-stream-projects/tree/main/026-reddit-place-data/notebooks
I filed #4916 for `unique`. I don't see any other performance issue we can take concrete action on any time soon.

The Ray bug on calling `head()` is hard to debug without the full stack trace and a reproducer. We can look at it in a separate issue.
@vnlitvinov also filed #4918 for `cumsum` and other cumulative functions.
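The async-timing pitfall mentioned for `cumsum` is easy to reproduce in plain Python (a generic illustration, not Modin's actual execution model): with a lazy or asynchronous engine, the timer stops before any work happens unless you force the result.

```python
import time

def lazy_cumsum(values):
    """A deliberately lazy cumulative sum: nothing runs until consumed."""
    total = 0
    for v in values:
        total += v
        yield total

start = time.perf_counter()
result = lazy_cumsum(range(1_000_000))  # returns instantly; no work done yet
creation_time = time.perf_counter() - start

start = time.perf_counter()
forced = list(result)  # consuming the generator does the real work
forced_time = time.perf_counter() - start
# Timing only the first step, as the benchmark may have done, under-reports cost.
```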
Rob Mulla, in this YouTube video, benchmarked pandas vs. Modin vs. Ray vs. Dask vs. Vaex on a 16 gigabyte Reddit Place dataset (which I believe we can pull and clean using three notebooks here: https://github.com/RobMulla/twitch-stream-projects/tree/main/026-reddit-place-data/notebooks, but I will check with Rob).
Two Modin-relevant things came out of this:
OS: Unknown (Rob will have to fill this in)
Cores: It looks like there are 64
Memory: I'm not sure, but above 16 GB (I'll have to ask Rob)
Modin version: 0.14.0
Pandas version: 1.4.2
Two questions:

1. The first and most important: why, on a dataset this large, isn't Modin performing better than pandas (except on `cumsum`)? Does it have something to do with the fact that there are only five columns? Is parallelization really happening? (I couldn't tell; from some of the brief screen shares of the Ray worker dashboard, I couldn't see whether more than one core was active.) Does Dask do better? It seems like this should be an instance where parallelization leads to fairly substantial returns, particularly because Rob's machine appears to have 64 cores.
2. The second question is much narrower: why did Rob get an error when he ran `.head()`?
I'll try to follow up on this by getting you the dataset Rob used, either by running the code in his personal repo, or by seeing if he can give us a link to the prepared data. For the exact commands Rob used, we can check out the Modin section of the YouTube video.
Thank you!