machow / siuba

Python library for using dplyr like syntax with pandas and SQL
https://siuba.org
MIT License
1.14k stars 48 forks source link

Support for Modin and PyMars #461

Open RAbraham opened 1 year ago

RAbraham commented 1 year ago

Hi, I was wondering if siuba could support Modin(https://modin.readthedocs.io/en/stable/index.html#modin-is-a-dataframe-for-datasets-from-1mb-to-1tb) and pymars(https://docs.pymars.org/en/latest/). Both are touted as api replacements for pandas

Caveats

from siuba.data import cars
from siuba import _, filter
import modin.pandas as pd
modin_df = pd.DataFrame(cars)
result_df = filter(cars, _.mpg == _.mpg.max())
print('------------------------')
print(result_df)

print('---------- Modin --------------')
result_modin = filter(modin_df, _.mpg == _.mpg.max())
print(result_modin)

I'm interested in the Ray ML platform(both Modin and PyMars are dataframe apis over the distributed Ray platform) so if you are interested, it would be great to make this work for

pip install "modin[ray]"
machow commented 1 year ago

Hey! It looks modin DataFrames are not a subclass of the pandas DataFrame, so siuba verbs like mutate, filter, etc.. do not know they should operate on them exactly as they do for pandas.

It looks like explicitly registering things like modin does allow them to dispatch correctly:

import modin.pandas as pd
import pandas as pd2

from siuba import _, mutate

df = pd.DataFrame({'x': [1,2,3]})

mutate.register(df.__class__, mutate.dispatch(pd2.DataFrame))
mutate(df, res = _.x + 1)

It seems like there are two challenges with implementing this:

  1. We don't want to import modin every time we import siuba. So we'll either need to register an abstract base class, or put the modin implementations in a submodule. It seems like a DataFrame abstract base class would be useful, since people could also register new DataFrames to dispatch on with it.
  2. modin's DataFrameGroupBy is also not a pandas subclass, so we'll need to register it also.

(Maybe a last, future piece is that siuba has a system to speed up its pandas grouped operations, that also relies on pandas types :/. Would be quick to adjust, but requires again likely more abstract base classes, unless there's a way to connect a modin DataFrame back to pandas that I'm missing 😓)

RAbraham commented 1 year ago

Thanks for looking into it in depth.

I can't find it right now, but I think there is a way to convert from modin to pandas. Having said that, that may work if we do such a conversion after all the aggregations produce a dataframe that fits into memory but if we do it very early in the pipeline, then it may error out if the original modin dataframe is very big. re: abstract classes, sounds good. Whenever you get time :)

If you have time, what about PyMars? I think that may fall more in the LazyTbl camp along with SQL but Pymars I think allows one to run python udfs over the dataframe which we can't do with a SQL backend I guess?

RAbraham commented 1 year ago

Just curious about this idea

I ask because I started writing a library with Modin as a backend and I felt I was merely duplicating a lot of ideas that you have so beautifully executed on. Siuba is one of the finest library designs that I have come across.