modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.76k stars 651 forks source link

Modin antipattern of `__getitem__` usage #4858

Open anmyachev opened 2 years ago

anmyachev commented 2 years ago

The code for pandas is usually written with the assumption that taking a column is a cheap operation (getting a view, without copying the memory), but this is not true for Modin, where each taking a column is an expensive copy operation due to serialization / deserialization.

We need to add antipattern into the docs, until we figure out how to make this operation cheaper.

Example (This is a part of Taxi workload code):

taxi_df = taxi_df[
    (taxi_df.fare_amount > 1) &
    (taxi_df.fare_amount < 500) &
    (taxi_df.passenger_count > 0) &
    (taxi_df.passenger_count < 6) &
    (taxi_df.pickup_longitude > -75) &
    (taxi_df.pickup_longitude < -73) &
    (taxi_df.dropoff_longitude > -75) &
    (taxi_df.dropoff_longitude < -73) &
    (taxi_df.pickup_latitude > 40) &
    (taxi_df.pickup_latitude < 42) &
    (taxi_df.dropoff_latitude > 40) &
    (taxi_df.dropoff_latitude < 42) &
    (taxi_df.trip_distance > 0) &
    (taxi_df.trip_distance < 500) &
    ((taxi_df.trip_distance <= 50) | (taxi_df.fare_amount >= 50)) &
    ((taxi_df.trip_distance >= 10) | (taxi_df.fare_amount <= 300)) &
    (taxi_df.dropoff_datetime > taxi_df.pickup_datetime)]

@modin-project/modin-core any thoughts?

Update: for simple data types, the zero copy pickling protocol will be used, which can be considered a kind of view for getitem operation. In this case, the main problem here will be a large number of temporary objects that will need to be serialized.

mvashishtha commented 2 years ago

@anmyachev that's a good point; we should at least warn in the docs that __getitem__ is currently not as cheap as in pandas. @vnlitvinov also pointed out that we could suggest that users instead use query, which is more concise and will combine all the logic into one remote call per row partition, e.g. for this example something like taxi_df.query("fare_amount > 1 & fare_amount < 500").

For a longer time horizon, @RehanSD and some others are thinking about how to introduce lazy evaluation so that we don't eagerly execute all the __getitem__ and binary operation calls as we get them.

jbrockmendel commented 2 years ago

FWIW pandas.DataFrame.__getitem__ uses _item_cache to cache these lookups

mvashishtha commented 2 years ago

FWIW pandas.DataFrame.getitem uses _item_cache to cache these lookups

@jbrockmendel should Modin do the same? I see that __getitem__ caches the view from the first call, so subsequent calls to get an item return the same mutable view from the first call. I think we could give the same semantics in Modin.

jbrockmendel commented 2 years ago

should Modin do the same?

If it were up to me i'd re-use as much as pandas code as possible (ideally just subclass pd.DataFrame)