Open anmyachev opened 2 years ago
@anmyachev that's a good point; we should at least warn in the docs that __getitem__
is currently not as cheap as in pandas. @vnlitvinov also pointed out that we could suggest that users instead use query
, which is more concise and will combine all the logic into one remote call per row partition, e.g. for this example something like taxi_df.query("fare_amount > 1 & fare_amount < 500")
.
For a longer time horizon, @RehanSD and some others are thinking about how to introduce lazy evaluation so that we don't eagerly execute all the __getitem__
and binary operation calls as we get them.
FWIW pandas.DataFrame.__getitem__
uses _item_cache to cache these lookups
FWIW pandas.DataFrame.getitem uses _item_cache to cache these lookups
@jbrockmendel should Modin do the same? I see that __getitem__
caches the view from the first call, so subsequent calls to get an item return the same mutable view from the first call. I think we could give the same semantics in Modin.
should Modin do the same?
If it were up to me i'd re-use as much as pandas code as possible (ideally just subclass pd.DataFrame)
The code for pandas is usually written with the assumption that taking a column is a cheap operation (getting a view, without copying the memory), but this is not true for Modin, where each taking a column is an expensive copy operation due to serialization / deserialization.
We need to add antipattern into the docs, until we figure out how to make this operation cheaper.
Example (This is a part of Taxi workload code):
@modin-project/modin-core any thoughts?
Update: for simple data types, the zero copy pickling protocol will be used, which can be considered a kind of
view
for getitem operation. In this case, the main problem here will be a large number of temporary objects that will need to be serialized.