kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0

Strange behavior of Mask on grouped df #24

Closed: bleearmstrong closed this issue 6 years ago

bleearmstrong commented 7 years ago

Mask doesn't seem to work correctly on grouped dataframes. Consider the three operations below.

diamonds >> mask(X.cut == 'Ideal') >> groupby(X.color)  # works as expected
diamonds >> groupby(X.color) >> mask(X.cut == 'Ideal')  # doesn't work correctly (strange behavior)
diamonds >> mask(X.cut == 'Ideal')                      # works as expected

In the first example, the mask is applied before the grouping, so it behaves as expected. In the second example, the grouping is applied before the mask. The returned dataframe includes rows where X.cut != 'Ideal', and it has only about 2500 rows. I'm not sure what determines which rows are returned. In the third example, there is no grouping, and the mask behaves as expected (it returns the same rows as the first example).
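For reference, the expected behavior can be reproduced in plain pandas. This is only a minimal sketch: the `toy` frame is a hypothetical stand-in for the diamonds data (the rows are made up), and it does not use dfply at all. Since the condition on cut is evaluated row by row, filtering within each color group should keep exactly the same rows as filtering the ungrouped frame.

```python
import pandas as pd

# Hypothetical toy stand-in for the diamonds data (made-up rows).
toy = pd.DataFrame({
    "cut": ["Ideal", "Good", "Ideal", "Premium", "Ideal", "Good"],
    "color": ["D", "D", "E", "E", "F", "F"],
    "x": [4.0, 4.2, 3.9, 4.1, 4.3, 3.8],
})

# Ungrouped filter: the plain-pandas analogue of mask(X.cut == 'Ideal').
ungrouped = toy[toy["cut"] == "Ideal"]

# Grouped filter: apply the same row-wise condition inside each color group,
# then stitch the per-group pieces back together.
pieces = [group[group["cut"] == "Ideal"] for _, group in toy.groupby("color")]
grouped = pd.concat(pieces)

# Both approaches should select the same rows.
print(sorted(grouped.index) == sorted(ungrouped.index))
```

If a grouped mask returns rows that the ungrouped mask does not (or drops rows it keeps), that points at how the condition is evaluated per group rather than at the condition itself.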

There are use cases where you might want to use grouping and mask together. For example, to return the minimum for each group, you might want to do something like:

diamonds >> groupby(X.color) >> mask(X.x == X.x.min())  # (1)
diamonds >> mask(X.x == X.x.min())                      # (2)

For (1), it returns a single row whose x is not the minimum of any group, whereas (2) behaves as expected.
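To make the expected results of (1) and (2) concrete, here is a rough plain-pandas sketch. The `toy` frame is again a hypothetical stand-in with made-up values, and `groupby(...).transform("min")` is used here as the pandas way to get a per-group minimum aligned to each row; it is not a claim about how dfply implements mask.

```python
import pandas as pd

# Hypothetical toy stand-in for the diamonds data (made-up values).
toy = pd.DataFrame({
    "color": ["D", "D", "E", "E", "F", "F"],
    "x": [4.0, 4.2, 3.9, 4.1, 4.3, 3.8],
})

# Expected result of (1): compare each x with the minimum of its own color group.
group_min = toy.groupby("color")["x"].transform("min")
per_group = toy[toy["x"] == group_min]

# Expected result of (2): compare each x with the minimum of the whole frame.
overall = toy[toy["x"] == toy["x"].min()]

print(per_group)  # one row per color, each holding that color's smallest x
print(overall)    # only the row(s) holding the global smallest x
```

With this toy data, the grouped version keeps one row per color, while the ungrouped version keeps only the single row with the overall smallest x.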

kieferk commented 6 years ago

Hello @bleearmstrong, sorry it's been a very long time... but I'm back now.

I tried this out in the new v0.3.0 of the package and the improved backend code appears to have resolved this issue:

In [37]: tmp = diamonds >> group_by(X.color) >> mask(X.cut == 'Ideal')

In [38]: tmp.shape
Out[38]: (21551, 10)

In [39]: tmp.head()
Out[39]: 
     carat    cut color clarity  depth  table  price     x     y     z
62    0.30  Ideal     D     SI1   62.5   57.0    552  4.29  4.32  2.69
63    0.30  Ideal     D     SI1   62.1   56.0    552  4.30  4.33  2.68
120   0.71  Ideal     D     SI2   62.3   56.0   2762  5.73  5.69  3.56
132   0.71  Ideal     D     SI1   61.9   59.0   2764  5.69  5.72  3.53
144   0.71  Ideal     D     SI2   61.6   55.0   2767  5.74  5.76  3.54

In [40]: tmp.cut.unique()
Out[40]: array(['Ideal'], dtype=object)

That was a quick test though, so correct me if I'm wrong. Also, note that groupby is now group_by in v0.3.0 to be more consistent with dplyr.