Strange behavior of Mask on grouped df

kieferk / dfply

dplyr-style piping operations for pandas dataframes

GNU General Public License v3.0


Mask doesn't seem to work correctly on grouped dataframes. Consider the three operations below.

diamonds >> mask(X.cut == 'Ideal') >> group_by(X.color)  # works as expected
diamonds >> group_by(X.color) >> mask(X.cut == 'Ideal')  # doesn't work correctly (strange behavior)
diamonds >> mask(X.cut == 'Ideal')                       # works as expected

In the first example, the mask is applied before the grouping, so it behaves as expected. In the second example, the grouping is applied before the mask: the returned dataframe has about 2500 rows and includes rows where X.cut != 'Ideal'. I'm not sure what determines which rows come back. In the third example there is no grouping, and the result is as expected (the same rows as in the first example).
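To make the expectation concrete, here is a minimal sketch in plain pandas rather than dfply. The toy frame and its values are made up for illustration and are not the diamonds data. It only shows that a row-wise condition such as cut == 'Ideal' should select the same rows whether or not the frame is split by color first.

```python
import pandas as pd

# Hypothetical toy data standing in for the diamonds dataset.
df = pd.DataFrame({
    "cut": ["Ideal", "Good", "Ideal", "Premium", "Ideal"],
    "color": ["D", "D", "E", "E", "D"],
    "price": [552, 400, 2762, 1800, 2764],
})

# Ungrouped filter: keep rows whose cut is 'Ideal'.
ungrouped = df[df["cut"] == "Ideal"]

# Grouped filter: apply the same condition inside each color group,
# then stitch the pieces back together.
grouped = pd.concat(
    [g[g["cut"] == "Ideal"] for _, g in df.groupby("color")]
)

# Row order may differ after grouping, so compare after sorting by index.
same_rows = grouped.sort_index().equals(ungrouped)
print(same_rows)
```

If the grouped mask in dfply worked the way I expected, the second operation above would pass the same kind of check against the first and third.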

There are use cases where you might want to combine grouping and mask. For example, to return the rows holding the minimum of x within each group, you might write something like:

diamonds >> group_by(X.color) >> mask(X.x == X.x.min())  # (1)
diamonds >> mask(X.x == X.x.min())                       # (2)

(1) returns a single row, and the x in that row is not the minimum for any group. (2) behaves as expected.
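For reference, this is roughly what I would expect (1) to compute, sketched in plain pandas with groupby and transform. It says nothing about how dfply evaluates mask internally; the toy frame below is hypothetical and only mirrors the color and x columns.

```python
import pandas as pd

# Hypothetical toy data with two color groups.
df = pd.DataFrame({
    "color": ["D", "D", "E", "E", "E"],
    "x": [4.3, 3.9, 5.1, 4.0, 4.0],
})

# Per-group minimum of x, broadcast back to the shape of the original column.
group_min = df.groupby("color")["x"].transform("min")

# Expected result of (1): every row whose x equals the minimum of its own group.
per_group_min_rows = df[df["x"] == group_min]

# Expected result of (2): rows whose x equals the minimum over the whole frame.
global_min_rows = df[df["x"] == df["x"].min()]

print(per_group_min_rows)
print(global_min_rows)
```

The grouped version keeps at least one row per color (ties included), while the ungrouped version keeps only the rows at the overall minimum.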

In [37]: tmp = diamonds >> group_by(X.color) >> mask(X.cut == 'Ideal')

In [38]: tmp.shape
Out[38]: (21551, 10)

In [39]: tmp.head()
Out[39]:
     carat    cut color clarity  depth  table  price     x     y     z
62    0.30  Ideal     D     SI1   62.5   57.0    552  4.29  4.32  2.69
63    0.30  Ideal     D     SI1   62.1   56.0    552  4.30  4.33  2.68
120   0.71  Ideal     D     SI2   62.3   56.0   2762  5.73  5.69  3.56
132   0.71  Ideal     D     SI1   61.9   59.0   2764  5.69  5.72  3.53
144   0.71  Ideal     D     SI2   61.6   55.0   2767  5.74  5.76  3.54

In [40]: tmp.cut.unique()
Out[40]: array(['Ideal'], dtype=object)


Strange behavior of Mask on grouped df #24