bleearmstrong closed 6 years ago
Hello @bleearmstrong, sorry it's been a very long time... but I'm back now.
I tried this out in the new v0.3.0 of the package and the improved backend code appears to have resolved this issue:
In [37]: tmp = diamonds >> group_by(X.color) >> mask(X.cut == 'Ideal')
In [38]: tmp.shape
Out[38]: (21551, 10)
In [39]: tmp.head()
Out[39]:
     carat    cut color clarity  depth  table  price     x     y     z
62    0.30  Ideal     D     SI1   62.5   57.0    552  4.29  4.32  2.69
63    0.30  Ideal     D     SI1   62.1   56.0    552  4.30  4.33  2.68
120   0.71  Ideal     D     SI2   62.3   56.0   2762  5.73  5.69  3.56
132   0.71  Ideal     D     SI1   61.9   59.0   2764  5.69  5.72  3.53
144   0.71  Ideal     D     SI2   61.6   55.0   2767  5.74  5.76  3.54
In [40]: tmp.cut.unique()
Out[40]: array(['Ideal'], dtype=object)
That was a quick test though, so correct me if I'm wrong. Also, note that groupby is now group_by in v0.3.0, to be more consistent with dplyr.
Mask doesn't seem to work correctly on grouped dataframes. Consider the three operations below.
In the first example, the mask is applied before the grouping, so it behaves as expected. In the second example, the grouping is applied before the mask. The returned dataframe includes rows where X.cut != 'Ideal', and has about 2500 rows. I'm not sure what determines which rows are returned. In the third example, there is no grouping, and the mask behaves as expected (it returns the same result as the first example).
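The expected behaviour can be stated as an invariant: a row-wise mask should keep the same rows whether or not the data was grouped first. Below is a minimal sketch of that invariant in plain Python. It is not dfply code; the sample rows and the helpers `mask_rows` and `mask_grouped` are hypothetical and only illustrate the semantics one would expect.

```python
from collections import defaultdict

# Hypothetical stand-in for a few rows of the diamonds data.
rows = [
    {"cut": "Ideal", "color": "D", "price": 552},
    {"cut": "Good", "color": "D", "price": 400},
    {"cut": "Ideal", "color": "E", "price": 326},
    {"cut": "Premium", "color": "E", "price": 326},
    {"cut": "Ideal", "color": "D", "price": 2762},
]


def mask_rows(rows, predicate):
    """Keep the rows for which predicate(row) is true, preserving order."""
    return [row for row in rows if predicate(row)]


def mask_grouped(rows, key, predicate):
    """Split rows into groups by `key`, mask each group, then recombine
    the surviving rows in their original order."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[key]].append((index, row))
    kept = []
    for members in groups.values():
        kept.extend((i, r) for i, r in members if predicate(r))
    return [r for _, r in sorted(kept, key=lambda pair: pair[0])]


def is_ideal(row):
    return row["cut"] == "Ideal"


ungrouped = mask_rows(rows, is_ideal)
grouped = mask_grouped(rows, "color", is_ideal)

# The invariant: grouping must not change which rows a row-wise mask keeps.
print(grouped == ungrouped)
```

The same comparison, run against the real dataframes, is a quick way to confirm whether a given version of the package shows the behaviour reported here.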
There are use cases where you might want to use grouping and mask together (for example, to return the min for each group, you might want to do something like
For (1), it returns a single row, where x is not a min for any group, but (2) behaves as expected.
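For reference, the intended result of the per-group minimum use case can be sketched in plain Python: every row whose value equals the smallest value within its own group is kept. This is an illustration of the expected semantics, not dfply's implementation; the helper `rows_at_group_min` and the sample rows are hypothetical.

```python
def rows_at_group_min(rows, key, value):
    """Return the rows whose `value` equals the minimum of `value`
    within their `key` group, preserving the original row order."""
    # First pass: record the smallest value seen for each group.
    minima = {}
    for row in rows:
        group = row[key]
        if group not in minima or row[value] < minima[group]:
            minima[group] = row[value]
    # Second pass: keep rows that match their own group's minimum.
    return [row for row in rows if row[value] == minima[row[key]]]


# Hypothetical sample: two colour groups, with a tie in group "E".
rows = [
    {"color": "D", "x": 4.29},
    {"color": "D", "x": 3.95},
    {"color": "E", "x": 4.05},
    {"color": "E", "x": 3.89},
    {"color": "E", "x": 3.89},
]

result = rows_at_group_min(rows, "color", "x")
for row in result:
    print(row["color"], row["x"])
```

Note that ties are kept, so a group can contribute more than one row. A grouped mask that returns a single row overall, as in case (1), cannot be correct whenever there is more than one group.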