Closed TomAugspurger closed 10 years ago
the cython routines use _iterate_slices
which is where _selection
is used; however _python_apply_general
I think needs to be passed the _selection so that the Splitter
can get access (and use) the selection. A bit tricky...but just step thru the example and you can see where it can be fixed.
I would say this is a bug / not implemented behavior
I'm taking a look at this today (the groupby code is a tad intimidating).
Would this be considered a bug as well? That the column being grouped on, 0
, is included in the results and has had the function applied to it?
In [18]: t
Out[18]:
0 1 2 3
0 1 -0.138264 0.647689 1.523030
1 1 -0.234137 1.579213 0.767435
2 1 0.542560 -0.463418 -0.465730
3 1 -1.913280 -1.724918 -0.562288
4 1 0.314247 -0.908024 -1.412304
5 0 -0.225776 0.067528 -1.424748
6 0 0.110923 -1.150994 0.375698
7 0 -0.291694 -0.601707 1.852278
8 0 -1.057711 0.822545 -1.220844
9 0 -1.959670 -1.328186 0.196861
In [19]: t.groupby(0).apply(lambda x: x + 1)
Out[19]:
0 1 2 3
0 2 0.861736 1.647689 2.523030
1 2 0.765863 2.579213 1.767435
2 2 1.542560 0.536582 0.534270
3 2 -0.913280 -0.724918 0.437712
4 2 1.314247 0.091976 -0.412304
5 1 0.774224 1.067528 -0.424748
6 1 1.110923 -0.150994 1.375698
7 1 0.708306 0.398293 2.852278
8 1 -0.057711 1.822545 -0.220844
9 1 -0.959670 -0.328186 1.196861
If 0
were a column of strs then the entire apply would fail.
And just to make sure I follow this, GroupBy.__getitem__
raises a NotImplementedError
, but that's just so that subclasses implement __getitem__
right?
OK, wow that was a lot of digging for a (roughly) one line change. All it comes down to is in Groupby._python_apply_general
, self.grouper.apply()
needs to be called with self._obj_with_exclusions
instead of self.obj
.
I said roughly because that would mess up the group_keys
arg to groupby, so you have to check for that.
Could I get a ruling on whether the column being grouped on should be included as part of the output? My fix on the selection bug will break that. I can hack around it, but I'm not sure including the grouping column is desired. For example, from test_groupby.test_apply_multiindex_fail
:
index = MultiIndex.from_arrays([[0, 0, 0, 1, 1, 1],
[1, 2, 3, 1, 2, 3]])
df = DataFrame({'d': [1., 1., 1., 2., 2., 2.],
'c': np.tile(['a', 'b', 'c'], 2),
'v': np.arange(1., 7.)}, index=index)
def f(group):
v = group['v']
group['v2'] = (v - v.min()) / (v.max() - v.min())
return group
result = df.groupby('d').apply(f)
expected = df.copy()
expected['v2'] = np.tile([0., 0.5, 1], 2)
assert_frame_equal(result, expected)
So when I run this with my fix I get:
In [11]: df.groupby('d').apply(f)
Out[11]:
c v v2
0 1 a 1 0.0
2 b 2 0.5
3 c 3 1.0
1 1 a 4 0.0
2 b 5 0.5
3 c 6 1.0
[6 rows x 3 columns]
In [12]: expected
Out[12]:
c d v v2
0 1 a 1 1 0.0
2 b 1 2 0.5
3 c 1 3 1.0
1 1 a 2 4 0.0
2 b 2 5 0.5
3 c 2 6 1.0
[6 rows x 4 columns]
expected
is the result from before my fix. The difference is that I exclude the column being grouped on. I could see this being a pretty disruptive change. But like I said in my previous post, something like df.groupby('c').apply(lambda x: np.max(x) / 2)
would fail if we don't exclude the groupby column, since its a column of str
s and the applied function is looking for int
s.
can u put a link to the branch?
https://github.com/TomAugspurger/pandas/tree/groupby-sugar
EDIT: forgot to push my changes. They're up now.
I guess for apply
s we do need to keep the column (or row) of group keys around since that's the only identifier of what values are in which group (I'm still not sure that they should have the function applied to them though).
For what it's worth, it look like R
's plyr
library apples the function to the grouping column:
> library("plyr", lib.loc="/usr/local/Cellar/r/3.0.2/R.framework/Versions/3.0/Resources/library")
> f <- function(x) {return(x + 1)}
> dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,rep=T))
> dt
age group
1 11.137641 1
2 8.796264 2
3 13.530428 1
4 13.647518 2
5 5.647735 2
6 7.277148 1
7 6.479024 2
8 9.289560 1
9 5.211502 1
10 12.575817 2
11 11.291723 2
12 11.295667 1
13 13.752943 2
14 9.731945 1
15 4.490958 2
16 16.340906 2
17 10.620146 2
18 11.168694 1
19 3.015077 2
20 13.511891 2
> ddply(dt, ~group, f)
age group
1 12.137641 2
2 14.530428 2
3 8.277148 2
4 10.289560 2
5 6.211502 2
6 12.295667 2
7 10.731945 2
8 12.168694 2
9 9.796264 3
10 14.647518 3
11 6.647735 3
12 7.479024 3
13 13.575817 3
14 12.291723 3
15 14.752943 3
16 5.490958 3
17 17.340906 3
18 11.620146 3
19 4.015077 3
20 14.511891 3
See this post on SO. I think this user was confused about the grouper having the function applied to it. My answer there goes through what I think happened to them, and it's not at all obvious.
Can anyone provide a reason why the apply
or transform
or agg
function should be applied to the grouping column? I don't see any reason, but it seems to be deliberate in the code.
@TomAugspurger ok. I believe this is a bug.
Here is what happens. If the grouper (e.g. 'year' in this case) would have been a string, then then the aggregation usually will fall to the except part as the functions cannot aggregate strings, so item-by-item will work properly.
(I say usually because as an example, if the function just returns 0 it doesn't care what it is aggregating and would NEVER fail; similary, if the aggregator function had been np.sum
then it would work on strings as well!)
In this can since year is a number the function works and thus the grouping column IS included.
So I think that you could exclude the self.grouper.names
e.g. the named columns when passing the data to the function that aggregates. Thus not triggering the exception more often.
I think to nail this down you need a simple example and basically a bunch of cases which need to be checked:
the product of these combinatons:
lambda x: 0
, b), return lambda x: np.mean(x)
, c) lambda x: np.sum(x)
so 4 x 3 x 3 = 36 combinations! need to find a nice easy way to check these (this may be overkill though)
make sense?
Agreed completely. I'll make a new issue.
u have a PR for this issue (the sub selection) I believe? (or is it just the private branch?) go ahead and make a PR and can resolve this one
Do you want the "apply to grouper" issue in a separate PR? Or one PR with two commits? The PR for this issue (the sub selection being ignored) will probably depend on the PR for the apply to grouper.
maybe best to close at the same time
if not possible then can do separately
whatever works
@TomAugspurger I think all of these merely need to replace obj
with _obj_with_exclusions
(but tests have to be carefully looked at), the as_index=False
are going to break (as they I think are wrong)
Great, that's the approach I was taking before getting muddled up with bigger API issues. You said you have a fix in the other issue right? I'd be happy look that over (especially the tests).
https://github.com/jreback/pandas/commit/92e5c505bc3b05ca0fe1d769c2967614d4732a79
trivial fix... (but a couple of other tests will break)...
apparently self._obj_with_exclusion isn't quite the right object to use (it excludes the grouped by column, regardless of as_index) Eeep.
I think good opportunity to change head and tail... see PR :S I created new method _selected_obj
but maybe there is one already?
add to the list corr;
@TomAugspurger maybe make a check list or something at the top?
I guess need to verify all of the tmethod by a smoke tests (these are methods on the whitelist).
perhaps change is really easy if its done with the whitelist generating function then....
I put this into a checklist and ticked off head and tail (which I fixed yest) added in a few but haven't properly gone through it. Like I say, I implemented a _selected_obj
which is subtley different from other methods. One thing to be wary of is the as_index
selection (good time to check whether these make sense, like we changed head)...
@jreback yea re whitelist way to do it, makes sense
pls tick the boxes when you have a chance (and put the ref PR #)
added issue number, will tick off once merged, looks like my PR fixes most of these, if not all. needs some tests for a few of the others... may need a swathe of tests for these as I think coverage is low...
I added a load to this list. Their is some weird behaviour in agg functions, where if you select a col not in the DataFrame it doesn't care:
In [60]: df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B'])
In [61]: g = df.groupby('A')
In [62]: g[['C']] # should raise here (maybe this will solve)
Out[62]: <pandas.core.groupby.DataFrameGroupBy object at 0x106ec6710>
In [63]: g[['C']].max() # raises KeyError here in my PR
Out[63]:
C
A
1 NaN
5 NaN
feature/edge case here is that A isn't displayed unless you select it
I agree this should raise KeyError
@TomAugspurger fix for a lot of this is in master now. Would you mind taking a look at what you think about the rest, is the behaviour correct? (esp. the aggs?)
Thanks for taking care of this. I may have actually lost the branch that I started to fix this, so I guess there wasn't really duplicated effort :)
I ticked the boxes for the aggs (since #6578 closes it the unrelated KeyError issue).
I haven't tested the filter
and ohlc
methods (haven't used either extensively, but I'll try to come up with some tests for them).
For a groupby.plot
, I'm getting back multiple (identical) figures. I'll try to figure when this behavior started.
EDIT: I'm not very smart today. Of course the groupby.plot()
returns multiple figures (and they aren't identical either). It correctly gets the subselected columns. I ticked the box for plot as well.
I can take care of filter, it's a one line change (was just waiting for the above to merge). I think ohlc already passes I just needs to be added to the first list in my PR.
@TomAugspurger Do you think the aggs is all good once that 6578 is fixed? Does the behaviour look right? If so, we should add tests for them. (I unchecked them 'til we add tests)
actually I'm confused if ohlc was meant to be a method, see #6594...
Any objections to closing this?
The remaining issues (ohlc and not raising a KeyError on nonexistent columns) aren't related to the original subselection being ignored and have their own issues.
no...let's close (if necessary move anything remaining to a new issue)
related #6512, #6524, #6346
_Update from @hayd I think these should reference
_selected_obj
rather thanobj
._Looking through some others, looks these also ignore the selection
Aggregation functions like (they already kind of do, but they allow bad selections ie column names not in columns, may be sep issue?):
Atm selecting a column not in df doesn't raise:
what about
iloc/loc/ix
(current all disabled)?The column selection on a groupby object is being ignored when
.quantile()
is called. So it computes the quantile on all the (numeric) columns and returns the full DataFrame.Seeing all these, I'm wondering if this is a bug or just how some of the methods are implementer. The docs don't mention anything about only supporting some methods though.
version: '0.12.0-883-g988d4be'