pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.82k stars 17.99k forks source link

BUG: groupby sub-selection ignored with some methods #5264

Closed TomAugspurger closed 10 years ago

TomAugspurger commented 11 years ago

related #6512, #6524, #6346

_Update from @hayd I think these should reference _selected_obj rather than obj._

Looking through some others, looks these also ignore the selection

Aggregation functions like (they already kind of do, but they allow bad selections ie column names not in columns, may be sep issue?):

Atm selecting a column not in df doesn't raise:

what about iloc/loc/ix (current all disabled)?

The column selection on a groupby object is being ignored when .quantile() is called. So it computes the quantile on all the (numeric) columns and returns the full DataFrame.

In [92]: t = pd.DataFrame(np.random.randn(10, 4)); t[0] = np.hstack([np.ones(5), np.zeros(5)])

In [95]: t.groupby(0)[[1, 2]].quantile()  # shows other cols
Out[95]: 
   0         1         2         3
0                                 
0  0  0.127152  0.108908  0.369601
1  1 -0.321279  0.265550 -0.382398

In [96]: t[[1, 2]].groupby(t[0]).quantile()  # Should be equivalent to:
Out[96]: 
          1         2
0                    
0  0.127152  0.108908
1 -0.321279  0.265550

Seeing all these, I'm wondering if this is a bug or just how some of the methods are implementer. The docs don't mention anything about only supporting some methods though.

version: '0.12.0-883-g988d4be'

jreback commented 11 years ago

the cython routines use _iterate_slices which is where _selection is used; however _python_apply_general I think needs to be passed the _selection so that the Splitter can get access (and use) the selection. A bit tricky...but just step thru the example and you can see where it can be fixed.

I would say this is a bug / not implemented behavior

TomAugspurger commented 10 years ago

I'm taking a look at this today (the groupby code is a tad intimidating).

Would this be considered a bug as well? That the column being grouped on, 0, is included in the results and has had the function applied to it?

In [18]: t
Out[18]: 
   0         1         2         3
0  1 -0.138264  0.647689  1.523030
1  1 -0.234137  1.579213  0.767435
2  1  0.542560 -0.463418 -0.465730
3  1 -1.913280 -1.724918 -0.562288
4  1  0.314247 -0.908024 -1.412304
5  0 -0.225776  0.067528 -1.424748
6  0  0.110923 -1.150994  0.375698
7  0 -0.291694 -0.601707  1.852278
8  0 -1.057711  0.822545 -1.220844
9  0 -1.959670 -1.328186  0.196861

In [19]: t.groupby(0).apply(lambda x: x + 1)
Out[19]: 
   0         1         2         3
0  2  0.861736  1.647689  2.523030
1  2  0.765863  2.579213  1.767435
2  2  1.542560  0.536582  0.534270
3  2 -0.913280 -0.724918  0.437712
4  2  1.314247  0.091976 -0.412304
5  1  0.774224  1.067528 -0.424748
6  1  1.110923 -0.150994  1.375698
7  1  0.708306  0.398293  2.852278
8  1 -0.057711  1.822545 -0.220844
9  1 -0.959670 -0.328186  1.196861

If 0 were a column of strs then the entire apply would fail.


And just to make sure I follow this, GroupBy.__getitem__ raises a NotImplementedError, but that's just so that subclasses implement __getitem__ right?

TomAugspurger commented 10 years ago

OK, wow that was a lot of digging for a (roughly) one line change. All it comes down to is in Groupby._python_apply_general, self.grouper.apply() needs to be called with self._obj_with_exclusions instead of self.obj.

I said roughly because that would mess up the group_keys arg to groupby, so you have to check for that.


Could I get a ruling on whether the column being grouped on should be included as part of the output? My fix on the selection bug will break that. I can hack around it, but I'm not sure including the grouping column is desired. For example, from test_groupby.test_apply_multiindex_fail:

        index = MultiIndex.from_arrays([[0, 0, 0, 1, 1, 1],
                                        [1, 2, 3, 1, 2, 3]])
        df = DataFrame({'d': [1., 1., 1., 2., 2., 2.],
                        'c': np.tile(['a', 'b', 'c'], 2),
                        'v': np.arange(1., 7.)}, index=index)

        def f(group):
            v = group['v']
            group['v2'] = (v - v.min()) / (v.max() - v.min())
            return group

        result = df.groupby('d').apply(f)

        expected = df.copy()
        expected['v2'] = np.tile([0., 0.5, 1], 2)

        assert_frame_equal(result, expected)

So when I run this with my fix I get:

In [11]: df.groupby('d').apply(f)
Out[11]: 
     c  v   v2
0 1  a  1  0.0
  2  b  2  0.5
  3  c  3  1.0
1 1  a  4  0.0
  2  b  5  0.5
  3  c  6  1.0

[6 rows x 3 columns]

In [12]: expected
Out[12]: 
     c  d  v   v2
0 1  a  1  1  0.0
  2  b  1  2  0.5
  3  c  1  3  1.0
1 1  a  2  4  0.0
  2  b  2  5  0.5
  3  c  2  6  1.0

[6 rows x 4 columns]

expected is the result from before my fix. The difference is that I exclude the column being grouped on. I could see this being a pretty disruptive change. But like I said in my previous post, something like df.groupby('c').apply(lambda x: np.max(x) / 2) would fail if we don't exclude the groupby column, since its a column of strs and the applied function is looking for ints.

jreback commented 10 years ago

can u put a link to the branch?

TomAugspurger commented 10 years ago

https://github.com/TomAugspurger/pandas/tree/groupby-sugar

EDIT: forgot to push my changes. They're up now.

TomAugspurger commented 10 years ago

I guess for applys we do need to keep the column (or row) of group keys around since that's the only identifier of what values are in which group (I'm still not sure that they should have the function applied to them though).

TomAugspurger commented 10 years ago

For what it's worth, it look like R's plyr library apples the function to the grouping column:

> library("plyr", lib.loc="/usr/local/Cellar/r/3.0.2/R.framework/Versions/3.0/Resources/library")
> f <- function(x) {return(x + 1)}
> dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,rep=T))
> dt
         age group
1  11.137641     1
2   8.796264     2
3  13.530428     1
4  13.647518     2
5   5.647735     2
6   7.277148     1
7   6.479024     2
8   9.289560     1
9   5.211502     1
10 12.575817     2
11 11.291723     2
12 11.295667     1
13 13.752943     2
14  9.731945     1
15  4.490958     2
16 16.340906     2
17 10.620146     2
18 11.168694     1
19  3.015077     2
20 13.511891     2
> ddply(dt, ~group, f)
         age group
1  12.137641     2
2  14.530428     2
3   8.277148     2
4  10.289560     2
5   6.211502     2
6  12.295667     2
7  10.731945     2
8  12.168694     2
9   9.796264     3
10 14.647518     3
11  6.647735     3
12  7.479024     3
13 13.575817     3
14 12.291723     3
15 14.752943     3
16  5.490958     3
17 17.340906     3
18 11.620146     3
19  4.015077     3
20 14.511891     3
TomAugspurger commented 10 years ago

See this post on SO. I think this user was confused about the grouper having the function applied to it. My answer there goes through what I think happened to them, and it's not at all obvious.

Can anyone provide a reason why the apply or transform or agg function should be applied to the grouping column? I don't see any reason, but it seems to be deliberate in the code.

jreback commented 10 years ago

@TomAugspurger ok. I believe this is a bug.

Here is what happens. If the grouper (e.g. 'year' in this case) would have been a string, then then the aggregation usually will fall to the except part as the functions cannot aggregate strings, so item-by-item will work properly. (I say usually because as an example, if the function just returns 0 it doesn't care what it is aggregating and would NEVER fail; similary, if the aggregator function had been np.sum then it would work on strings as well!)

In this can since year is a number the function works and thus the grouping column IS included.

So I think that you could exclude the self.grouper.names e.g. the named columns when passing the data to the function that aggregates. Thus not triggering the exception more often.

I think to nail this down you need a simple example and basically a bunch of cases which need to be checked:

the product of these combinatons:

so 4 x 3 x 3 = 36 combinations! need to find a nice easy way to check these (this may be overkill though)

make sense?

TomAugspurger commented 10 years ago

Agreed completely. I'll make a new issue.

jreback commented 10 years ago

u have a PR for this issue (the sub selection) I believe? (or is it just the private branch?) go ahead and make a PR and can resolve this one

TomAugspurger commented 10 years ago

Do you want the "apply to grouper" issue in a separate PR? Or one PR with two commits? The PR for this issue (the sub selection being ignored) will probably depend on the PR for the apply to grouper.

jreback commented 10 years ago

maybe best to close at the same time

if not possible then can do separately

whatever works

jreback commented 10 years ago

@TomAugspurger I think all of these merely need to replace obj with _obj_with_exclusions (but tests have to be carefully looked at), the as_index=False are going to break (as they I think are wrong)

TomAugspurger commented 10 years ago

Great, that's the approach I was taking before getting muddled up with bigger API issues. You said you have a fix in the other issue right? I'd be happy look that over (especially the tests).

jreback commented 10 years ago

https://github.com/jreback/pandas/commit/92e5c505bc3b05ca0fe1d769c2967614d4732a79

trivial fix... (but a couple of other tests will break)...

hayd commented 10 years ago

apparently self._obj_with_exclusion isn't quite the right object to use (it excludes the grouped by column, regardless of as_index) Eeep.

I think good opportunity to change head and tail... see PR :S I created new method _selected_obj but maybe there is one already?

jreback commented 10 years ago

add to the list corr;

http://stackoverflow.com/questions/22238802/python-pandas-groupby-correlation-includes-all-columns-instead-of-selected-colum

@TomAugspurger maybe make a check list or something at the top?

I guess need to verify all of the tmethod by a smoke tests (these are methods on the whitelist).

perhaps change is really easy if its done with the whitelist generating function then....

hayd commented 10 years ago

I put this into a checklist and ticked off head and tail (which I fixed yest) added in a few but haven't properly gone through it. Like I say, I implemented a _selected_obj which is subtley different from other methods. One thing to be wary of is the as_index selection (good time to check whether these make sense, like we changed head)...

hayd commented 10 years ago

@jreback yea re whitelist way to do it, makes sense

jreback commented 10 years ago

pls tick the boxes when you have a chance (and put the ref PR #)

hayd commented 10 years ago

added issue number, will tick off once merged, looks like my PR fixes most of these, if not all. needs some tests for a few of the others... may need a swathe of tests for these as I think coverage is low...

hayd commented 10 years ago

I added a load to this list. Their is some weird behaviour in agg functions, where if you select a col not in the DataFrame it doesn't care:

In [60]: df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B'])

In [61]: g = df.groupby('A')

In [62]: g[['C']]  # should raise here (maybe this will solve)
Out[62]: <pandas.core.groupby.DataFrameGroupBy object at 0x106ec6710>

In [63]: g[['C']].max()  # raises KeyError here in my PR
Out[63]:
    C
A
1 NaN
5 NaN

feature/edge case here is that A isn't displayed unless you select it

jreback commented 10 years ago

I agree this should raise KeyError

hayd commented 10 years ago

@TomAugspurger fix for a lot of this is in master now. Would you mind taking a look at what you think about the rest, is the behaviour correct? (esp. the aggs?)

TomAugspurger commented 10 years ago

Thanks for taking care of this. I may have actually lost the branch that I started to fix this, so I guess there wasn't really duplicated effort :)

I ticked the boxes for the aggs (since #6578 closes it the unrelated KeyError issue).

I haven't tested the filter and ohlc methods (haven't used either extensively, but I'll try to come up with some tests for them).

For a groupby.plot, I'm getting back multiple (identical) figures. I'll try to figure when this behavior started.

EDIT: I'm not very smart today. Of course the groupby.plot() returns multiple figures (and they aren't identical either). It correctly gets the subselected columns. I ticked the box for plot as well.

hayd commented 10 years ago

I can take care of filter, it's a one line change (was just waiting for the above to merge). I think ohlc already passes I just needs to be added to the first list in my PR.

@TomAugspurger Do you think the aggs is all good once that 6578 is fixed? Does the behaviour look right? If so, we should add tests for them. (I unchecked them 'til we add tests)

hayd commented 10 years ago

actually I'm confused if ohlc was meant to be a method, see #6594...

TomAugspurger commented 10 years ago

Any objections to closing this?

The remaining issues (ohlc and not raising a KeyError on nonexistent columns) aren't related to the original subselection being ignored and have their own issues.

jreback commented 10 years ago

no...let's close (if necessary move anything remaining to a new issue)