pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

pandas.core.groupby.GroupBy.apply fails #20949

Closed MBlistein closed 6 years ago

MBlistein commented 6 years ago

Code Sample:

>>> import pandas as pd
>>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
>>> g = df.groupby('A')
>>> g.apply(lambda x: x / x.sum())

Problem description

Applying a function to a grouped DataFrame fails with a ValueError. The code above is the example from the official pandas documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.apply.html

Output to the above code:

/usr/local/lib/python2.7/dist-packages/pandas/core/computation/check.py:17: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.4.6

  ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 805, in apply
    return self._python_apply_general(f)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 809, in _python_apply_general
    self.axis)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 1969, in apply
    res = f(group)
  File "<stdin>", line 1, in <lambda>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 1262, in f
    return self._combine_series(other, na_op, fill_value, axis, level)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 3944, in _combine_series
    try_cast=try_cast)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 3958, in _combine_series_infer
    try_cast=try_cast)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 3981, in _combine_match_columns
    try_cast=try_cast)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3435, in eval
    return self.apply('eval', **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3329, in apply
    applied = getattr(b, f)(**kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 1377, in eval
    result = get_result(other)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 1346, in get_result
    result = func(values, other)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 1216, in na_op
    yrav.fill(yrav.item())
ValueError: can only convert an array of size 1 to a Python scalar

The error can be 'fixed' by running another operation (here, sum) on the grouped object first:

>>> g.sum()
   B   C
A       
a  3  10
b  3   5

>>> g.apply(lambda x: x / x.sum())
          B    C
0  0.333333  0.4
1  0.666667  0.6
2  1.000000  1.0

Expected Output

>>> g.apply(lambda x: x / x.sum())
          B    C
0  0.333333  0.4
1  0.666667  0.6
2  1.000000  1.0

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-122-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.utf8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 2.8.7
pip: 9.0.1
setuptools: 20.7.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: None
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2014.10
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.3
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.5.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.0.11
pymysql: 0.7.2.None
psycopg2: 2.6.1 (dt dec mx pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
TomAugspurger commented 6 years ago

Thanks for the bug report.

WillAyd commented 6 years ago

Hmm interesting. FWIW when I remove numexpr I can't get this to run at all, regardless of whether or not I run another agg function first.

WillAyd commented 6 years ago

Numexpr may be a red herring. From what I can tell the problem occurs at the following line of code:

https://github.com/pandas-dev/pandas/blob/ef019faa06f762c8c203985a11108731384b2dae/pandas/core/groupby/groupby.py#L5063

When apply is called without another agg function first, sdata (which comes from _selected_obj) still includes the grouping column as part of the data, so this line throws and execution falls back to another path.

Agg functions like sum, mean, etc. call _set_group_selection, which sets the appropriately cached value for _selected_obj. I suppose a quick fix is to add a call to that at the beginning of apply, though I can't tell from the code alone why that isn't done across the board.
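
If that diagnosis is right, forcing the selection first should make the original example work. A rough sketch (this leans on the private _set_group_selection; the output shown is just the expected output from the report above, not something I re-verified on every version):

>>> import pandas as pd
>>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
>>> g = df.groupby('A')
>>> g._set_group_selection()  # cache _selected_obj without the grouping column 'A'
>>> g.apply(lambda x: x / x.sum())
          B    C
0  0.333333  0.4
1  0.666667  0.6
2  1.000000  1.0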

cc @jreback for any insight

Dr-Irv commented 6 years ago

Here's another example that fails with 0.23rc2 (and in 0.22.0 as well), based on code from pandas\core\indexes\datetimes.py in test_agg_timezone_round_trip:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.23.0rc2'

In [3]: dates = [pd.Timestamp("2016-01-0%d 12:00:00" % i, tz='US/Pacific')
   ...:          for i in range(1, 5)]
   ...: df = pd.DataFrame({'A': ['a', 'b'] * 2, 'B': dates})
   ...: grouped = df.groupby('A')
   ...:

In [4]: df
Out[4]:
   A                         B
0  a 2016-01-01 12:00:00-08:00
1  b 2016-01-02 12:00:00-08:00
2  a 2016-01-03 12:00:00-08:00
3  b 2016-01-04 12:00:00-08:00

In [5]: grouped.apply(lambda x: x.iloc[0])[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3062             try:
-> 3063                 return self._engine.get_loc(key)
   3064             except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5720)()
    138
--> 139     cpdef get_loc(self, object val):
    140         if is_definitely_invalid_key(val):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5566)()
    160         try:
--> 161             return self.mapping.get_item(val)
    162         except (TypeError, ValueError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22442)()
   1491
-> 1492     cpdef get_item(self, object val):
   1493         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22396)()
   1499         else:
-> 1500             raise KeyError(val)
   1501

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-2b16555d6e05> in <module>()
----> 1 grouped.apply(lambda x: x.iloc[0])[0]

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\frame.py in __getitem__(self, key)
   2685             return self._getitem_multilevel(key)
   2686         else:
-> 2687             return self._getitem_column(key)
   2688
   2689     def _getitem_column(self, key):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\frame.py in _getitem_column(self, key)
   2692         # get column
   2693         if self.columns.is_unique:
-> 2694             return self._get_item_cache(key)
   2695
   2696         # duplicate columns & possible reduce dimensionality

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\generic.py in _get_item_cache(self, item)
   2485         res = cache.get(item)
   2486         if res is None:
-> 2487             values = self._data.get(item)
   2488             res = self._box_item_values(item, values)
   2489             cache[item] = res

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\internals.py in get(self, item, fastpath)
   4113
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3063                 return self._engine.get_loc(key)
   3064             except KeyError:
-> 3065                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3066
   3067         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5720)()
    137             util.set_value_at(arr, loc, value)
    138
--> 139     cpdef get_loc(self, object val):
    140         if is_definitely_invalid_key(val):
    141             raise TypeError("'{val}' is an invalid key".format(val=val))

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5566)()
    159
    160         try:
--> 161             return self.mapping.get_item(val)
    162         except (TypeError, ValueError):
    163             raise KeyError(val)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22442)()
   1490                                        sizeof(uint32_t)) # flags
   1491
-> 1492     cpdef get_item(self, object val):
   1493         cdef khiter_t k
   1494         if val != val or val is None:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22396)()
   1498             return self.table.vals[k]
   1499         else:
-> 1500             raise KeyError(val)
   1501
   1502     cpdef set_item(self, object key, Py_ssize_t val):

KeyError: 0

However, if you do the following, it works:

In [6]: grouped.nth(0)['B'].iloc[0]
Out[6]: Timestamp('2016-01-01 12:00:00-0800', tz='US/Pacific')

In [7]: grouped.apply(lambda x: x.iloc[0])[0]
Out[7]: Timestamp('2016-01-01 12:00:00-0800', tz='US/Pacific')

So performing one operation (in this case nth) on the grouped object before the apply makes the apply work.

WillAyd commented 6 years ago

@Dr-Irv seems related. Some code below illustrates what I think is going on:

>>> grouped.apply(lambda x: x.iloc[0])[0]  # KeyError as indicator
KeyError

>>> grouped._set_group_selection()
>>> grouped.apply(lambda x: x.iloc[0])[0]  # Works now, as 'A' was not part of data
Timestamp('2016-01-01 12:00:00-0800', tz='US/Pacific')

>>> grouped._reset_group_selection()  # Clear out the group selection
>>> grouped.apply(lambda x: x.iloc[0])[0]  # Back to failing
KeyError

Unfortunately, just adding this call before _python_apply_general broke other tests where the grouping was supposed to be part of the returned object (at least according to the tests); see the sketch below. Reviewing in more detail; hope to have a PR soon.
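
To make the conflict concrete, an identity apply shows what those tests are checking. A sketched session (outputs inferred from the selection behaviour described above rather than re-run on every version):

>>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
>>> g = df.groupby('A')
>>> g.apply(lambda x: x)      # by default the grouping column 'A' is part of what func sees
   A  B  C
0  a  1  4
1  a  2  6
2  b  3  5
>>> g._set_group_selection()
>>> g.apply(lambda x: x)      # with the selection forced, 'A' is dropped from the result
   B  C
0  1  4
1  2  6
2  3  5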

jreback commented 6 years ago

This didn't work even in 0.20.3. Not sure how we don't have a test for it, though.
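
A regression test for the documented example could look roughly like this (a sketch only; the test name and placement are hypothetical, and the expected frame is the output from the docs/report above):

import pandas as pd
import pandas.util.testing as tm

def test_groupby_apply_on_fresh_groupby():
    # hypothetical regression test for the documented GroupBy.apply example
    df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
    result = df.groupby('A').apply(lambda x: x / x.sum())
    expected = pd.DataFrame({'B': [1 / 3.0, 2 / 3.0, 1.0], 'C': [0.4, 0.6, 1.0]})
    tm.assert_frame_equal(result, expected)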

jreback commented 6 years ago

@Dr-Irv your example is a separate issue. Please open a new report for that one.