valkum opened this issue 4 years ago
It seems that `sampled = df.reset_index().groupby('team').resample("1D", on='date')` fixes the issue, but I am not sure whether this would still be considered a bug.
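To make the workaround concrete, here is a sketch (data values are made up to mirror the report; `drop=True` is one variant of the reported `reset_index()` fix) of applying `reset_index` before the groupby-resample:

```python
import pandas as pd

# Hypothetical data mirroring the report: a team column, a date column,
# and a gap in the row index created by dropping a row
df = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-02",
                            "2000-01-01", "2000-01-02"]),
    "team": ["client1", "client1", "client1", "client2", "client2"],
    "temp": [0.78, 0.78, 0.035, 0.356, 0.244],
}).drop(index=1)

# reset_index restores a contiguous RangeIndex, which sidesteps the bug
result = (df.reset_index(drop=True)
            .groupby("team")
            .resample("1D", on="date")["temp"]
            .mean())
print(result)
```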
@valkum Thanks for the bug report!
This is likely related to #33548. I don't think it has anything to do with group sizes, as this code produces the same out-of-bounds error:
```python
import numpy as np
import pandas as pd

data = {
    'date': ['2000-01-01', '2000-01-01', '2000-01-02', '2000-01-01', '2000-01-02'],
    'team': ['client1', 'client1', 'client1', 'client2', 'client2'],
    'temp': [0.780302, 0.780302, 0.035013, 0.355633, 0.243835],
}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
df = df.drop(df.index[1])

sampled = df.groupby('team').resample("1D", on='date')

# Raises IndexError
sampled.agg({'temp': np.mean})
# Raises IndexError as well
sampled['temp'].mean()
```
Also, `sampled.mean()` works; it's only `sampled['temp'].mean()` that breaks.
Seeing as `reset_index` fixes it, maybe the break in the index causes the bug.
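A tiny sketch of what dropping a row does to the index: the surviving labels keep their original values, so the index is no longer a contiguous 0..n-1 range, and label values can exceed the number of rows.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]})
df = df.drop(df.index[1])  # drop the middle row

# Positions and labels now diverge: two rows, but labels 0 and 2
print(list(df.index))  # [0, 2]
```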
Thanks for your reply.
`sampled.agg(np.mean)` works too; it only breaks when you pass a dict (to cover specific columns).
Furthermore, your example works for me without an out-of-bounds error, but it nevertheless produces different results. See here.
It only breaks when you drop a row after the DataFrame is created so that, as you pointed out, the index is no longer contiguous.
So it is somehow a bug caused by non-contiguous indices combined with selecting an aggregation function for specific columns (either via `sampled['temp'].mean()` or `sampled.agg({'temp': np.mean})`).
But I see that it might be related to #33548.
Interesting. For me, my code breaks both on 1.0.5 and on the latest commit of master.
UPDATE: ah, forgot to drop the second row. @valkum , could you run the updated code to make sure that it breaks, and that we aren't dealing with something super-weird?
Investigated this a bit. The object we end up with is of class `pandas.core.resample.DatetimeIndexResamplerGroupby`, which is a non-transparent descendant of `GroupByMixin` and `DatetimeIndexResampler`, and uncovering what exactly causes the bugs when using aggregate functions is non-trivial.
I'll try to track down this bug next week.
take
Interesting. The bug can be "fixed" by using a deep copy in `_apply` in `_GroupByMixin`. We must be forgetting something when creating a shallow copy, which causes `_set_grouper` to crash. Will keep investigating.
Okay, so what happens is that `df.index` values get used deep down the call stack to draw dates from the `DatetimeIndex` that the grouping and resampling operations create. This is done through `Index.take`, and because the `DatetimeIndex` has only four elements in it and we are trying to get the element at position 4, we get an `IndexError`. This is why resetting the index fixes this.
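The failure mode described above can be reproduced directly on a standalone `DatetimeIndex` (the four-element index and the label list are made up to match the example in this thread):

```python
import pandas as pd

# A four-element DatetimeIndex like the one the resampler builds
idx = pd.DatetimeIndex(["2000-01-01", "2000-01-01",
                        "2000-01-02", "2000-01-02"])

# take() expects positions 0..3; after the drop, the surviving row
# labels are [0, 2, 3, 4], and position 4 does not exist
try:
    idx.take([0, 2, 3, 4])
    raised = False
except IndexError:
    raised = True
print(raised)
```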
The whole process is necessary because we apply aggregation functions by creating shallow copies of `Series` objects and applying the functions to them.
Here is a link to the relevant code.
As far as I can tell, we don't need to preserve the original row index before applying aggregation functions to a `DatetimeIndexResamplerGroupby`, so the obvious fix would be to reset the index somewhere down the call stack to be safe. I'll see if I can find a good candidate spot.
Thanks for your efforts. I might have found another bug, possibly related to this one, where `agg` with a dict as an argument computes something different, but I am not sure. There is a similar issue open, so I posted my PoC there: #27343.
Thanks for the info. I'll look deeper into these bugs this weekend. The improper sampling of the `DatetimeIndex` using `DataFrame.index` values as positional indexers probably has multiple effects (so it might be causing multiple bugs), but it's difficult to say until we think of a decent way to fix it and implement it.
@jreback I'd like to ask for a bit of help from the team with this one. Maybe you can see a way out of this bug or know someone who might be able to help with a groupby resampler issue? I diagnosed the problem, but hit a wall in fixing it.
When we call aggregate functions on a column of a `DatetimeIndexResamplerGroupby` instance that is resampled on a date column, we end up drawing dates with `DatetimeIndex.take`, and the values we pass to it are taken from the index of the original `DataFrame`. This mechanism leads to two things:

1. If the `DataFrame.index` is anything except a `RangeIndex` starting at 0, the thing breaks with an index error. So if we drop a row as the OP did, or if the `DataFrame` is indexed with a `DatetimeIndex`, as in the example below, nothing works.
2. The expected behavior of calling an aggregate function on a `ResamplerGroupby` subtype is to get data that's grouped by the groupby columns and then by the resampling frequency of the resampler. What we end up with instead is that for each groupby group the code attempts to resample the data with `take` and then collapse it into one number with the aggregate function.

The problem with fixing this mess is that the functionality is implemented in the inheritance chain, and I've so far been unable to fix it without breaking the `Resampler` class in horrible ways.
Here is a minimal case to reproduce the bug:
```python
import pandas as pd

df = pd.DataFrame({'date': [pd.to_datetime('2000-01-01')], 'group': [1], 'value': [1]},
                  index=pd.DatetimeIndex(['2000-01-01']))
df.groupby('group').resample('1D', on='date')['value'].mean()
```
This ends up throwing:

```
IndexError: index 946684800000000000 is out of bounds for size 1
```
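The huge number in that error message is itself a clue: it is the nanosecond epoch value of the `2000-01-01` timestamp, i.e. an index *label* being misused as a *position*:

```python
import pandas as pd

# The DatetimeIndex label for 2000-01-01, as nanoseconds since the epoch,
# matches the "index" reported in the IndexError above
print(pd.Timestamp("2000-01-01").value)
```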
Deep down the call stack, we create a `DatetimeIndex` based on the `date` column and then call `DatetimeIndex.take` on it, passing values from `df.index`.
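For contrast, a quick sketch (run on a current pandas, where behavior may differ from the version under discussion) showing that the same minimal frame with its default `RangeIndex` resamples without error:

```python
import pandas as pd

# Same minimal frame, but left with its default RangeIndex
df = pd.DataFrame({'date': [pd.to_datetime('2000-01-01')],
                   'group': [1], 'value': [1]})

# With a 0-based RangeIndex, labels and positions coincide, so the
# label-as-position confusion is harmless here
result = df.groupby('group').resample('1D', on='date')['value'].mean()
print(result)
```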
I'd appreciate some help with finding a viable approach here.
Below is the full error traceback for this case:
@AlexKirko haven't looked closely, but the issue is that you don't want to use `.take` too early; that converts indexers (e.g. positions in an index) to the index values themselves.
We ideally want to convert only at the very end.
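A sketch of the distinction being drawn here, using a plain `Index`: `take` consumes positional indexers, while `get_indexer` is the existing pandas API that converts labels to positions (the label list is made up to match the dropped-row example above):

```python
import pandas as pd

idx = pd.Index([0, 2, 3, 4])  # labels left after dropping a row

# take() wants positions; get_indexer converts labels to positions,
# which is the conversion that should happen only at the very end
positions = idx.get_indexer([2, 4])
print(list(positions))            # positions of labels 2 and 4
print(list(idx.take(positions)))  # back to the labels
```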
Makes sense, thanks. I'll try to look at the differences between calling aggregate functions on a `ResamplerGroupby` without selecting a column (which works) and with it (which ends up passing original `DataFrame` index values to `take` and breaks). Maybe that will help.
Another example of this happening:

```python
df = pd.DataFrame({
    'a': range(10),
    'time': pd.date_range('2020-01-01', '2020-01-10', freq='D')
})
```

Using both groupby and resample:

```python
df.iloc[range(0, 10, 2)].groupby('a').resample('D', on='time')['a'].mean()
```
It fails with an IndexError:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../lib/python3.8/site-packages/pandas/core/resample.py", line 968, in g
    return self._downsample(_method)
  File ".../lib/python3.8/site-packages/pandas/core/resample.py", line 1024, in _apply
    result = self._groupby.apply(func)
  File ".../lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 221, in apply
    return super().apply(func, *args, **kwargs)
  File ".../lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 894, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File ".../lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 928, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File ".../lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 238, in apply
    res = f(group)
  File ".../lib/python3.8/site-packages/pandas/core/resample.py", line 1017, in func
    x = self._shallow_copy(x, groupby=self.groupby)
  File ".../lib/python3.8/site-packages/pandas/core/groupby/base.py", line 31, in _shallow_copy
    return self._constructor(obj, **kwargs)
  File ".../lib/python3.8/site-packages/pandas/core/resample.py", line 103, in __init__
    self.groupby._set_grouper(self._convert_obj(obj), sort=True)
  File ".../lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 362, in _set_grouper
    ax = self._grouper.take(obj.index)
  File ".../lib/python3.8/site-packages/pandas/core/indexes/datetimelike.py", line 208, in take
    result = NDArrayBackedExtensionIndex.take(
  File ".../lib/python3.8/site-packages/pandas/core/indexes/base.py", line 751, in take
    taken = algos.take(
  File ".../lib/python3.8/site-packages/pandas/core/algorithms.py", line 1657, in take
    result = arr.take(indices, axis=axis)
  File ".../lib/python3.8/site-packages/pandas/core/arrays/_mixins.py", line 71, in take
    new_data = take(
  File ".../lib/python3.8/site-packages/pandas/core/algorithms.py", line 1657, in take
    result = arr.take(indices, axis=axis)
IndexError: index 6 is out of bounds for axis 0 with size 5
```
Resetting the index before grouping gives the correct result:

```python
df.iloc[range(0, 10, 2)].reset_index().groupby('a').resample('D', on='time')['a'].mean()
```

```
a  time
0  2020-01-01    0
2  2020-01-03    2
4  2020-01-05    4
6  2020-01-07    6
8  2020-01-09    8
Name: a, dtype: int64
```
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
See here as well: https://repl.it/@valkum/WrithingNotablePascal
Problem description
`agg` fails with:

```
IndexError: index 3 is out of bounds for axis 0 with size 3
```
Note that this does work as expected when I do not drop a row after creating the DataFrame, so I assume it is caused by the index.
Expected Output
No fail.
Output of `pd.show_versions()`