intake / akimbo

For when your data won't fit in your dataframe
https://akimbo.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Cannot reproduce documentation groupby (`AwkwardExtensionArray` object has no attribute `all`) #35

Open rtbs-dev opened 11 months ago

rtbs-dev commented 11 months ago

Hi all! Lovely utility here. I was playing with the example from the docs and can't quite seem to find a good workaround for this bug:

(df
 .set_index('name')
 .groupby('team', group_keys=True)
 .apply(lambda x: x.goals.ak.mean(axis=1))
)
[...] lib/python3.10/site-packages/pandas/core/groupby/groupby.py:1630, in GroupBy._python_apply_general(self, f, data, not_indexed_same, is_transform, is_agg)
   1623     # We want to behave as if `self.group_keys=False` when reconstructing
   1624     # the object. However, we don't want to mutate the stateful GroupBy
   1625     # object, so we just override it.
...
    986 # error: Unsupported left operand type for & ("ExtensionArray")
    987 equal_na = self.isna() & other.isna()  # type: ignore[operator]
--> 988 return bool((equal_values | equal_na).all())

AttributeError: 'AwkwardExtensionArray' object has no attribute 'all'

The same thing happens with .agg, and the .groupby(['team','name']).apply(...) call I would usually use fails with a similar error, this time complaining about no attribute 'any'.
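To make the failure mode in the traceback concrete: line 988 calls .all() on the result of equal_values | equal_na, and the error message says that result is an AwkwardExtensionArray, which has no such method. The sketch below reproduces that shape with a hypothetical stand-in class (BoolLike and equals_like are invented for illustration; this is neither pandas nor awkward_pandas code).

```python
class BoolLike:
    """Hypothetical stand-in for an array type whose `|` returns its own
    type and which does not define `.all()`."""

    def __init__(self, values):
        self.values = list(values)

    def __or__(self, other):
        # Element-wise OR that returns another BoolLike, not a NumPy array.
        return BoolLike(a or b for a, b in zip(self.values, other.values))


def equals_like(equal_values, equal_na):
    # Same shape as line 988 in the traceback above.
    return bool((equal_values | equal_na).all())


message = None
try:
    equals_like(BoolLike([True, False]), BoolLike([False, False]))
except AttributeError as exc:
    message = str(exc)
    print(message)
```

If the comparison returned a plain boolean NumPy array instead, .all() would exist and the call would go through; the sketch says nothing about which package version changes that behaviour.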

Here's my version info, as in the docs:

awkward         2.3.2
awkward_pandas  2023.8.0
numpy           1.23.5
pandas          1.5.2
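For anyone else comparing environments, a version table like the one above can be produced with the standard library alone. This is a minimal sketch: report_versions is a hypothetical helper name, and the distribution names are assumed to match what pip installs (awkward-pandas is the name used with pip later in this thread).

```python
from importlib import metadata


def report_versions(packages):
    """Return {distribution name: installed version or 'not installed'}."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Keep going so a missing package still shows up in the report.
            report[name] = "not installed"
    return report


versions = report_versions(["awkward", "awkward-pandas", "numpy", "pandas"])
for name, version in versions.items():
    print(f"{name:<16}{version}")
```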

I should mention that the behavior of s.ak.to_columns() appears to have changed as well, since my version returns only a single column named awkward-data, vs. the docs that have a column for every field in the array.

douglasdavis commented 11 months ago

Hi!

I have these versions installed:

awkward         2.3.2
awkward_pandas  2023.8.0
numpy           1.23.5
pandas          1.5.2

And I'm unable to reproduce the error you're seeing (the example in the docs runs for me with those versions). Would you be able to spin up a fresh conda/virtual environment with these versions and try again?

For completeness here's what I see locally:

In [20]: data = """
    ...: - name: Bob\n  team: tigers\n  goals: [0, 0, 0, 1, 2, 0, 1]\n
    ...: - name: Alice\n  team: bears\n  goals: [3, 2, 1, 0, 1]\n
    ...: - name: Jack\n  team: bears\n  goals: [0, 0, 0, 0, 0, 0, 0, 0, 1]\n
    ...: - name: Jill\n  team: bears\n  goals: [3, 0, 2]\n
    ...: - name: Ted\n  team: tigers\n  goals: [0, 0, 0, 0, 0]\n
    ...: - name: Ellen\n  team: tigers\n  goals: [1, 0, 0, 0, 2, 0, 1]\n
    ...: - name: Dan\n  team: bears\n  goals: [0, 0, 3, 1, 0, 2, 0, 0]\n
    ...: - name: Brad\n  team: bears\n  goals: [0, 0, 4, 0, 0, 1]\n
    ...: - name: Nancy\n  team: tigers\n  goals: [0, 0, 1, 1, 1, 1, 0]\n
    ...: - name: Lance\n  team: bears\n  goals: [1, 1, 1, 1, 1]\n
    ...: - name: Sara\n  team: tigers\n  goals: [0, 1, 0, 2, 0, 3]\n
    ...: - name: Ryan\n  team: tigers\n  goals: [1, 2, 3, 0, 0, 0, 0]\n
    ...: """

In [21]: import yaml
    ...: 
    ...: data = yaml.load(data, Loader=yaml.SafeLoader)
    ...: data = ak.Array(data)

In [22]: s = akpd.from_awkward(data)

In [23]: df = s.ak.to_columns(extract_all=True)

In [24]: (df
    ...:  .set_index('name')
    ...:  .groupby('team', group_keys=True)
    ...:  .apply(lambda x: x.goals.ak.mean(axis=1))
    ...: )
Out[24]: 
team    name 
bears   Alice         1.4
        Jack     0.111111
        Jill     1.666667
        Dan          0.75
        Brad     0.833333
        Lance         1.0
tigers  Bob      0.571429
        Ted           0.0
        Ellen    0.571429
        Nancy    0.571429
        Sara          1.0
        Ryan     0.857143
dtype: awkward

In [25]: (df
    ...:  .set_index('name')
    ...:  .groupby(['team', 'name'], group_keys=True)
    ...:  .apply(lambda x: x.goals.ak.mean(axis=1))
    ...: )
Out[32]: 
team    name   name 
bears   Alice  Alice         1.4
        Brad   Brad     0.833333
        Dan    Dan          0.75
        Jack   Jack     0.111111
        Jill   Jill     1.666667
        Lance  Lance         1.0
tigers  Bob    Bob      0.571429
        Ellen  Ellen    0.571429
        Nancy  Nancy    0.571429
        Ryan   Ryan     0.857143
        Sara   Sara          1.0
        Ted    Ted           0.0
dtype: awkward

I'm also unable to reproduce this:

"I should mention that the behavior of s.ak.to_columns() appears to have changed as well, since my version returns only a single column named awkward-data, vs. the docs that have a column for every field in the array."

In [18]: s.ak.to_columns()
Out[18]: 
     name    team                            awkward-data
0     Bob  tigers        {'goals': [0, 0, 0, 1, 2, 0, 1]}
1   Alice   bears              {'goals': [3, 2, 1, 0, 1]}
2    Jack   bears  {'goals': [0, 0, 0, 0, 0, 0, 0, 0, 1]}
3    Jill   bears                    {'goals': [3, 0, 2]}
4     Ted  tigers              {'goals': [0, 0, 0, 0, 0]}
5   Ellen  tigers        {'goals': [1, 0, 0, 0, 2, 0, 1]}
6     Dan   bears     {'goals': [0, 0, 3, 1, 0, 2, 0, 0]}
7    Brad   bears           {'goals': [0, 0, 4, 0, 0, 1]}
8   Nancy  tigers        {'goals': [0, 0, 1, 1, 1, 1, 0]}
9   Lance   bears              {'goals': [1, 1, 1, 1, 1]}
10   Sara  tigers           {'goals': [0, 1, 0, 2, 0, 3]}
11   Ryan  tigers        {'goals': [1, 2, 3, 0, 0, 0, 0]}

In [19]: s.ak.to_columns(extract_all=True)
Out[19]: 
     name    team                        goals
0     Bob  tigers        [0, 0, 0, 1, 2, 0, 1]
1   Alice   bears              [3, 2, 1, 0, 1]
2    Jack   bears  [0, 0, 0, 0, 0, 0, 0, 0, 1]
3    Jill   bears                    [3, 0, 2]
4     Ted  tigers              [0, 0, 0, 0, 0]
5   Ellen  tigers        [1, 0, 0, 0, 2, 0, 1]
6     Dan   bears     [0, 0, 3, 1, 0, 2, 0, 0]
7    Brad   bears           [0, 0, 4, 0, 0, 1]
8   Nancy  tigers        [0, 0, 1, 1, 1, 1, 0]
9   Lance   bears              [1, 1, 1, 1, 1]
10   Sara  tigers           [0, 1, 0, 2, 0, 3]
11   Ryan  tigers        [1, 2, 3, 0, 0, 0, 0]
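The difference between the two outputs above can be modelled in plain Python. This is only a toy model of what the tables show (scalar fields become columns by default, list-valued fields stay under awkward-data unless extract_all=True); to_columns_like is a hypothetical name and is not how awkward_pandas implements to_columns.

```python
def to_columns_like(records, extract_all=False):
    """Toy model of the two tables above, operating on a list of dicts."""
    columns = {}
    leftover = []
    for rec in records:
        rest = {}
        for key, value in rec.items():
            # Scalars are always pulled out; nested values only on request.
            if extract_all or not isinstance(value, (list, dict)):
                columns.setdefault(key, []).append(value)
            else:
                rest[key] = value
        leftover.append(rest)
    if any(leftover):
        # Whatever was not extracted stays together in one column.
        columns["awkward-data"] = leftover
    return columns


records = [
    {"name": "Bob", "team": "tigers", "goals": [0, 0, 0, 1, 2, 0, 1]},
    {"name": "Alice", "team": "bears", "goals": [3, 2, 1, 0, 1]},
]
print(list(to_columns_like(records)))
print(list(to_columns_like(records, extract_all=True)))
```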

rtbs-dev commented 11 months ago

I downloaded the exact notebook for your "quickstart", created a fresh conda environment with the defaults, and ran pip install awkward awkward-pandas ipykernel pyyaml (followed by python -m ipykernel install --user --name awkward to make the kernel available).

Here are the versions that gives me:

awkward         2.3.2
awkward_pandas  2023.8.0
numpy           1.25.2
pandas          2.0.3

Interestingly, the groupby now works, but I do reproduce the to_columns problem exactly:

s.ak.to_columns() gives:

    awkward-data
0   {'name': 'Bob', 'team': 'tigers', 'goals': [0,...
1   {'name': 'Alice', 'team': 'bears', 'goals': [3...
2   {'name': 'Jack', 'team': 'bears', 'goals': [0,...
3   {'name': 'Jill', 'team': 'bears', 'goals': [3,...
4   {'name': 'Ted', 'team': 'tigers', 'goals': [0,...
5   {'name': 'Ellen', 'team': 'tigers', 'goals': [...
6   {'name': 'Dan', 'team': 'bears', 'goals': [0, ...
7   {'name': 'Brad', 'team': 'bears', 'goals': [0,...
8   {'name': 'Nancy', 'team': 'tigers', 'goals': [...
9   {'name': 'Lance', 'team': 'bears', 'goals': [1...
10  {'name': 'Sara', 'team': 'tigers', 'goals': [0...
11  {'name': 'Ryan', 'team': 'tigers', 'goals': [1...

I'll have to go now but I can try to reproduce the main error with older pandas later today, hopefully.
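For reference, the two environments reported in this thread can be compared mechanically. The sketch below hard-codes the version tables posted above; diff_versions is a hypothetical helper, and the result only shows which packages differ, not which one is responsible for the change in groupby behaviour.

```python
# Versions copied from the two reports in this thread.
first_env = {
    "awkward": "2.3.2",
    "awkward_pandas": "2023.8.0",
    "numpy": "1.23.5",
    "pandas": "1.5.2",
}
fresh_env = {
    "awkward": "2.3.2",
    "awkward_pandas": "2023.8.0",
    "numpy": "1.25.2",
    "pandas": "2.0.3",
}


def diff_versions(a, b):
    """Return {package: (version in a, version in b)} for packages that differ."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


changed = diff_versions(first_env, fresh_env)
for name, (old, new) in changed.items():
    print(f"{name}: {old} -> {new}")
```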