pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.42k stars 17.85k forks source link

BUG: Crosstab inconsistent with using aggfunc=sum when MultiIndex has category #47147

Open BraisLP opened 2 years ago

BraisLP commented 2 years ago

I discovered this in pandas 1.2.5 and I tested it in 1.4.2 and 1.3.1. I haven't found a similar issue reported here. I'm using python 3.9.10

Here's a "minimal" example: if I provide e.g. two rows to pd.crosstab, one of which is of categorical type, there's odd behaviour exclusively when aggfunc=sum or np.sum:

import pandas as pd

dummy = pd.Series([0]*6, name='dummy')
rowcat = pd.Series(['a', 'b', 'b', 'a', 'b', 'b'], dtype='category', name='cat')
rowstr = pd.Series(['a', 'b', 'b', 'a', 'b', 'b'], dtype='str', name='str')
values = pd.Series([1, 2, 2, 1, 2, 2], name='val')

withsum = pd.crosstab([rowstr, rowcat],  # it looks the same for [rowcat, rowstr]
            columns=dummy,
            values=values,
            aggfunc=sum,  # same for np.sum
)

This produces the odd-looking matrix:

>>> withsum
dummy    0
str cat   
a   a    2
    b    0
b   a    0
    b    8

Compare to the following three examples:

withmin = pd.crosstab(index=[rowstr, rowcat],
            columns=dummy,
            values=values,
            aggfunc=min,  # or any other np agg func -> it's OK
)
>>> withmin
dummy    0
str cat   
a   a    1
b   b    2
withlambda = pd.crosstab(index=[rowstr, rowcat],
            columns=dummy,
            values=values,
            aggfunc=lambda x: sum(x),  # dirty fix -> it's OK!
)
>>> withlambda
dummy    0
str cat   
a   a    2
b   b    8
withoutfunc = pd.crosstab(index=[rowstr, rowcat],
            columns=dummy,
)  # no agg func-> it's OK
>>> withoutfunc
dummy    0
str cat   
a   a    2
b   b    4
MaruthiKo commented 2 years ago

I'm using pandas 1.3.4 and python 3.9.13 I am getting the same output

Screenshot (130) Screenshot (129) Screenshot (128)
MaruthiKo commented 2 years ago

Hi, Can I help in contributing to this issue?

simonjayhawkins commented 2 years ago

Thanks @BraisLP for the report

Here's a "minimal" example: if I provide e.g. two rows to pd.crosstab, one of which is of categorical type, there's odd behaviour exclusively when aggfunc=sum or np.sum:

from the docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

This produces the odd-looking matrix:

given the above, I'm not so sure if this isn't the expected behavior and the other cases are perhaps the incorrect ones.

for instance if the cat series included an unused 'c'.

for the withsum case in the OP, you get

dummy    0
str cat   
a   a    2
    b    0
    c    0
b   a    0
    b    8
    c    0

and for the withmin case

dummy    0
str cat   
a   a    1
b   b    2

The withmin case does not include the unused category as so is IMO incorrect and the withsum case repeats the unused category for all values in the first level which makes sense and consitent with including the extra rows in the withsum case without unused categories.

simonjayhawkins commented 2 years ago

16367 related

BraisLP commented 2 years ago

I thought I'd add the example with the inverted row order that I mention in a comment: it includes all of the "categories" even if the inner row is not categorical.

withsum_inverted = pd.crosstab([rowcat, rowstr],
            columns=dummy,
            values=values,
            aggfunc=sum,  # same for np.sum
)
>>> withsum_inverted
dummy    0
cat str   
a   a    2
    b    0
b   a    0
    b    8
simonjayhawkins commented 2 years ago

I thought I'd add the example with the inverted row order that I mention in a comment: it includes all of the "categories" even if the inner row is not categorical.

again, what would be the expected output if the category Series includes an used category. The first level of the MultiIndex should include the unused category (according to the docs), but what would be the second level? maybe a missing value indicator (such as np.nan) or all of the unique values in the second level? (i.e. the MultiIndex should be the product of the categories with the unique values in the non categorical series)

creating the MutiIndex from the product would be consistent with the current sum behavior for the reversed level order.

BraisLP commented 2 years ago

again, what would be the expected output if the category Series includes an used category

In this last example, the column that gets "expanded" to include unused values is not categorical (the column titled "str" was defined in my first post as rowstr = pd.Series(['a', 'b', 'b', 'a', 'b', 'b'], dtype='str', name='str')). Maybe internally it got converted to a categorical.

Time permitting, I'm going to have a look at this in more detail...