Open BraisLP opened 2 years ago
I'm using pandas 1.3.4 and python 3.9.13 I am getting the same output
Hi, Can I help in contributing to this issue?
Thanks @BraisLP for the report
Here's a "minimal" example: if I provide e.g. two rows to
pd.crosstab
, one of which is of categorical type, there's odd behaviour exclusively whenaggfunc=sum
ornp.sum
:
from the docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.
This produces the odd-looking matrix:
given the above, I'm not so sure if this isn't the expected behavior and the other cases are perhaps the incorrect ones.
for instance if the cat series included an unused 'c'
.
for the withsum case in the OP, you get
dummy 0
str cat
a a 2
b 0
c 0
b a 0
b 8
c 0
and for the withmin case
dummy 0
str cat
a a 1
b b 2
The withmin case does not include the unused category as so is IMO incorrect and the withsum case repeats the unused category for all values in the first level which makes sense and consitent with including the extra rows in the withsum case without unused categories.
I thought I'd add the example with the inverted row order that I mention in a comment: it includes all of the "categories" even if the inner row is not categorical.
withsum_inverted = pd.crosstab([rowcat, rowstr],
columns=dummy,
values=values,
aggfunc=sum, # same for np.sum
)
>>> withsum_inverted
dummy 0
cat str
a a 2
b 0
b a 0
b 8
I thought I'd add the example with the inverted row order that I mention in a comment: it includes all of the "categories" even if the inner row is not categorical.
again, what would be the expected output if the category Series includes an used category. The first level of the MultiIndex should include the unused category (according to the docs), but what would be the second level? maybe a missing value indicator (such as np.nan) or all of the unique values in the second level? (i.e. the MultiIndex should be the product of the categories with the unique values in the non categorical series)
creating the MutiIndex from the product would be consistent with the current sum behavior for the reversed level order.
again, what would be the expected output if the category Series includes an used category
In this last example, the column that gets "expanded" to include unused values is not categorical (the column titled "str" was defined in my first post as rowstr = pd.Series(['a', 'b', 'b', 'a', 'b', 'b'], dtype='str', name='str')
). Maybe internally it got converted to a categorical.
Time permitting, I'm going to have a look at this in more detail...
I discovered this in pandas
1.2.5
and I tested it in1.4.2
and1.3.1
. I haven't found a similar issue reported here. I'm using python 3.9.10Here's a "minimal" example: if I provide e.g. two rows to
pd.crosstab
, one of which is of categorical type, there's odd behaviour exclusively whenaggfunc=sum
ornp.sum
:This produces the odd-looking matrix:
Compare to the following three examples: