Open rhshadrach opened 1 year ago
Instead of adding cartesian_product, we could make observed=False take the cartesian product of both the index and the columns, regardless of whether they are categorical. In my opinion, this would only be okay to do if #55261 gets implemented.
Doing this would mean more than just passing observed=False through to groupby - that would only handle the index. We would still need to take the cartesian product of the columns in the case they are a MultiIndex, as is done with dropna=False today.
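As a rough illustration of that extra step, the sketch below expands MultiIndex columns to the full cartesian product of their levels. The table and its values are made up for the example, and this is only one way to do the expansion, not necessarily how pivot_table would implement it.

```python
import numpy as np
import pandas as pd

# Hypothetical pivot result whose columns hold only the observed combinations.
observed = pd.MultiIndex.from_tuples(
    [("a", "x"), ("a", "y"), ("b", "x")], names=["col1", "col2"]
)
table = pd.DataFrame(
    [[10.0, 20.0, np.nan], [np.nan, np.nan, 30.0]],
    index=pd.Index([1, 2], name="idx"),
    columns=observed,
)

# Build every combination of the column levels, observed or not.
full = pd.MultiIndex.from_product(table.columns.levels, names=table.columns.names)

# Reindexing adds the missing ("b", "y") column, filled with NaN.
expanded = table.reindex(columns=full)
print(expanded)
```

The same reindex could be given a fill_value if the aggregation has a natural value for empty groups.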
If we're going this route, then I think we should also adhere to groupby semantics with unobserved groupings for various ops. For example:
import pandas as pd

df = pd.DataFrame(
    {
        'idx1': 1,
        'idx2': [2, 3],
        'col1': 4,
        'col2': [5, 6],
        'val1': [7, 8],
        'val2': [9, 10],
    }
)
df.pivot_table(
    index=['idx1', 'idx2'],
    columns=['col1', 'col2'],
    values=['val1', 'val2'],
    aggfunc='sum',
    dropna=False,
)
currently results in NaN values in various locations; instead, with observed=False it should be 0 because the specified aggfunc is sum.
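To make the comparison concrete, here is a small sketch that checks the current output for NaN, shows the groupby behavior being referred to (an unobserved category sums to 0), and approximates the proposed result with the existing fill_value argument. The fill_value workaround is my own stand-in for illustration and only matches for sum; the categorical frame is made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "idx1": 1,
        "idx2": [2, 3],
        "col1": 4,
        "col2": [5, 6],
        "val1": [7, 8],
        "val2": [9, 10],
    }
)
kwargs = dict(
    index=["idx1", "idx2"],
    columns=["col1", "col2"],
    values=["val1", "val2"],
    aggfunc="sum",
    dropna=False,
)

# Current behavior: combinations that never occur together come back as NaN.
current = df.pivot_table(**kwargs)
has_nan = bool(current.isna().any().any())

# groupby semantics for comparison: the unobserved category "b" sums to 0.
cat = pd.DataFrame(
    {
        "key": pd.Categorical(["a", "a"], categories=["a", "b"]),
        "val": [1, 2],
    }
)
sums = cat.groupby("key", observed=False)["val"].sum()

# Approximation of the proposed result (valid for sum only): fill with 0.
approx = df.pivot_table(**kwargs, fill_value=0)
print(current, sums, approx, sep="\n\n")
```

Other aggregations would need a different value for empty groups (for example, mean would stay NaN under groupby semantics), which is why a blanket fill_value is only an approximation.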
Currently, dropna is used in four places within DataFrame.pivot_table:
(1), (2), and (4) were all implemented for crosstab, which is essentially a call to pivot_table.
The API docs for crosstab document the dropna argument as:
The only other documentation in the API and User Guide mentions using dropna=False to include rows/columns for categorical data with missing categorical values.
I think this is too much for a single Boolean argument to handle. I propose the following:
a. Add cartesian_product=[True|False] to pivot_table and crosstab
b. Add observed=[True|False] to crosstab for use with categoricals
c. Deprecate behavior (1) (with dropna), (3), and (4) above. The user may do each of these by dropping null values from the input data if they so desire.
We can implement (c) without affecting the behavior of crosstab by changing the data there to be a mixture of null/non-null values depending on the input and using the aggregation count instead of len.
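The difference between the two aggregations is what makes this work: count skips null values while len counts every row. A minimal sketch with made-up data (this shows the general groupby behavior, not the crosstab internals):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "val": [1.0, np.nan, 3.0],
    }
)

# len counts every row in the group, including the null one.
by_len = df.groupby("key")["val"].agg(len)

# count only counts the non-null values.
by_count = df.groupby("key")["val"].count()

print(by_len)
print(by_count)
```

So if crosstab marked the rows it wants excluded as null in the aggregated data, count would leave them out where len would not.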