Open peisenha opened 7 years ago
The docstring even has an example further down
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> crosstab(foo, bar) # 'c' and 'f' are not represented in the data,
# but they still will be counted in the output
col_0 d e f
row_0
a 1 0 0
b 0 1 0
c 0 0 0
which is not what I get, so defiantly a bug somewhere. I suspect that https://github.com/pandas-dev/pandas/pull/15511/ may be related, since
In [14]: crosstab(foo, bar, dropna=False)
Out[14]:
col_0 d e f
row_0
a 1 0 0
b 0 1 0
c 0 0 0
does produce the correct output. crosstab
is defined in pandas/core/reshape/pivot.py
, if you want to start there.
Thanks a lot for the swift response. I will give it a try.
On Tue, May 16, 2017 at 2:52 PM, Tom Augspurger notifications@github.com wrote:
The docstring even has an example further down
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])>>> crosstab(foo, bar) # 'c' and 'f' are not represented in the data,
but they still will be counted in the output
col_0 d e f row_0 a 1 0 0 b 0 1 0 c 0 0 0
which is not what I get, so defiantly a bug somewhere. I suspect that
15511 https://github.com/pandas-dev/pandas/pull/15511 may be related,
since
In [14]: crosstab(foo, bar, dropna=False) Out[14]: col_0 d e f row_0 a 1 0 0 b 0 1 0 c 0 0 0
does produce the correct output. crosstab is defined in pandas/core/reshape/pivot.py, if you want to start there.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/16367#issuecomment-301772313, or mute the thread https://github.com/notifications/unsubscribe-auth/ADcMJ6Was7MAnYZ3x3ITFPtYjAvz_GVGks5r6ZwOgaJpZM4Ncb6_ .
-- Philipp Eisenhauer Economist
Mail eisenhauer@policy-lab.org Web www.eisenhauer.io Repository https://github.com/peisenha
This is the relevant code in pivot_table()
# GH 15193 Makse sure empty columns are removed if dropna=True
if isinstance(table, DataFrame) and dropna:
table = table.dropna(how='all', axis=1)
If you agree, just let me know and I will be glad to adjust the documentation accordingly.
It seems like the resolution from https://github.com/pandas-dev/pandas/issues/12298 was that all the categories should be present in the output. https://github.com/pandas-dev/pandas/pull/15511 seems to go against that... So I think this is a regression and not just a doc issue.
I think the issue is that the meaning of dropna
becomes ambiguous when you have a Categorical. With a categorical that isn't observed, you're going to introduce NaNs by definition. I hate to make the dropna
argument to pivot_table
even more confusing, but I think its behavior should depend on the type of the index
or columns
. This may be tricky to do correctly, as columns
in pivot_table
can be a list :/
Alright, if you would like me to take a crack at it, let me know. I will be glad to provide a fix and a regression test ... As this would be my first contribution to the library, I will probably need some guidance in the process.
Yeah, it'd be great if you can take a shot.
But first, let's see if @jreback and @jorisvandenbossche agree that the documented version is correct, and that treating dropna
differently for Categoricals is a good idea.
Focusing on crosstab
for a moment, what is an example when you get a column with NaNs? (trying to understand the usecase of the dropna
method)
In this case, crosstab is a pivot_table
followed up by a fillna(0)
:
table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
aggfunc=len, margins=margins, dropna=dropna)
table = table.fillna(0).astype(np.int64)
so the change to pivot_table
's NaN handling affected this downstream.
So the dropna in crosstab is actually also a bit confusing explantion, as you are not dropping all NaN columns, but all 0 columns :-) But apart from that, I am still wondering when you get such a column. Because the columns are created from the values, so how is it possible (apart from using categoricals) that a value is not present?
so how is it possible (apart from using categoricals) that a value is not present?
I guess it can happen if you have multiple levels, some of which aren't observed:
In [21]: df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 3], 'C': ['a', 'a', 'a', 'b']})
In [22]: pd.crosstab(df.A, [df.B, df.C], dropna=False)
Out[22]:
B 1 2 3
C a b a b a b
A
1 1 0 1 0 0 0
2 1 0 0 0 0 1
from the example above.
In [7]: pd.crosstab(df.A, [df.B, df.C], dropna=False)
Out[7]:
B 1 2 3
C a b a b a b
A
1 1 0 1 0 0 0
2 1 0 0 0 0 1
In [8]: pd.crosstab(df.A, [df.B, df.C], dropna=True)
Out[8]:
B 1 2 3
C a a b
A
1 1 1 0
2 1 0 1
I think the issue is that the meaning of dropna
is just confusing. maybe have strings instead?
are there 3 cases here that are useful?
Could a solution to this problem be to change the default of dropna
to None instead of True?
So if dropna=None
would then depend on the dtype: False for categorical, True for other dtypes.
Hey, what needs to be done for this, maybe I can give it a try?
Hello! I am interested to help in this issue.
I just tried this on master and got
>>> import pandas as pd
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> pd.crosstab(foo, bar)
col_0 d e
row_0
a 1 0
b 0 1
>>> pd.crosstab(foo, bar, dropna=False)
col_0 d e f
row_0
a 1 0 0
b 0 1 0
c 0 0 0
which seems correct and in accordance with the description given alongside the example in the docs.
The only part which strikes me as not correct is that the docs still read
Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.
So, if that line in the docs is changed to
When using
dropna=False
, any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.
then can we close the issue? Changing the default type of dropna
would be a breaking change, and I'm not sure it would be worth it
Code Sample, a copy-pastable example if possible
Problem description
This is from the documentation:
However, f is not included in the table while c is.
Please let me know if this is in fact a bug, then I will be glad to write give writing a patch a try.
Thanks a lot in advance!
Expected Output
col_0 d e f row_0 a 1 0 0 b 0 1 0 c 0 0 0
Output of
pd.show_versions()