pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.79k stars 17.98k forks source link

pd.crosstab, categorical data and missing instances #16367

Open peisenha opened 7 years ago

peisenha commented 7 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
pd.crosstab(foo, bar)

col_0  d  e
row_0      
a      1  0
b      0  1
c      0  0

Problem description

This is from the documentation:

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

However, f is not included in the table while c is.

Please let me know if this is in fact a bug, then I will be glad to write give writing a patch a try.

Thanks a lot in advance!

Expected Output

col_0 d e f row_0 a 1 0 0 b 0 1 0 c 0 0 0

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.8.0-49-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.20.1 pytest: 2.8.7 pip: 8.1.1 setuptools: 20.7.0 Cython: None numpy: 1.12.1 scipy: 0.17.0 xarray: None IPython: None sphinx: None patsy: 0.4.1 dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: None tables: 3.2.2 numexpr: 2.6.2 feather: None matplotlib: 1.5.1 openpyxl: 2.3.0 xlrd: 0.9.4 xlwt: 0.7.5 xlsxwriter: None lxml: 3.5.0 bs4: 4.4.1 html5lib: 0.999 sqlalchemy: None pymysql: None psycopg2: None jinja2: None s3fs: None pandas_gbq: None pandas_datareader: None
TomAugspurger commented 7 years ago

The docstring even has an example further down

>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> crosstab(foo, bar)  # 'c' and 'f' are not represented in the data,
                        # but they still will be counted in the output
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

which is not what I get, so defiantly a bug somewhere. I suspect that https://github.com/pandas-dev/pandas/pull/15511/ may be related, since

In [14]: crosstab(foo, bar, dropna=False)
Out[14]:
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0

does produce the correct output. crosstab is defined in pandas/core/reshape/pivot.py, if you want to start there.

peisenha commented 7 years ago

Thanks a lot for the swift response. I will give it a try.

On Tue, May 16, 2017 at 2:52 PM, Tom Augspurger notifications@github.com wrote:

The docstring even has an example further down

foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])>>> crosstab(foo, bar) # 'c' and 'f' are not represented in the data,

but they still will be counted in the output

col_0 d e f row_0 a 1 0 0 b 0 1 0 c 0 0 0

which is not what I get, so defiantly a bug somewhere. I suspect that

15511 https://github.com/pandas-dev/pandas/pull/15511 may be related,

since

In [14]: crosstab(foo, bar, dropna=False) Out[14]: col_0 d e f row_0 a 1 0 0 b 0 1 0 c 0 0 0

does produce the correct output. crosstab is defined in pandas/core/reshape/pivot.py, if you want to start there.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/16367#issuecomment-301772313, or mute the thread https://github.com/notifications/unsubscribe-auth/ADcMJ6Was7MAnYZ3x3ITFPtYjAvz_GVGks5r6ZwOgaJpZM4Ncb6_ .

-- Philipp Eisenhauer Economist

Mail eisenhauer@policy-lab.org Web www.eisenhauer.io Repository https://github.com/peisenha

peisenha commented 7 years ago

15193 and #15511 are two related issues. Looking at the source code and the discussion, it seems to me that dropping empty columns is the desired behavior for dropna=True (default).

This is the relevant code in pivot_table()

    # GH 15193 Makse sure empty columns are removed if dropna=True
    if isinstance(table, DataFrame) and dropna:
        table = table.dropna(how='all', axis=1)

If you agree, just let me know and I will be glad to adjust the documentation accordingly.

TomAugspurger commented 7 years ago

It seems like the resolution from https://github.com/pandas-dev/pandas/issues/12298 was that all the categories should be present in the output. https://github.com/pandas-dev/pandas/pull/15511 seems to go against that... So I think this is a regression and not just a doc issue.

I think the issue is that the meaning of dropna becomes ambiguous when you have a Categorical. With a categorical that isn't observed, you're going to introduce NaNs by definition. I hate to make the dropna argument to pivot_table even more confusing, but I think its behavior should depend on the type of the index or columns. This may be tricky to do correctly, as columns in pivot_table can be a list :/

peisenha commented 7 years ago

Alright, if you would like me to take a crack at it, let me know. I will be glad to provide a fix and a regression test ... As this would be my first contribution to the library, I will probably need some guidance in the process.

TomAugspurger commented 7 years ago

Yeah, it'd be great if you can take a shot.

But first, let's see if @jreback and @jorisvandenbossche agree that the documented version is correct, and that treating dropna differently for Categoricals is a good idea.

jorisvandenbossche commented 7 years ago

Focusing on crosstab for a moment, what is an example when you get a column with NaNs? (trying to understand the usecase of the dropna method)

TomAugspurger commented 7 years ago

In this case, crosstab is a pivot_table followed up by a fillna(0):

        table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
                               aggfunc=len, margins=margins, dropna=dropna)
        table = table.fillna(0).astype(np.int64)

so the change to pivot_table's NaN handling affected this downstream.

jorisvandenbossche commented 7 years ago

So the dropna in crosstab is actually also a bit confusing explantion, as you are not dropping all NaN columns, but all 0 columns :-) But apart from that, I am still wondering when you get such a column. Because the columns are created from the values, so how is it possible (apart from using categoricals) that a value is not present?

TomAugspurger commented 7 years ago

so how is it possible (apart from using categoricals) that a value is not present?

I guess it can happen if you have multiple levels, some of which aren't observed:

In [21]: df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 3], 'C': ['a', 'a', 'a', 'b']})

In [22]: pd.crosstab(df.A, [df.B, df.C], dropna=False)
Out[22]:
B  1     2     3
C  a  b  a  b  a  b
A
1  1  0  1  0  0  0
2  1  0  0  0  0  1
jreback commented 7 years ago

from the example above.

In [7]: pd.crosstab(df.A, [df.B, df.C], dropna=False)
Out[7]: 
B  1     2     3   
C  a  b  a  b  a  b
A                  
1  1  0  1  0  0  0
2  1  0  0  0  0  1

In [8]: pd.crosstab(df.A, [df.B, df.C], dropna=True)
Out[8]: 
B  1  2  3
C  a  a  b
A         
1  1  1  0
2  1  0  1

I think the issue is that the meaning of dropna is just confusing. maybe have strings instead? are there 3 cases here that are useful?

jorisvandenbossche commented 6 years ago

Could a solution to this problem be to change the default of dropna to None instead of True? So if dropna=None would then depend on the dtype: False for categorical, True for other dtypes.

AbhayGoyal commented 3 years ago

Hey, what needs to be done for this, maybe I can give it a try?

Yoshishiro commented 3 years ago

Hello! I am interested to help in this issue.

MarcoGorelli commented 3 years ago

I just tried this on master and got

>>> import pandas as pd
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> pd.crosstab(foo, bar)
col_0  d  e
row_0      
a      1  0
b      0  1
>>> pd.crosstab(foo, bar, dropna=False)
col_0  d  e  f
row_0         
a      1  0  0
b      0  1  0
c      0  0  0

which seems correct and in accordance with the description given alongside the example in the docs.

The only part which strikes me as not correct is that the docs still read

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

So, if that line in the docs is changed to

When using dropna=False, any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

then can we close the issue? Changing the default type of dropna would be a breaking change, and I'm not sure it would be worth it