pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: margins=True is affected by NaN in other columns in pivot_table #33466

Open nikhilpatwardhan opened 4 years ago

nikhilpatwardhan commented 4 years ago


Code Sample, a copy-pastable example

>>> import pandas as pd
>>> data = {'A': [6, 7, None], 'B': [None, 5, 8], 'name': ['Alice', 'Alice', 'Alice']}
>>> df = pd.DataFrame(data)
>>> df.pivot_table(index='name', values=['A', 'B'], aggfunc=['min', 'max'], margins=True)

Output:

       min       max     
         A    B    A    B
name                     
Alice  6.0  5.0  7.0  8.0
All    7.0  5.0  7.0  5.0

Problem description

The 'All' summary row shows the min of A as 7.0 and the max of B as 5.0, although the min of A is 6.0 and the max of B is 8.0.

Expected Output

       min       max     
         A    B    A    B
name                     
Alice  6.0  5.0  7.0  8.0
All    6.0  5.0  7.0  8.0

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 6bc8a49f2595daca6f6d0391020981d536747b23
python           : 3.8.2.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.3.0
Version          : Darwin Kernel Version 19.3.0: Thu Jan 9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0.dev0+1225.g6bc8a49f2
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 46.1.3.post20200325
Cython           : 0.29.16
pytest           : 5.4.1
hypothesis       : 5.8.0
sphinx           : 3.0.1
blosc            : None
feather          : None
xlsxwriter       : 1.2.8
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.13.0
pandas_datareader: None
bs4              : 4.9.0
bottleneck       : 1.3.2
fastparquet      : 0.3.3
gcsfs            : None
matplotlib       : 3.2.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : 0.16.0
pytables         : None
pyxlsb           : None
s3fs             : 0.4.2
scipy            : 1.4.1
sqlalchemy       : 1.3.16
tables           : 3.6.1
tabulate         : 0.8.7
xarray           : 0.15.1
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.48.0

dsaxton commented 4 years ago

This looks to be the source of the bug (it's dropping rows with any missing data after calculating the pivot table but before getting the margins): https://github.com/pandas-dev/pandas/blob/d106b81ce532bc71ec6cced944ddb751a4b0e5a3/pandas/core/reshape/pivot.py#L159

which was itself added to fix a bug in crosstab: https://github.com/pandas-dev/pandas/pull/12614. So somehow this would have to be fixed without breaking crosstab.
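To illustrate the diagnosis above, here is a minimal sketch that mimics the effect of dropping every row containing a missing value before aggregating. It does not call the pandas internals linked above; the variable names (`complete_rows`, `buggy_margin`, `expected_margin`) are made up for this illustration. With the data from the original report, the row-wise drop leaves only one row, which reproduces the reported 'All' values.

```python
import pandas as pd

# Data from the original report
df = pd.DataFrame({
    'A': [6, 7, None],
    'B': [None, 5, 8],
    'name': ['Alice', 'Alice', 'Alice'],
})

# Mimic the suspected behaviour: drop any row that has a NaN in A or B.
# Only the middle row (A=7, B=5) survives.
complete_rows = df.dropna(subset=['A', 'B'])
buggy_margin = complete_rows[['A', 'B']].agg(['min', 'max'])

# What the reporter expects: each column skips only its own NaN values.
expected_margin = df[['A', 'B']].agg(['min', 'max'])

print(buggy_margin)     # min and max are both A=7.0, B=5.0
print(expected_margin)  # min: A=6.0, B=5.0; max: A=7.0, B=8.0
```

The first result matches the 'All' row in the reported output, and the second matches the expected output, which is consistent with the row-dropping explanation.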

theavey commented 2 years ago

Still exists in pandas 1.4.3

>>> df = pd.DataFrame({"a": [1, 2, None, 4], "b": ["a", "a", "a", "b"], "c": [1] * 4})
>>> df
      a  b  c
0  1.00  a  1
1  2.00  a  1
2   NaN  a  1
3  4.00  b  1
>>> df.pivot_table(index="b", values=["a", "c"], aggfunc="sum", margins=True)
        a  c
b
a    3.00  3
b    4.00  1
All  7.00  3

Would instead expect

>>> df.pivot_table(index="b", values=["a", "c"], aggfunc="sum", margins=True)
        a  c
b
a    3.00  3
b    4.00  1
All  7.00  4
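
Until this is fixed, one possible workaround is to build the pivot table without `margins=True` and append the 'All' row by hand, so that each column is aggregated over its own non-missing values. This is only a sketch under the assumption that the aggregation is a plain per-column `sum`; the name `table` is illustrative, and other aggregation functions or multi-level columns would need a different approach.

```python
import pandas as pd

# Data from the example above
df = pd.DataFrame({"a": [1, 2, None, 4], "b": ["a", "a", "a", "b"], "c": [1] * 4})

# Pivot without margins, so no rows are dropped for the summary
table = df.pivot_table(index="b", values=["a", "c"], aggfunc="sum")

# Append the 'All' row manually; sum() skips NaN per column,
# so the NaN in "a" does not remove a row from the total of "c".
table.loc["All"] = df[["a", "c"]].sum()

print(table)
```

With this approach the 'All' row contains 7.0 for `a` and 4 for `c`, matching the expected output. Note that appending a row this way may upcast the integer column `c` to float.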