pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.77k stars 17.62k forks source link

Pivot Table Margins aren't correctly showing NaN values despite dummy variables #30351

Open pshMsGsAZL2n07ohWBCAGX9cC opened 4 years ago

pshMsGsAZL2n07ohWBCAGX9cC commented 4 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

fig = plt.figure()

data = [['Auto', 'A', 10], ['Auto', 'A', 2], ['Telco', 'B', 15], ['Manf', 'C'],
        ['Retail', 'A', ], ['Retail', 'B', ]] 
df = pd.DataFrame(data, columns = ['INDUSTRY', 'PRODUCT', 'COST'])

table_byindustry = pd.pivot_table(df, index=['INDUSTRY'], columns=['PRODUCT'], values='COST',
                                  fill_value=0, margins=True, margins_name='Total', aggfunc='count')
ax = sns.heatmap(table_byindustry, cmap='RdYlGn', linewidths=.1, annot=True, vmin=0, vmax=100, fmt="g", xticklabels=True)
ax.set_ylabel("Industry", fontsize=12)
ax.set_xlabel("Product", fontsize=12, labelpad=1)
ax.xaxis.tick_bottom()
ax.set_ylim(bottom=5,top=0)

Problem description

Pivot Table Margins ("Total") isn't showing 0 values for rows, yet will do so for columns (e.g., Product C). As a result, the heatmap is empty for the "Total" column related to the rows ('Manf' and 'Retail') in my example. It is treating those rows as NaN for Margin, and setting fill_value at 0 doesn't fix it. See attached image. Thank you for your efforts to fix this issue.

Expected Output

Based on the color map, there should be a color for every cell in the pivot table, including the Margins. In addition, if the value is 0, it should be shown when the margins=True. It seems as if the treatment is inconsistent (i.e., view the total for Product C vs. the totals for 'Manf' and 'Retail' in Industry).

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.7.5.final.0 python-bits : 64 OS : Linux OS-release : 4.19.88-1-MANJARO machine : x86_64 processor : byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 0.25.3 numpy : 1.17.4 pytz : 2019.3 dateutil : 2.8.1 pip : 19.3.1 setuptools : 42.0.2.post20191203 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : 0.4.0 xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 2.10.3 IPython : 7.10.2 pandas_datareader: None bs4 : 4.8.1 bottleneck : None fastparquet : None gcsfs : None lxml.etree : None matplotlib : 3.1.1 numexpr : None odfpy : None openpyxl : 3.0.2 pandas_gbq : None pyarrow : 0.13.0 pytables : None s3fs : None scipy : 1.3.2 seaborn : 0.9.0 sqlalchemy : None tables : None xarray : None xlrd : 1.2.0 xlwt : 1.3.0 xlsxwriter : None

bug_image_19Dec2019

TomAugspurger commented 4 years ago

I'm not sure how the plotting stuff is related, but based on the title this sounds like https://github.com/pandas-dev/pandas/issues/3729. Let me know if not.

pshMsGsAZL2n07ohWBCAGX9cC commented 4 years ago

No, this issue is different from #3729. This issue isn’t dropping entire NaN rows like #3729, and my issue already uses the dummy variables suggested in that post as a workaround. My issue is dropping a NaN total in the margin row (but not column), while the dummy variables appear fine. I would suggest that you re-open it. Thanks again, and happy holidays.

pshMsGsAZL2n07ohWBCAGX9cC commented 4 years ago

Forgot to add in first clarification. This issue occurs without doing a heatmap plot. That is, the row margin total shows NaN, despite zeros for the row elements. Column margin total doesn’t have this issue.

pshMsGsAZL2n07ohWBCAGX9cC commented 4 years ago

I'm not sure how the plotting stuff is related, but based on the title this sounds like #3729. Let me know if not.

Tom, I added a clarification about a week ago, saying that I didn't think it was a duplicate. Can you please take a look? Apologies in advance, as I'm relatively new to github and don't understand exactly how issues are "re-opened". Thanks.

TomAugspurger commented 4 years ago

@pshMsGsAZL2n07ohWBCAGX9cC can you make a minimal example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports Is seaborn / matplotlib necessary, or does just

In [2]: table_byindustry
Out[2]:
PRODUCT   A  B  C  Total
INDUSTRY
Auto      2  0  0    2.0
Manf      0  0  0    NaN
Retail    0  0  0    NaN
Telco     0  1  0    1.0
Total     2  1  0    3.0

display the issue? Is it that the total for Manf and Retail are NaN, rather than 0?

pshMsGsAZL2n07ohWBCAGX9cC commented 4 years ago

@TomAugspurger Seaborn and matplotlib aren't necessary, although that's how I first saw the problem. The dataframe (table_byindustry) shows the issue, which you correctly summarized. Total for the Manf and Retail rows are NaN, rather than 0. Thanks for re-opening it and sharing how I can better report future bugs with minimal information.

aoot commented 3 months ago

Seems like this behavior is due to margin calculation using the groupby() method and na are default dropped by groupby().

My workaround is to just use isna() on the pivot_table output.

pshMsGsAZL2n07ohWBCAGX9cC commented 3 months ago

@aoot Thanks for the workaround. At the time, I wanted a table that would properly show rows with zero values (e.g., Manf and Retail) rather than NaNs without the need to handle edge cases (e.g., use isna() to identify such rows then replace them with zeros). Looking back, I probably should have address the issue in data prep/cleansing before creating the table. However, my expectation was that pivot_table() and groupby() would be more Excel-like in their behavior with respect to this issue ;) Thanks again for sharing.