pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: stacked bar graphs show invalid label position due to invalid rectangle bottom when data is 0 #59429

Open KinzigFlyer opened 1 month ago

KinzigFlyer commented 1 month ago

Reproducible Example

import pandas as pd
import matplotlib.pyplot as plt
permutations = [(a,b,c) for a in range(2) for b in range(2) for c in range(3)]
data = [
    {'i': i, 'a':a, 'b':b, 'c':c, 't': a+b+c}
    for i, (a,b,c) in enumerate(permutations)
]
df = pd.DataFrame.from_dict(data)
ax = df[['a','b', 'c']].plot.bar(stacked=True)
bl = ax.bar_label(ax.containers[-1], df['t'])
plt.show()

Issue Description

If the top segment of a stacked bar has a data value of 0, the bar label appears at the bottom of the bar instead of on top.

Further debugging shows that every bar whose data value is 0 has its y position set to 0.0. Its y position (the bottom of the rectangle) should instead be the top of the bar beneath it.
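The check described above can be sketched as a small helper that compares each rectangle's bottom with the summed height of the bars beneath it. The name `find_misplaced_bars` and the sample frame are made up for illustration. The sketch assumes non-negative data and that `ax.containers` holds one bar container per column, in column order.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def find_misplaced_bars(ax, frame):
    """List bars whose bottom differs from the summed height of the bars below.

    Hypothetical helper: assumes non-negative data and one container per
    column of `frame`, in column order.
    """
    values = frame.to_numpy(dtype=float)
    # Expected bottom of each segment: running row total minus the segment itself.
    expected = np.cumsum(values, axis=1) - values
    misplaced = []
    for col, container in enumerate(ax.containers):
        for row, bar in enumerate(container):
            if not np.isclose(bar.get_y(), expected[row, col]):
                misplaced.append(
                    (frame.columns[col], row, bar.get_y(), expected[row, col])
                )
    return misplaced


# Illustrative data with zeros in the middle and top layers.
df = pd.DataFrame({"a": [1, 1, 0], "b": [0, 1, 1], "c": [2, 0, 0]})
ax = df.plot.bar(stacked=True)
for column, row, actual, wanted in find_misplaced_bars(ax, df):
    print(f"column {column!r}, row {row}: y = {actual}, expected {wanted}")
```

On an affected version the zero-height bars that sit on a non-empty stack should be listed. On a version where the positions are correct the loop prints nothing.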

Expected Behavior

Bar labels should be positioned on top of every stack.

This behaviour can be produced by correcting the y positions of the defective bars:

def correct_stack(container, info=False):
    """Correct the y positions of stacked bars with 0 height.

    This is needed because the y position is calculated wrongly when the data value is 0 on stacked bars created by pandas plot.bar.
    """
    # Attention: since we start at row 1, r indexes the row below the current one, which is the one we need
    for r, row in enumerate(container[1:]):
        for b, bar in enumerate(row):
            (my_x, my_y), my_height = bar.xy, bar.get_height()
            # note that r points to the row below the current bar, and b is the stack
            support = container[r][b]    # this is the bar we are resting on top of
            (s_x, s_y), s_height = support.xy, support.get_height()
            if info:
                print(f"bar at row: {r+1}, col: {b}: ({my_x}, {my_y}) - {my_height} resting on top of ({s_x}, {s_y}) - {s_height}")
            if my_y < s_y + s_height:
                print(f"bar at row: {r+1}, col: {b}: {my_y = } is lower than expected {s_y + s_height}")
                bar.xy = (my_x, s_y + s_height)

ax2 = df[['a','b','c']].plot.bar(stacked=True)
correct_stack(ax2.containers)
bl = ax2.bar_label(ax2.containers[-1], df['t'])
plt.show()
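A different workaround, sketched here under stated assumptions, avoids touching the rectangles at all and writes each label at the row total. The helper name `label_stack_totals` and the sample frame are hypothetical. The sketch assumes non-negative data and that pandas centres bar i at x = i.

```python
import matplotlib.pyplot as plt
import pandas as pd


def label_stack_totals(ax, frame):
    """Annotate each stack with its row total, ignoring the bars' own y values.

    Hypothetical helper: assumes non-negative data and bars centred on the
    integer x positions 0..n-1.
    """
    totals = frame.sum(axis=1)
    labels = []
    for x, total in enumerate(totals):
        labels.append(
            ax.annotate(
                str(total),
                xy=(x, total),            # top of the stack in data coordinates
                xytext=(0, 3),            # small upward offset in points
                textcoords="offset points",
                ha="center",
                va="bottom",
            )
        )
    return labels


# Illustrative data; the top layer is 0 in the last two rows.
df = pd.DataFrame({"a": [1, 1, 0], "b": [0, 1, 1], "c": [2, 0, 0]})
ax = df.plot.bar(stacked=True)
labels = label_stack_totals(ax, df)
```

Because the label position comes from the data and not from the last container, it does not depend on where the zero-height rectangles end up.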

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252

pandas : 2.2.2
numpy : 2.0.1
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 65.5.0
pip : 24.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.9.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

KevsterAmp commented 1 month ago

take

KinzigFlyer commented 1 month ago

Thanks @KevsterAmp for looking into this

KevsterAmp commented 1 month ago

All good, just waiting for this issue to be triaged by a maintainer before working on it.

KevsterAmp commented 1 month ago

@KinzigFlyer - this seems to be a problem in matplotlib, not in pandas itself

KinzigFlyer commented 1 month ago

I don't think so. Matplotlib does not provide stacking on its own; you create a stacked bar by passing the "bottom" parameter to the bars. So I think bottom is calculated inside pandas. See this page in the official matplotlib documentation: Stacked Bar charts
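One way a zero could end up with bottom 0 is sketched below. It assumes the stacking code keeps separate running totals for positive and negative values and sorts each value with a strict `y > 0` test; that is a guess about the internals, not something confirmed in this thread. The function `stacked_bottoms` and the numbers are purely illustrative.

```python
import numpy as np


def stacked_bottoms(columns, zero_counts_as_positive):
    """Return the bottom of every bar, one array per stacked column.

    Hypothetical model: positive values stack upward from a running positive
    total, all other values stack downward from a running negative total.
    """
    pos_prior = np.zeros(len(columns[0]))
    neg_prior = np.zeros(len(columns[0]))
    bottoms = []
    for y in columns:
        y = np.asarray(y, dtype=float)
        mask = y >= 0 if zero_counts_as_positive else y > 0
        # A value outside the mask is treated as part of the negative stack.
        bottoms.append(np.where(mask, pos_prior, neg_prior))
        pos_prior = pos_prior + np.where(mask, y, 0)
        neg_prior = neg_prior + np.where(mask, 0, y)
    return bottoms


lower = [2, 3, 1]
upper = [1, 0, 4]
strict = stacked_bottoms([lower, upper], zero_counts_as_positive=False)
inclusive = stacked_bottoms([lower, upper], zero_counts_as_positive=True)
print(strict[1])     # bottoms of the upper layer with the strict test
print(inclusive[1])  # bottoms of the upper layer when 0 counts as positive
```

In this model the strict test sends the zero in the upper layer to the (empty) negative stack, so its bottom is 0, while the inclusive test keeps it on top of the lower bar.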

KevsterAmp commented 1 month ago

Good point, thank you

KinzigFlyer commented 1 month ago

I took the official example program, changed one of the "Above" values to 0, and added the bar label. It works correctly.

import matplotlib.pyplot as plt
import numpy as np

# data from https://allisonhorst.github.io/palmerpenguins/

species = (
    "Adelie\n $\\mu=$3700.66g",
    "Chinstrap\n $\\mu=$3733.09g",
    "Gentoo\n $\\mu=5076.02g$",
)
weight_counts = {
    "Below": np.array([70, 31, 58]),
    "Above": np.array([82, 0, 66]),
}
width = 0.5

fig, ax = plt.subplots()
bottom = np.zeros(3)

for boolean, weight_count in weight_counts.items():
    p = ax.bar(species, weight_count, width, label=boolean, bottom=bottom)
    bottom += weight_count

ax.bar_label(p, weight_counts['Above'])
ax.set_title("Number of penguins with above average body mass")
ax.legend(loc="upper right")

plt.show()
KinzigFlyer commented 1 month ago

Converting the official example to a pandas-driven version reproduces the error:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# data from https://allisonhorst.github.io/palmerpenguins/
penguins = pd.DataFrame.from_dict([
    {'species': "Adelie", 'Below': 70.0, 'Above': 82.0},
    {'species': "Chinstrap", 'Below': 31.0, 'Above': 0.0},
    {'species': "Gentoo", 'Below': 58.0, 'Above': 66.0},
]).set_index('species')
width = 0.5

ax2 = penguins.plot.bar(stacked = True)

ax2.bar_label(ax2.containers[-1], penguins['Above'])
ax2.set_title("Number of penguins with above average body mass")
ax2.legend(loc="upper right")

plt.show()
rhshadrach commented 1 month ago

Thanks for the report - PRs to fix are welcome!