mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License

Smaller mark width for overlapping data with `so.Hist(common_bins=False)` #3769

Open maurosilber opened 1 week ago

maurosilber commented 1 week ago

When using `so.Hist(common_bins=False)`, if the bins of different groups overlap, the width calculated for each mark is smaller than it should be.

Here's a minimal working example with a dataset A and its x-shifted copy B = A + shift. Each row plots a different shift; once the two groups start overlapping, the bars become narrower than the bins.

[Image: one histogram row per shift value]

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn.objects as so

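def plot(ax, shift: float):
    # Same seed on every call, so A is identical in each row; B is A shifted along x.
    data = np.random.default_rng(0).normal(size=50)
    df = pd.DataFrame({"A": data, "B": data + shift}).melt()
    return (
        so.Plot(df, x="value", color="variable")
        # common_bins=False: each group gets its own bin edges.
        .add(so.Bars(), so.Hist(common_bins=False))
        .on(ax)
        .plot()
    )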

shifts = [4.5, 4, 3.9, 3.8]
fig, axes = plt.subplots(len(shifts), sharex=True, gridspec_kw={"hspace": 0})
for ax, shift in zip(axes, shifts):
    plot(ax, shift)
    ax.set(ylabel=f"{shift = }")

I traced it to this width calculation: https://github.com/mwaskom/seaborn/blob/b4e5f8d261d6d5524a00b7dd35e00a40e4855872/seaborn/_core/plot.py#L1453, which ends up running the following line on all groups at once: https://github.com/mwaskom/seaborn/blob/b4e5f8d261d6d5524a00b7dd35e00a40e4855872/seaborn/_core/scales.py#L467

For example, if the bin edges of the two groups are [0, 1, 2] and [0.5, 1.5, 2.5], it calculates the bin width from the pooled values [0, 0.5, 1, 1.5, ...] and finds a width of 0.5 instead of 1.
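The arithmetic can be reproduced outside seaborn with a rough sketch. The helper `min_spacing` below is hypothetical and only assumes that the width is taken as the smallest gap between sorted unique positions; it is not copied from the seaborn source.

```python
import numpy as np


def min_spacing(values):
    """Smallest gap between sorted unique values (hypothetical helper, not seaborn's code)."""
    unique = np.unique(values)  # np.unique returns sorted values
    if unique.size < 2:
        return float("nan")
    return float(np.min(np.diff(unique)))


# Bin edges from the example above, one array per group.
edges_a = np.array([0.0, 1.0, 2.0])
edges_b = np.array([0.5, 1.5, 2.5])

# Pooling both groups interleaves the edges and halves the apparent width.
pooled = min_spacing(np.concatenate([edges_a, edges_b]))

# Computing the spacing per group recovers the real bin width.
per_group = [min_spacing(edges_a), min_spacing(edges_b)]

print(pooled)     # 0.5
print(per_group)  # [1.0, 1.0]
```

If the width were computed per group like this, each bar would span its full bin, though whether that is the intended behavior for overlapping marks is the open question here.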

Maybe this is not a bug but something by design when there is overlap between marks?

In case it is a bug, I could contribute a fix, but would probably need some direction as to where to fix it.

Thanks!

juhabae commented 2 days ago

Can I take it?