mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License

Displot histogram overlapping despite multiple="stack" when common_bins=False #3232

Open aufildelanuit opened 1 year ago

aufildelanuit commented 1 year ago

I have been trying to split what would have been a very long histplot into several columns using displot. When all columns share the same y axis (which, in this use case, is not really suitable) or when common_bins=True, all bins are properly generated. However, passing the parameter common_bins=False (to make sure each column will not display the entire y axis) results in what seems to be a glitch in the drawing of the bins, with some colors overlapping others for reasons that I fail to understand.

Here is a MWE (a bit long, sorry):

import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
import pandas
import numpy

matplotlib.use("webagg")

### creating some data simulating a survey (responders and choices)
choices_list = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
responders_list = ['ax', 'bx', 'cx', 'dx', 'ex', 'fx', 'gx', 'hx']

### creating DataFrame
choices = list(numpy.hstack([numpy.repeat(choices_list[i], len(choices_list)-i) for i in range(0, len(choices_list))]))
responders = list(numpy.tile(responders_list, 5))[0:len(choices)]
data = pandas.DataFrame({"choice": choices, "responder": responders})

### getting choices in order of frequency
choices_freq_sorted = list(data["choice"].value_counts().index)
data["choice"] = pandas.Categorical(data["choice"], categories=choices_freq_sorted, ordered=True)

### also putting responders in order
data["responder"] = pandas.Categorical(data["responder"], categories=responders_list, ordered=True)

### Adding an indicator for splitting the data into two columns based on frequency count
data["split"] = numpy.nan
split_group_size = int(numpy.ceil(len(data["choice"].unique())/2)) 
for i in range(1, data["choice"].count()-1, split_group_size):
    data.loc[data["choice"].isin(choices_freq_sorted[i-1:i+split_group_size]), "split"] = f"{i}-{i+split_group_size-1}"
# end of for loop
data["split"] = pandas.Categorical(data["split"])

### graph
graph1 = sns.displot(data=data, y="choice", col="split", col_wrap=2, height=8, aspect=0.5, multiple="stack", hue="responder", palette=sns.color_palette("hls", 8), alpha=0.6, common_bins=True, discrete=True, facet_kws=dict(sharey=False, sharex=True))
graph1.fig.tight_layout(rect=[0, 0, 1, 0.95])
graph1.fig.suptitle("common_bins=True", size=14, y=0.97)

graph2 = sns.displot(data=data, y="choice", col="split", col_wrap=2, height=4, aspect=0.9, multiple="stack", hue="responder", palette=sns.color_palette("hls", 8), alpha=0.6, common_bins=False, discrete=True, facet_kws=dict(sharey=False, sharex=True))
graph2.fig.tight_layout(rect=[0, 0, 1, 0.91])
graph2.fig.suptitle("common_bins=False", size=14, y=0.95)

plt.show()

In the first graph, generated with common_bins=True, the coloring seems fine, but sharey=False is ineffective.

mwe_graph1

In the second graph generated with common_bins=False, sharey=False is taken into account, but a strange overlap appears for the bins of "D" and "F".

mwe_graph2

mwaskom commented 1 year ago

My guess would be that this is a floating point issue: bars are stacked when they have exactly the same position and, due to floating point errors, that won't necessarily be the case for bars that look like they should cover the same range if the bins were computed independently.
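To make the hypothesis concrete, here is a minimal sketch in plain Python (not seaborn's code) of how two bin edges that should coincide can differ when they are computed from different ranges. The helper `bin_edges` is hypothetical and only mimics evenly spaced edges; whether this mechanism is what actually happens in the issue is not established here.

```python
import math

def bin_edges(lo, hi, n):
    """Hypothetical helper: n evenly spaced bins between lo and hi."""
    return [lo + (hi - lo) * i / n for i in range(n + 1)]

# Two subsets binned independently; both "should" have an edge at 0.1.
edges_wide = bin_edges(0.0, 0.3, 3)    # edge 1 is computed as 0.3 / 3
edges_narrow = bin_edges(0.0, 0.1, 1)  # edge 1 is computed as 0.1 / 1

same_exactly = edges_wide[1] == edges_narrow[1]
same_approx = math.isclose(edges_wide[1], edges_narrow[1])

print(edges_wide[1], edges_narrow[1])
print("exactly equal:", same_exactly, "| approximately equal:", same_approx)
```

Any logic that stacks bars only when their positions compare exactly equal would treat these two edges as different, even though they are indistinguishable on screen.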

You could always keep common_bins=True and then set the axis limits of the plot to focus on the data.

aufildelanuit commented 1 year ago

Thanks for the suggestion.

In fact, I also thought about playing with the axis limits, but I am a bit unfamiliar with setting limits for categorical data.
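For reference, a minimal sketch of how limits for a categorical axis could be computed, assuming the categories sit at integer positions 0, 1, 2, ... in category order (an assumption about how the plot places them, not something confirmed in this thread). The helper `category_limits` is hypothetical.

```python
def category_limits(categories, visible):
    """Return (low, high) limits that frame only the `visible` categories.

    Assumes each category is drawn at its integer index in `categories`
    and that each bar spans half a unit on either side of that index.
    """
    positions = [categories.index(c) for c in visible]
    return min(positions) - 0.5, max(positions) + 0.5

categories = ["A", "B", "C", "D", "E", "F", "G", "H"]
left_limits = category_limits(categories, ["A", "B", "C", "D"])
right_limits = category_limits(categories, ["E", "F", "G", "H"])

print(left_limits, right_limits)
```

With a grid that does not share its y axis, each pair could then be passed to the matching axes, for example `ax.set_ylim(*left_limits)`. Passing the pair in reverse order flips the axis direction, which matters if the first category is expected at the top.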

So far my workaround has been to perform the data split manually (create two copies of the original DataFrame with specific filters) and use two histplot within a matplotlib subplots.

def split_histplot(data, y, hue="responder", palette="flare", multiple="stack", split="split", legend_ncols=1):
    fig, ax = plt.subplots(ncols=2, figsize=(10, 10))

    ### this should sort hue categories if originally unsorted, but should preserve already sorted categories...
    data[hue] = pandas.Categorical(data[hue], categories=list(pandas.Categorical(data[hue]).categories), ordered=True)

    graph1 = sns.histplot(data=data[data[split]==pandas.Categorical(data[split]).categories[0]], y=y, multiple=multiple, hue=hue, palette=palette, alpha=0.7, common_bins=True, discrete=True, ax=ax[0], legend=False)
    graph2 = sns.histplot(data=data[data[split]==pandas.Categorical(data[split]).categories[1]], y=y, multiple=multiple, hue=hue, palette=palette, alpha=0.7, common_bins=True, discrete=True, ax=ax[1], legend=True)

    sns.despine(fig=fig, top=True, right=True, left=True, bottom=True)

    graph1.set(title=f"Rank: [{pandas.Categorical(data[split]).categories[0]}]")
    graph2.set(title=f"Rank: [{pandas.Categorical(data[split]).categories[1]}]")

    sns.move_legend(graph2, "lower right", ncols=legend_ncols, bbox_to_anchor=(0.97, 0.03), frameon=True)

    return fig

I was just wondering if this behaviour could be fixed at seaborn's level.

aufildelanuit commented 1 year ago

One more thing I noticed, by the way, is that the overlapping I described seems to always appear at the same place in the histogram if the generating script is run several times.

In the MWE I provided, the overlapping area remains the same regardless of the sorting of the DataFrame, but a real-life example seemed to be sensitive to data sorting.

Would it be possible for this kind of "regularity" to be consistent with a floating point issue?

mwaskom commented 1 year ago

Floating point error is not stochastic, so running the script several times and seeing the same output would not be surprising.
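A small plain-Python sketch of both observations: the same floating point computation gives the same result on every run, while changing the order of the operands can change the result. It illustrates the general behaviour only and does not use seaborn.

```python
values = [0.1, 0.2, 0.3]

# Same operands, same order: the result is identical every time.
first_run = sum(values)
second_run = sum(values)

# Same operands, different order: rounding happens at different steps.
reordered = sum(reversed(values))

print(first_run == second_run)  # deterministic
print(first_run == reordered)   # order-sensitive
```

This would be consistent with an artifact that always appears in the same place for a fixed dataset but moves when the rows are sorted differently.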

mwaskom commented 1 year ago

FWIW, while I can reproduce the issue when I run your code, the "MWE" remains much too complicated to play around with to try to identify any other hypotheses. If you could reduce it to a simpler example that reproduces the issue, then perhaps it would be possible to dig further, but as is, this presents as an extreme edge case.

aufildelanuit commented 1 year ago

Here is a much lighter MWE

data = pandas.DataFrame({
    'choice': ['C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D', 'E', 'E', 'E', 'E', 'F', 'F', 'F'],
    'responder': ['hx', 'ax', 'bx', 'cx', 'dx', 'ex', 'fx', 'gx', 'hx', 'ax', 'bx', 'cx', 'dx', 'ex', 'fx', 'gx', 'hx', 'ax'],
    'split': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
})

sns.displot(data=data, y="choice", col="split", multiple="stack", hue="responder", common_bins=False)

That code yields the following for me:

image

I don't know whether this is an extreme edge case or something more subtle, but if you intend to investigate it further, I'll be glad to help.

(Note that I removed sharey=False from the MWE as this is not absolutely required to reproduce the issue, but the point in using common_bins=False is originally to allow the two columns not to share the same y axis).

aufildelanuit commented 1 year ago

Not sure whether it helps, but I just noticed that if you replace the values for the y axis by numbers (replaced C by 30, D by 40, E by 50 and F by 60), this happens:

data = pandas.DataFrame({
    'choice': ([30]*6 + [40]*5 + [50]*4 + [60]*3),
    'responder': ['hx', 'ax', 'bx', 'cx', 'dx', 'ex', 'fx', 'gx', 'hx', 'ax', 'bx', 'cx', 'dx', 'ex', 'fx', 'gx', 'hx', 'ax'],
    'split': ([1]*(6+5) + [2]*(4+3)),
})

sns.displot(data=data, y="choice", col="split", multiple="stack", hue="responder", common_bins=False, facet_kws=dict(sharey=False))

image

60 still has the same overlapping in the right column, but bins in the left column are behaving strangely.

mwaskom commented 1 year ago

Thanks, this is much easier to work with. In fact it can be reduced even further:

data = pandas.DataFrame({
    'pos': ['A', 'A', 'A', 'B', 'B'],
    'grp': ['a', 'b', 'c', 'd', 'a'],
})

sns.displot(data=data, y="pos", multiple="stack", hue="grp", common_bins=False)

It's not surprising to see "weird" behavior with numeric data; the default binwidth is dependent on a calculation based on measures of variance which are indeed different for your datasets, and they're only expected to work well for data that are distributed at least sort of normally.
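As a rough illustration, the sketch below applies NumPy's "auto" rule (which, as far as I understand, is what seaborn's default bins="auto" is passed to via numpy.histogram_bin_edges) to the two numeric column subsets from the example above. It ignores the additional per-hue grouping that seaborn performs, so the widths it prints are not the ones seaborn would use; it only shows that the rule yields different widths for different data. The helper `auto_binwidth` is hypothetical.

```python
import numpy as np

# The numeric "choice" values from each column of the example above.
left = np.array([30] * 6 + [40] * 5, dtype=float)
right = np.array([50] * 4 + [60] * 3, dtype=float)

def auto_binwidth(values):
    """Width of the first bin chosen by NumPy's 'auto' reference rule."""
    edges = np.histogram_bin_edges(values, bins="auto")
    return float(edges[1] - edges[0])

left_width = auto_binwidth(left)
right_width = auto_binwidth(right)

print(left_width, right_width)
```

Passing an explicit binwidth or discrete=True to the plotting call avoids relying on this data-dependent default.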

aufildelanuit commented 1 year ago

I see. I also figured out that there was a matter of binwidth after I checked the code in distributions.py and _oldcore.py a bit... The "y" variable is not taken into account in the groupby that generates sub_vars and sub_data for the histogram generation loop, so data with the same group (hue) and the same column are passed simultaneously to the estimator, which yields a single bin with a binwidth large enough to accommodate everything.
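A simplified sketch of that grouping, in plain Python rather than seaborn's actual code: rows are grouped by hue only, so a single group can contain several distinct "y" values that are then binned together. The data are taken from the reduced example above.

```python
from collections import defaultdict

# (pos, grp) pairs from the reduced example
rows = [("A", "a"), ("A", "b"), ("A", "c"), ("B", "d"), ("B", "a")]

# Group by hue only; the position variable is not part of the key.
groups = defaultdict(list)
for pos, grp in rows:
    groups[grp].append(pos)

for grp, positions in groups.items():
    print(grp, positions)
```

Group 'a' ends up holding both 'A' and 'B', so its bins are computed over a wider range than those of groups that hold a single position.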

This being said, when the data for the y axis is non-numerical, the "y" value still seems to be internally seen as [0, 1, 2...] (if I am not mistaken). So I wonder if the calculation of binwidth or the generation of sub_data could have something to do with the overlapping problem... However, not everything having the same "hue" and different "y" overlapped in my very first example, so this might as well just be misleading.

Another possible pattern could be that overlaps seem to appear when the data from two groups are stacked on the same bin, but the groups are not adjacent ones (e.g. in your reduced example, when something from group 'a' is stacked together with something from group 'd' and there is no element from groups 'b' and 'c' in that same bin).
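For comparison, here is a sketch of what stacking would be expected to produce for the reduced example, written in plain Python under the assumption that groups are stacked in hue order and that absent groups contribute zero height. The function `stack` is hypothetical and is not seaborn's implementation; the actual stacking order in seaborn may differ.

```python
hue_order = ["a", "b", "c", "d"]
counts = {
    "A": {"a": 1, "b": 1, "c": 1},
    "B": {"d": 1, "a": 1},  # 'b' and 'c' are absent from this bin
}

def stack(counts_for_bin, hue_order):
    """Return {group: (start, end)} segments stacked end to end."""
    segments = {}
    baseline = 0
    for grp in hue_order:
        height = counts_for_bin.get(grp, 0)
        if height:
            segments[grp] = (baseline, baseline + height)
        baseline += height
    return segments

stacked_a = stack(counts["A"], hue_order)
stacked_b = stack(counts["B"], hue_order)

print(stacked_a)
print(stacked_b)
```

In bin 'B', group 'd' should sit directly on top of group 'a' even though 'b' and 'c' are missing, so any overlap between the two segments indicates that the baseline of 'd' was not advanced by the height of 'a'.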

It is also interesting to see that it can happen without having to split the histogram into two columns.