Open aufildelanuit opened 1 year ago
My guess would be that this is a floating point issue: bars are stacked when they have exactly the same position and, due to floating point errors, that won't necessarily be the case for bars that look like they should cover the same range if the bins were computed independently.
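To illustrate the floating point guess above, here is a minimal sketch in plain Python. It is not seaborn's binning code, and the variable names are made up for the example. It computes the same bin edges in two different ways and shows that some of them are not exactly equal, which would be enough to defeat an exact-position match:

```python
import math

start, width, n_bins = 0.0, 0.1, 10

# Route 1: accumulate the bin width step by step.
edges_by_steps = []
edge = start
for _ in range(n_bins + 1):
    edges_by_steps.append(edge)
    edge += width

# Route 2: scale the bin width by the edge index.
edges_by_scaling = [start + i * width for i in range(n_bins + 1)]

# Indices of edges that should coincide but differ in their last bits.
mismatched = [
    i for i, (a, b) in enumerate(zip(edges_by_steps, edges_by_scaling)) if a != b
]

print(edges_by_steps[-1], edges_by_scaling[-1])
print(mismatched)
# A tolerance-based comparison still treats every pair of edges as equal.
print(all(math.isclose(a, b) for a, b in zip(edges_by_steps, edges_by_scaling)))
```

Comparing with a tolerance (math.isclose) treats the two sets of edges as the same, whereas == does not.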
You could always keep common_bins=True and then set the axis limits of the plot to focus on the data.
Thanks for the suggestion.
In fact, I also thought about playing with the axis limits, but I am a bit unfamiliar with setting limits for categorical data.
So far my workaround has been to perform the data split manually (create two copies of the original DataFrame with specific filters) and use two histplot calls within matplotlib subplots.
import matplotlib.pyplot as plt
import pandas
import seaborn as sns

def split_histplot(data, y, hue="responder", palette="flare", multiple="stack", split="split", legend_ncols=1):
    fig, ax = plt.subplots(ncols=2, figsize=(10, 10))
    # this should sort hue categories if originally unsorted, but should preserve already sorted categories...
    data[hue] = pandas.Categorical(data[hue], categories=list(pandas.Categorical(data[hue]).categories), ordered=True)
    # one histplot per level of the split column (only the first two levels are drawn)
    graph1 = sns.histplot(data=data[data[split]==pandas.Categorical(data[split]).categories[0]], y=y, multiple=multiple, hue=hue, palette=palette, alpha=0.7, common_bins=True, discrete=True, ax=ax[0], legend=False)
    graph2 = sns.histplot(data=data[data[split]==pandas.Categorical(data[split]).categories[1]], y=y, multiple=multiple, hue=hue, palette=palette, alpha=0.7, common_bins=True, discrete=True, ax=ax[1], legend=True)
    sns.despine(fig=fig, top=True, right=True, left=True, bottom=True)
    graph1.set(title=f"Rank: [{pandas.Categorical(data[split]).categories[0]}]")
    graph2.set(title=f"Rank: [{pandas.Categorical(data[split]).categories[1]}]")
    # legend_ncols was undefined in the original snippet; it is now a keyword argument
    sns.move_legend(graph2, "lower right", ncols=legend_ncols, bbox_to_anchor=(0.97, 0.03), frameon=True)
    return fig
I was just wondering if this behaviour could be fixed at seaborn's level.
One more thing I noticed, by the way, is that the overlapping I described seems to always appear at the same place in the histogram if the generating script is run several times.
In the MWE I provided, the overlapping area remains the same regardless of the sorting of the DataFrame, but a real-life example seemed to be sensitive to data sorting.
Would it be possible for this kind of "regularity" to be consistent with a floating point issue?
Floating point error is not stochastic, so just repeatedly running the script several times and seeing the same output would not be surprising.
FWIW while I can reproduce the issue when I run your code, the "MWE" remains much too complicated to play around with to try to identify any other hypotheses. If you could reduce it to a simpler example that reproduces the issue then perhaps it could be possible to dig further, but as is this presents as an extreme edge case.
Here is a much lighter MWE
import pandas
import seaborn as sns

data = pandas.DataFrame({
    'choice': ['C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D', 'E', 'E', 'E', 'E', 'F', 'F', 'F'],
    'responder': ['hx', 'ax', 'bx', 'cx', 'dx', 'ex', 'fx', 'gx', 'hx', 'ax', 'bx', 'cx', 'dx', 'ex', 'fx', 'gx', 'hx', 'ax'],
    'split': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
})
sns.displot(data=data, y="choice", col="split", multiple="stack", hue="responder", common_bins=False)
That code yields the following for me:
I don't know whether this is an extreme edge case or something more subtle, but if you intend to investigate it further, I'll be glad to help.
(Note that I removed sharey=False from the MWE as it is not strictly required to reproduce the issue, but the original point of using common_bins=False was to allow the two columns not to share the same y axis).
Not sure whether it helps, but I just noticed that if you replace the values for the y axis by numbers (replaced C by 30, D by 40, E by 50 and F by 60), this happens:
data = pandas.DataFrame({
    'choice': ([30]*6 + [40]*5 + [50]*4 + [60]*3),
    'responder': ['hx', 'ax', 'bx', 'cx', 'dx', 'ex', 'fx', 'gx', 'hx', 'ax', 'bx', 'cx', 'dx', 'ex', 'fx', 'gx', 'hx', 'ax'],
    'split': ([1]*(6+5) + [2]*(4+3)),
})
sns.displot(data=data, y="choice", col="split", multiple="stack", hue="responder", common_bins=False, facet_kws=dict(sharey=False))
60 still has the same overlapping in the right column, but bins in the left column are behaving strangely.
Thanks, this is much easier to work with. In fact it can be reduced even further:
data = pandas.DataFrame({
    'pos': ['A', 'A', 'A', 'B', 'B'],
    'grp': ['a', 'b', 'c', 'd', 'a'],
})
sns.displot(data=data, y="pos", multiple="stack", hue="grp", common_bins=False)
It's not surprising to see "weird" behavior with numeric data; the default binwidth is dependent on a calculation based on measures of variance which are indeed different for your datasets, and they're only expected to work well for data that are distributed at least sort of normally.
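As a rough illustration of how a data-dependent rule gives different bin widths for different subsets, here is a sketch that uses NumPy's "auto" rule as a stand-in. Whether it matches the rule seaborn applies in this case is an assumption, and the two arrays simply mirror the two columns of the numeric example above:

```python
import numpy as np

# Values falling in the left and right columns of the numeric example.
left = np.array([30.0] * 6 + [40.0] * 5)
right = np.array([50.0] * 4 + [60.0] * 3)

# Bin edges chosen independently for each subset.
left_edges = np.histogram_bin_edges(left, bins="auto")
right_edges = np.histogram_bin_edges(right, bins="auto")

left_width = left_edges[1] - left_edges[0]
right_width = right_edges[1] - right_edges[0]

print(left_width, right_width)
```

Both subsets span the same range of 10 units, yet the rule picks a different number of bins for each because the sample sizes differ.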
I see. I also figured out that there was a matter of binwidth after I checked the code in distributions.py and _oldcore.py a bit... The "y" variable is not taken into account in the groupby that generates sub_vars and sub_data for the histogram generation loop, so data with the same group (hue) and the same column are passed simultaneously to the estimator, which yields a single bin with a binwidth large enough to accommodate everything.
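The grouping described above can be imitated with plain pandas. This is only a sketch of the idea, not seaborn's actual code, and it reuses the reduced example from earlier in the thread:

```python
import pandas

data = pandas.DataFrame({
    "pos": ["A", "A", "A", "B", "B"],
    "grp": ["a", "b", "c", "d", "a"],
})

# Group on the hue variable only: "pos" is not a grouping key, so all
# positions belonging to one hue level end up in the same subset.
per_group = {
    grp: sub["pos"].tolist()
    for grp, sub in data.groupby("grp", sort=False)
}

print(per_group)
```

Group 'a' receives both positions at once, while 'b', 'c' and 'd' each receive a single one, so each subset covers a different range when bins are computed per subset.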
This being said, when the data for the y axis is non-numerical, the "y" value still seems to be internally seen as [0, 1, 2...] (if I am not mistaken). So I wonder if the calculation of binwidth or the generation of sub_data could have something to do with the overlapping problem... However, not everything having the same "hue" and a different "y" overlapped in my very first example, so this might as well just be misleading.
Another possible pattern could be that overlaps seem to appear when the data from two groups are stacked on the same bin, but the groups are not adjacent ones (e.g. in your reduced example, when something from group 'a' is stacked together with something from group 'd' and there is no element from groups 'b' and 'c' in that same bin).
It is also interesting to see that it can happen without having to split the histogram into two columns.
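The non-adjacency pattern suggested in this comment can be checked on a dataset with a small helper. This is a sketch for testing the hypothesis, not an explanation of the bug. The has_gap function is hypothetical, and the hue order is assumed to be the sorted column order that pandas.crosstab produces:

```python
import pandas

data = pandas.DataFrame({
    "pos": ["A", "A", "A", "B", "B"],
    "grp": ["a", "b", "c", "d", "a"],
})

# Rows are bins ("pos"), columns are hue levels ("grp"), cells are counts.
counts = pandas.crosstab(data["pos"], data["grp"])

def has_gap(row):
    """Return True if the hue levels present in a bin are not contiguous."""
    present = [i for i, n in enumerate(row) if n > 0]
    return (present[-1] - present[0] + 1) > len(present)

# Flag every bin in which non-adjacent hue levels are stacked together.
gaps = {pos: has_gap(row) for pos, row in counts.iterrows()}

print(gaps)
```

In the reduced example only bin 'B' is flagged, since it holds groups 'a' and 'd' with nothing from 'b' or 'c' in between.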
I have been trying to split what would have been a very long histplot into several columns using displot. When all columns share the same y axis (which, in this use case, is not really suitable) or when common_bins=True, all bins are properly generated. However, passing the parameter common_bins=False (to make sure each column will not display the entire y axis) results in what seems to be a glitch in the drawing of the bins, with some colors overlapping others for reasons that I fail to understand.
Here is a MWE (a bit long, sorry):
In the first graph generated with common_bins=True, the coloring seems fine, but sharey=False is ineffective.
In the second graph generated with common_bins=False, sharey=False is taken into account, but a strange overlap appears for the bins of "D" and "F".