mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License

Incorrect plotting of exactly overlapping scatter with `hue` and `hue_order` #3728

Open eloyvallinaes opened 4 months ago

eloyvallinaes commented 4 months ago

While working with sns.scatterplot for representing locations on a grid, I discovered an issue where using hue and hue_order produces an incorrect plot: markers that should be perfectly overlapping—they have identical (x, y) coordinates—are drawn at a small offset, such that the edge of one can be seen intersecting the other. Here's a minimal example that reproduces the issue with matplotlib 3.9.1 and seaborn 0.13.2:

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

df = pd.DataFrame.from_dict({
    'x': [6.3, 6.3, 6.3, 6.3, 6.633333, 6.633333, 6.633333, 6.633333, 33.48, 33.48, 33.48, 33.48, 33.813333, 33.813333, 33.813333, 33.813333],
    'y': [-12.42, -12.42, -4.0, -4.0, -12.42, -12.42, -4.0, -4.0, -12.42, -12.42, -4.0, -4.0, -12.42, -12.42, -4.0, -4.0],
    'locid': ['loc1', 'loc1', 'loc1', 'loc1', 'loc2', 'loc2', 'loc2', 'loc2', 'loc1', 'loc1', 'loc1', 'loc1', 'loc2', 'loc2', 'loc2', 'loc2']
})

sns.scatterplot(
    data=df,
    x='x',
    y='y',
    marker="o",
    hue='locid',
    hue_order=['loc1'],
)
print('Pandas version: ', pd.__version__)  # 2.2.2
print('Matplotlib version: ', matplotlib.__version__)  # 3.9.1
print('Seaborn version: ', sns.__version__)  # 0.13.2

That code produces the following plot (image: bugPlot), where at each corner the edge of the second marker is clearly seen to intersect the face of the first.

From my brief dive into this problem:

  1. As in the example, a tall stack of overlapping markers isn't needed: there are only two points at the exact (6.3, -12.42) coordinates and the problem is already there.
  2. The issue is seaborn-specific. Using matplotlib's plt.scatter does yield a correct plot.
  3. Both hue and hue_order need to be used in order for the issue to appear. Slicing the data with df[df.locid == 'loc1'] makes a correct plot.
  4. The problem persists even with marker='.', marker='s', marker='v' and marker='d', but not with marker='x'.
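For reference, the slicing workaround from item 3 can be sketched as below. This is a reduced version of the data above (one corner only), and the `Agg` backend and the `n_drawn` variable are only there so the sketch runs headless and can be checked; they are not part of the original report.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, an assumption so this runs without a display
import pandas as pd
import seaborn as sns

# Reduced data: two exactly overlapping points per location.
df = pd.DataFrame({
    "x": [6.3, 6.3, 6.633333, 6.633333],
    "y": [-12.42, -12.42, -12.42, -12.42],
    "locid": ["loc1", "loc1", "loc2", "loc2"],
})

# Filter the rows before plotting instead of relying on hue_order.
subset = df[df.locid == "loc1"]
ax = sns.scatterplot(data=subset, x="x", y="y", marker="o", hue="locid")

# The first collection on the axes holds the scatter markers.
n_drawn = len(ax.collections[0].get_offsets())
print("markers drawn:", n_drawn)
```

With the rows filtered up front, only the loc1 points ever reach the axes.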

mwaskom commented 4 months ago

I don't think anything is wrong with the position things are plotted in here. Rather, using hue_order in scatterplot doesn't suppress the datapoints from the plot, but it does cause the facecolor to be null. So you're seeing the edges from the loc2 points. I could have sworn there was an issue about this already but couldn't find it quickly.

mwaskom commented 4 months ago

Oh here it is https://github.com/mwaskom/seaborn/issues/3601

eloyvallinaes commented 4 months ago

Ah! You're right of course! 😄 Thanks!

I took a look at relational._ScatterPlotter.plot and thought some logic could be added to fix this problem, so here's a pull request #3730. I assumed the intended behaviour is to have transparent edges whenever hue_order has made a face transparent, while preserving whatever edgecolor (white is default) was passed to the plot method.

It's a bit of a patch but it covers all the use cases I could think of.