mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License

sns.lineplot produces confusing output #3765

Closed (jtmbeta closed this issue 3 weeks ago)

jtmbeta commented 3 weeks ago

pipr.csv ss.csv

The following code produces lineplots with garbled confidence intervals and average traces despite no obvious issues with the data. Seaborn version 0.13.2, matplotlib 3.8.3.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pipr = pd.read_csv('pipr.csv')
ss = pd.read_csv('ss.csv')

fig, axs = plt.subplots(1, 2, figsize=(12,4),layout='tight')
sns.lineplot(
    data=pipr,
    x="time",
    y="pc_pupil",
    hue='condition',
    errorbar='ci',
    palette={'red': 'tab:red', 'blue': 'tab:blue'},
    ax=axs[0]
)

sns.lineplot(
    data=ss,
    x="time",
    y="pc_pupil",
    hue='condition',
    errorbar='ci',
    palette={'lms': 'tab:green', 'mel': 'tab:blue'},
    ax=axs[1]
)
for ax in axs:
    ax.set(xlabel="Time (s)", ylabel="Pupil size (%-change)")
    ax.fill_between(
        (0, 3), min(ax.get_ylim()), max(ax.get_ylim()), alpha=0.2, color="k"
    )
    ax.grid()
![Figure_2](https://github.com/user-attachments/assets/b0eafe42-d886-4c6c-be77-db9af65106f1)


mwaskom commented 3 weeks ago

Hi, not being familiar with your data, it's not obvious to me what is "wrong" here, or why it might be a seaborn problem.

jtmbeta commented 3 weeks ago

If I change the code to:

fig, axs = plt.subplots(1, 2, figsize=(12,4),layout='tight')
sns.lineplot(
    data=pipr,
    x="time",
    y="pc_pupil",
    hue='condition',
    errorbar='se',
    palette={'red': 'tab:red', 'blue': 'tab:blue'},
    ax=axs[0],
    units='subject',
    estimator=None
)

sns.lineplot(
    data=ss,
    x="time",
    y="pc_pupil",
    hue='condition',
    errorbar='se',
    palette={'lms': 'tab:green', 'mel': 'tab:blue'},
    ax=axs[1],
    units='subject',
    estimator=None
)
for ax in axs:
    ax.set(xlabel="Time (s)", ylabel="Pupil size (%-change)")
    ax.fill_between(
        (0, 3), min(ax.get_ylim()), max(ax.get_ylim()), alpha=0.2, color="k"
    )
    ax.grid()

We can now see the underlying per-subject traces in the data. Given these, I would expect the first version to draw a smooth average trace with a shaded region for the confidence interval, but the result in the previous figure is clearly wrong!

(image: plot of the individual subject traces)

mwaskom commented 3 weeks ago

Your subjects do not have identical time values:

pipr.query("condition == 'blue'").pivot(index="time", columns="subject", values="pc_pupil")
subject           3005        3011        3012
time                                          
-1.000000   100.882803   99.998503  101.361294
-0.979978          NaN  100.059489  101.272379
-0.979978   100.883722         NaN         NaN
-0.959956          NaN  100.116923  101.190397
-0.959956   100.880034         NaN         NaN
...                ...         ...         ...
 16.959956         NaN   95.398760   90.926791
 16.959956   88.411231         NaN         NaN
 16.979978   88.535902         NaN         NaN
 16.979978         NaN   95.449669   91.013283
 17.000000   88.667938   95.502635   91.100839
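
One way to confirm this kind of mismatch without reading the pivot table is to count how many subjects contribute at each exact time value. The sketch below uses a made-up frame `df` (two hypothetical subjects whose time stamps differ by 1e-9) rather than the attached CSVs, so the numbers are illustrative only:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the issue's data (hypothetical values): two subjects
# whose time stamps print identically at six decimals but are not equal.
t = np.linspace(-1, 17, 900)
df = pd.concat(
    [
        pd.DataFrame({"subject": "a", "time": t, "pc_pupil": 100.0}),
        pd.DataFrame({"subject": "b", "time": t + 1e-9, "pc_pupil": 101.0}),
    ],
    ignore_index=True,
)

# Number of distinct subjects observed at each exact time value.
subjects_per_time = df.groupby("time")["subject"].nunique()
n_subjects = df["subject"].nunique()

# Fraction of time values shared by every subject. Near 1.0 means there is
# something to average at each x; near 0.0 means almost every x value
# belongs to a single subject.
shared = (subjects_per_time == n_subjects).mean()
print(f"{shared:.1%} of time values are shared by all {n_subjects} subjects")
```

If the shared fraction is low, most x positions hold a single observation, which would be consistent with the jagged mean line in the first figure.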

sns.lineplot only aggregates across identical values of the x variable. If your data need some binning / interpolation, you'll have to do it yourself as a preprocessing step.
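
A minimal sketch of such a preprocessing step, assuming linear interpolation onto a shared grid is acceptable for the signal. The helper `resample_to_common_grid`, the grid of 900 points on [-1, 17], and the synthetic `demo` frame are all assumptions for illustration; they are not part of seaborn and are not derived from the attached files:

```python
import numpy as np
import pandas as pd


def resample_to_common_grid(df, grid, x="time", y="pc_pupil",
                            groups=("condition", "subject")):
    """Linearly interpolate each group's trace onto the shared `grid`.

    Hypothetical helper. np.interp holds the first/last value constant
    for grid points outside a trace's own time range.
    """
    pieces = []
    for keys, trace in df.groupby(list(groups)):
        if not isinstance(keys, tuple):
            keys = (keys,)
        trace = trace.sort_values(x)  # np.interp expects increasing x
        resampled = pd.DataFrame(
            {x: grid, y: np.interp(grid, trace[x], trace[y])}
        )
        for name, value in zip(groups, keys):
            resampled[name] = value
        pieces.append(resampled)
    return pd.concat(pieces, ignore_index=True)


# Synthetic data: three subjects with slightly offset time stamps.
t = np.linspace(-1, 17, 900)
demo = pd.concat(
    [
        pd.DataFrame({
            "subject": f"s{i}",
            "condition": "blue",
            "time": t + i * 1e-9,          # tiny per-subject offset
            "pc_pupil": 100 - i + np.sin(t),
        })
        for i in range(3)
    ],
    ignore_index=True,
)

grid = np.linspace(-1, 17, 900)            # assumed common time base
aligned = resample_to_common_grid(demo, grid)

# Every grid point now has one value per subject, so there is something
# to average at each x.
counts = aligned.groupby("time")["subject"].nunique()
```

The aligned frame can then be passed to `sns.lineplot(data=aligned, x="time", y="pc_pupil", hue="condition", errorbar="ci")` as in the original call. If the time stamps differ only by floating-point noise, rounding the column (for example `df["time"].round(4)`) may be a simpler alternative, though whether that is enough depends on the data.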

jtmbeta commented 3 weeks ago

Ahh, got it. Thank you!