mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License
12.57k stars 1.93k forks

Mean and Standard Deviation is Different on Point Plot Log Scale #3661

Open gil2rok opened 7 months ago

gil2rok commented 7 months ago

Problem Description: I have multiple measurements of some cost. Most values are quite small, but there are some enormous outliers: my mean is $1.2$, my standard deviation is just under $10$, and my median is $0.003$.

When I plot the mean and std with a point-plot, my error bar correctly ranges from approximately $(-10, 10)$ with a mean of $1$.

image

But when I use the log scale, the standard deviation and mean shift!

The mean is located near $10^{-2} = 0.01$ instead of $1$. The standard deviation error bars range from $(10^{-3}, 10^{-1}) = (0.001, 0.1)$ instead of $(-10, 10)$.

image

Question: Why is this? Are these statistics computed differently in log space? A big red flag is that the standard deviation error bars are symmetric in log space. Another red flag is that the error bars no longer go past zero when they absolutely should.

Code: Here is the code I used to generate these two plots. The only difference is toggling the log_scale parameter between False and True.

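import numpy as np
import seaborn as sns

# `tmp` is the DataFrame of measurements (not shown in this post),
# with the columns "sampler_type" and "cost".

# Linear y-axis: mean and sd are computed on the raw costs.
fig = sns.catplot(
    data=tmp,
    kind="point",
    x="sampler_type",
    y="cost",
    estimator=np.mean,
    errorbar="sd",
    aspect=1.5,
    log_scale=False,
)

# Log y-axis: the only change is log_scale=True.
fig = sns.catplot(
    data=tmp,
    kind="point",
    x="sampler_type",
    y="cost",
    estimator=np.mean,
    errorbar="sd",
    aspect=1.5,
    log_scale=True,
)

mwaskom commented 7 months ago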

Hi, yes the statistics are computed in log space when you have log_scale=True.

gil2rok commented 7 months ago

Thank you so much for the fast response. And I love the seaborn library!

How precisely does this change the computation? Can you please point me to the file where this is done?

I'm struggling to understand mathematically what is different when computing mean and std in log space.

In particular, I am not sure why the mean would change. I am actually measuring the squared cost so all my data lies on $[0, \infty)$. I have no negative values that would mess up the log computation, as far as I can tell.

mwaskom commented 7 months ago

Probably the best way to think about it is that you should get the same result as if you passed seaborn the log of your data and then modified the tick labels. Your error bars are symmetric around the mean because they are being drawn from mean(y) - sd(y) to mean(y) + sd(y).
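The equivalence described above can be checked by hand with NumPy alone. This is a minimal sketch rather than seaborn's actual code path: the `cost` array is made-up data, and the use of base-10 logs and the sample standard deviation (`ddof=1`) are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical data: mostly tiny costs plus one enormous outlier.
cost = np.array([0.001, 0.002, 0.003, 0.004, 0.005, 100.0])

# Linear-space summary: mean and sd of the raw values.
lin_mean = cost.mean()
lin_sd = cost.std(ddof=1)  # ddof=1 is an assumption for this sketch

# Log-space summary: transform first, then take mean and sd.
log_cost = np.log10(cost)
log_mean = log_cost.mean()
log_sd = log_cost.std(ddof=1)

# Map the log-space summary back to data units, as a log axis would show it.
center = 10 ** log_mean
lower = 10 ** (log_mean - log_sd)
upper = 10 ** (log_mean + log_sd)

print(f"linear: mean={lin_mean:.3g}, bar=({lin_mean - lin_sd:.3g}, {lin_mean + lin_sd:.3g})")
print(f"log:    center={center:.3g}, bar=({lower:.3g}, {upper:.3g})")
```

The back-transformed bar is always strictly positive and symmetric around the center as a ratio (`upper / center == center / lower`), which matches the two "red flags" raised in the question.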

gil2rok commented 7 months ago

Some want to first compute summary statistics and then transform them to the log scale.

Others want to first transform data to the log scale and then compute summary statistics. Seaborn appears to do the latter.

> Probably the best way to think about it is that you should get the same result as if you passed seaborn the log of your data and then modified the tick labels. Your error bars are symmetric around the mean because they are being drawn from mean(y) - sd(y) to mean(y) + sd(y).

In your example, $y = \log(x)$ for some original data $x$: we first transform $x$ to the log scale and then compute the mean and std of $y$.

If one were interested in the former, should they plot without the log scale parameter and afterwards manually set the axis to be logarithmic?

Potentially relevant Stack Exchange post here.
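For strictly positive data $x_1, \dots, x_n$, the two orders of operations described in this comment can be written out as follows. This is only a sketch; the symbols $\bar{x}$, $\bar{y}$ and $s_y$ are notation introduced here for illustration, and base-10 logs are assumed.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} \log_{10} x_i,
\qquad
10^{\bar{y}} = \Big(\prod_{i=1}^{n} x_i\Big)^{1/n} \le \bar{x}.
$$

Computing statistics first and transforming afterwards places the point at $\log_{10}\bar{x}$. Transforming first places it at $\bar{y}$, which corresponds to the geometric mean of the data and never exceeds the arithmetic mean. Likewise, the interval $\bar{y} \pm s_y$, where $s_y$ is the standard deviation of the $\log_{10} x_i$, maps back to $(10^{\bar{y}} / 10^{s_y},\; 10^{\bar{y}} \cdot 10^{s_y})$, which is multiplicative and always positive. Any $x_i = 0$ makes $\log_{10} x_i$ undefined, so both expressions assume $x_i > 0$.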

gil2rok commented 7 months ago

Lastly, it may be helpful for this to appear somewhere in the docs. It was quite tricky for me to understand and I may not be the only one.

Perhaps on the tutorial page for statistical estimation and error bars here? I would consider making a pull request if you're interested. Need to confirm I have time for it though.

mwaskom commented 7 months ago

> If one were interested in the former, should they plot without the log scale parameter and afterwards manually set the axis to be logarithmic?

Yes
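A minimal sketch of that workflow, assuming a made-up DataFrame in place of the `tmp` data from the question (its values are random and purely illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the `tmp` DataFrame from the question.
rng = np.random.default_rng(0)
tmp = pd.DataFrame({
    "sampler_type": np.repeat(["a", "b"], 50),
    "cost": rng.lognormal(mean=0.0, sigma=0.5, size=100),
})

# No log_scale argument: mean and sd are computed on the raw costs.
g = sns.catplot(
    data=tmp,
    kind="point",
    x="sampler_type",
    y="cost",
    estimator=np.mean,
    errorbar="sd",
    aspect=1.5,
)

# Only the axis is made logarithmic; the plotted statistics are unchanged.
g.set(yscale="log")
```

Note that a log axis cannot show values at or below zero, so when mean - sd is negative (as with a mean of $1.2$ and a standard deviation near $10$) the lower part of the error bar cannot be drawn this way.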

gil2rok commented 7 months ago

> Lastly, it may be helpful for this to appear somewhere in the docs. It was quite tricky for me to understand and I may not be the only one.
>
> Perhaps on the tutorial page for statistical estimation and error bars here? I would consider making a pull request if you're interested. Need to confirm I have time for it though.

@mwaskom Just wanted to bump this in case you didn't see. If you're not interested, no worries!

mwaskom commented 7 months ago

I could have sworn the docs already said that somewhere, maybe just in the seaborn.objects documentation though. This is a very general thing in seaborn: statistics are computed in the transformed space, so it also applies to e.g. boxplots, KDEs, histograms, etc.

RagnarGrootKoerkamp commented 4 months ago

I just updated from 0.12 to 0.13, and the mean in the boxplot below is now computed in the transformed domain, whereas before it was the linear domain.

import matplotlib.pyplot as plt
import seaborn as sns
plt.yscale("log") # mean in log domain
sns.boxplot(
    x=[0, 0, 0, 0, 0],
    y=[1, 2, 3, 4, 5],
    showmeans=True,
)
# plt.yscale("log") # mean in linear domain
plt.show()

Putting the plt.yscale("log") after the sns.boxplot(..) preserves the original behaviour. Is this documented somewhere? I see the release notes present this as an 'enhancement', but it seems inconsistent to me. At least I was not aware changing the scale does not commute with other operations.

When doing plt.yscale('log') before the sns.boxplot(..), the new log_scale=True/False parameter doesn't seem to have any effect on the output either.
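For the five values in the example above, the two candidate positions of the mean marker can be computed directly. This sketch assumes, as described in this comment, that the "log domain" mean is the mean of the log-transformed values mapped back to data units; it does not call seaborn or inspect what the boxplot actually draws.

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5], dtype=float)

# Mean in the linear domain: the ordinary arithmetic mean.
linear_mean = y.mean()

# Mean in the log domain, mapped back to data units: the geometric mean.
log_domain_mean = 10 ** np.log10(y).mean()

print(f"linear-domain mean: {linear_mean:.3f}")
print(f"log-domain mean:    {log_domain_mean:.3f}")
```

The arithmetic mean is 3, while the log-domain value equals the fifth root of 120 (about 2.6), so the marker sits visibly lower when the mean is computed after the transform.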