has2k1 / plotnine

A Grammar of Graphics for Python
https://plotnine.org
MIT License

quantiles drawn by geom_violin's draw_quantiles option are incorrect #847

Open glhr opened 1 month ago

glhr commented 1 month ago

The quantiles drawn by geom_violin are incorrect, e.g. the 50% quantile does not correspond to the median. A simple example, where a boxplot is overlaid to show the expected position of the 25%, 50% and 75% quantiles:

from plotnine import *
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "y": np.random.gamma(1,2,10),
    "x": ["a"]*10
})
plt = (
    ggplot(df, aes(x="x", y="y")) +
    geom_violin(draw_quantiles=[0.25,0.5,0.75]) +
    geom_boxplot(alpha=0.5,width=0.1,fill="grey")
)
plt.show()

Plotting the mean and median for comparison:

plt = (
    ggplot(df, aes(x="x", y="y")) +
    geom_violin(draw_quantiles=0.5) +
    geom_hline(data=df.groupby(["x"])["y"].describe(), mapping=aes(yintercept="mean"), color="red",alpha=0.5) +
    geom_hline(data=df.groupby(["x"])["y"].describe(), mapping=aes(yintercept="50%"), color="blue",alpha=0.5)
)
plt.show()

Tested with plotnine-0.13.6 and Python 3.10

has2k1 commented 1 month ago

For the violin, the quantiles are calculated for the density distribution. For the boxplot they are calculated for the original data.
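To make the distinction concrete, here is a minimal NumPy-only sketch. It assumes the violin quantiles are read off the cumulative sum of the estimated density over its evaluation grid; the hand-written Gaussian KDE, the Scott's-rule bandwidth and the grid trimmed to the data range are illustrative assumptions, not plotnine's actual code.

```python
import numpy as np


def gaussian_kde_on_grid(sample, grid, bandwidth):
    """Evaluate a plain Gaussian kernel density estimate on `grid`."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    norm = len(sample) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


def density_quantiles(grid, density, probs):
    """Quantiles read off the normalised cumulative density on the grid."""
    cdf = np.cumsum(density)
    cdf = cdf / cdf[-1]
    return np.interp(probs, cdf, grid)


rng = np.random.default_rng(0)
y = rng.gamma(1, 2, 10)
probs = [0.25, 0.5, 0.75]

# Assumption: Scott's-rule bandwidth and a grid limited to the data range.
bandwidth = y.std(ddof=1) * len(y) ** (-1 / 5)
grid = np.linspace(y.min(), y.max(), 512)
density = gaussian_kde_on_grid(y, grid, bandwidth)

from_density = density_quantiles(grid, density, probs)  # violin-style
from_data = np.quantile(y, probs)                       # boxplot-style

print("quantiles of the density estimate:", from_density)
print("quantiles of the original data:   ", from_data)
```

With only 10 observations the two sets of values will generally not coincide, because the first depends on the bandwidth and the grid while the second depends only on the data.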

The options that do not change the current behaviour are:

  1. Document this behaviour
  2. Have an option to specify whether to calculate the quantiles using the original data or the data from the density distribution.

glhr commented 1 month ago

Thanks for clarifying (and for the really great package!). It's quite fuzzy what the quantiles of the density distribution represent, since they depend on how the density estimation is implemented. For research publications, I would really like an option to draw the quantiles of the original data.

I'm willing to implement this feature if you like.
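
Until such an option exists, one possible workaround is to compute the quantiles of the original data with pandas and overlay them as a separate layer. The sketch below is only an illustration: the helper names data_quantiles and violin_with_data_quantiles are hypothetical, and drawing each quantile as a zero-height geom_errorbar is just one way to get a short horizontal line per group.

```python
import numpy as np
import pandas as pd


def data_quantiles(df, probs, x="x", y="y"):
    """Per-group quantiles of the raw data, in long format (x, prob, y)."""
    q = df.groupby(x)[y].quantile(probs)  # MultiIndex: (group, probability)
    return q.rename_axis([x, "prob"]).reset_index()


def violin_with_data_quantiles(df, probs, x="x", y="y"):
    """Violin plot with raw-data quantiles overlaid (hypothetical helper)."""
    # Imported here so the quantile helper works without plotnine installed.
    from plotnine import ggplot, aes, geom_violin, geom_errorbar

    q = data_quantiles(df, probs, x=x, y=y)
    return (
        ggplot(df, aes(x=x, y=y))
        + geom_violin()
        # ymin == ymax, so each error bar collapses to a horizontal line.
        + geom_errorbar(
            aes(x=x, ymin=y, ymax=y),
            data=q,
            width=0.5,
            inherit_aes=False,
        )
    )


df = pd.DataFrame({
    "y": np.random.default_rng(0).gamma(1, 2, 10),
    "x": ["a"] * 10,
})
print(data_quantiles(df, [0.25, 0.5, 0.75]))
# violin_with_data_quantiles(df, [0.25, 0.5, 0.75]).show()
```

The lines drawn this way match what geom_boxplot shows for the median, since both are computed from the original observations rather than from the density estimate.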