has2k1 / plotnine

A Grammar of Graphics for Python
https://plotnine.org
MIT License

geom_density ignores "weight" argument #392

Closed Hoeze closed 4 years ago

Hoeze commented 4 years ago

Hi, I noticed that when running pn.ggplot(df, pn.aes(x="x", weight="w")) + pn.geom_density() the weight is ignored. I am using plotnine version 0.6.0.

I validated the difference by running df.reindex(df.index.repeat(df["w"])) and plotting this without the weight argument.
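
The check described above can be sketched on a toy frame. The data values here are made up for illustration; only the column names "x" and "w" come from the report. Repeating each row w times turns integer weights into plain row counts, so an unweighted density of the expanded frame is what a weighted density of the original should look like.

```python
import pandas as pd

# Toy data (hypothetical values); "x" and "w" follow the report above.
df = pd.DataFrame({"x": [0.5, 1.0, 2.0], "w": [1, 3, 2]})

# Repeat each row w times so that integer weights become row counts.
expanded = df.reindex(df.index.repeat(df["w"])).reset_index(drop=True)

print(expanded["x"].tolist())  # → [0.5, 1.0, 1.0, 1.0, 2.0, 2.0]
```

This only works for non-negative integer weights, since a row cannot be repeated a fractional number of times.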

pkhokhlov commented 4 years ago

I encountered this issue as well. Please see the example below:

import pandas as pd
import plotnine as pn
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200,
                           n_features=1,
                           n_informative=1,
                           n_redundant=0,
                           n_clusters_per_class=1,
                           random_state=2)

df = pd.DataFrame({"x" : X.T[0], "y" : y})
df.y = df.y.astype("category")

df["wt"] = np.where(df["y"] == 1, 5, 1)

(pn.ggplot(df, pn.aes("x", fill="y")) +
 pn.geom_density(position="fill") +
 pn.theme_seaborn(style="whitegrid"))

This produces the following plot: [image: stacked_density1]

If we do:

(pn.ggplot(df, pn.aes("x", fill="y", weight="wt")) +
 pn.geom_density(position="fill") +
 pn.theme_seaborn(style="whitegrid"))

or

(pn.ggplot(df, pn.aes("x", fill="y")) +
 pn.geom_density(pn.aes(weight="wt"), position="fill") +
 pn.theme_seaborn(style="whitegrid"))

we get the same plot. However, if we do:

df2 = df.reindex(df.index.repeat(df["wt"]))

(pn.ggplot(df2, pn.aes("x", "stat(count)", fill="y")) +
 pn.geom_density(position="fill") +
 pn.theme_seaborn(style="whitegrid"))

we get: [image: stacked_density2]

Which is the expected result.

@has2k1 is there a way to produce the last plot above using weight or without repeating rows in a dataframe?

has2k1 commented 4 years ago

This issue catches two bugs: one in plotnine and another in statsmodels. In conclusion, it seems weight is not a commonly used parameter across the ecosystem!

> @has2k1 is there a way to produce the last plot above using weight or without repeating rows in a dataframe?

No.

Edit: I can work around the issue in statsmodels without submitting a PR over there.

has2k1 commented 4 years ago

@pkhokhlov These two code snippets do weighting differently (with the bug fixed).

1.

(pn.ggplot(df, pn.aes("x", fill="y")) +
 pn.geom_density(pn.aes(weight="wt"), position="fill") +
 pn.theme_seaborn(style="whitegrid"))

2.

df2 = df.reindex(df.index.repeat(df["wt"]))

(pn.ggplot(df2, pn.aes("x", "stat(count)", fill="y")) +
 pn.geom_density(position="fill") +
 pn.theme_seaborn(style="whitegrid"))

In snippet 1 the weight is normalised within each group, while in snippet 2 the weight is applied across the whole dataset (i.e. all groups). So in snippet 1 the weight within each group is constant across all items, as a result of df["wt"] = np.where(df["y"] == 1, 5, 1) and fill='y', and normalising a constant weight leaves that group's density unchanged.
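
The arithmetic behind this difference can be sketched with plain pandas. This is an illustration of the normalisation only, not plotnine's internal code, and the five-row frame is a hypothetical stand-in for the df used above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: two rows in class 0 (weight 1), three in class 1 (weight 5).
df = pd.DataFrame({"y": [0, 0, 1, 1, 1], "wt": [1, 1, 5, 5, 5]})

# Snippet 1 style: weights are rescaled to sum to 1 inside each group.
within = df["wt"] / df.groupby("y")["wt"].transform("sum")

# A constant weight per group normalises to the same values as no weight at all.
uniform = 1 / df.groupby("y")["wt"].transform("count")
assert np.allclose(within, uniform)

# Snippet 2 style: weights act across groups, changing each group's share of the total.
weighted_share = df.groupby("y")["wt"].sum() / df["wt"].sum()  # 2/17 and 15/17
unweighted_share = df.groupby("y").size() / len(df)            # 2/5 and 3/5
```

Within a group the 5s cancel out, which is why the weighted and unweighted plots from snippet 1 look the same for this particular wt column. Across groups they do not cancel, which is what the row-repetition approach in snippet 2 picks up.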

pkhokhlov commented 4 years ago

@has2k1 understood, makes sense. Thank you for your work with the library and this fix. With the bug fixed, is there a way to produce snippet 2's result with weights across all observations without creating the new dataframe (e.g. if I have fractional weights)? If not, could you recommend some workarounds to produce a similar plot?

has2k1 commented 4 years ago

I do not think you can do that, because for a kernel density algorithm there are two ways to affect the contribution of any distinct value towards the final density.

  1. Its frequency (i.e. addition)
  2. Its weight (i.e. multiplication, which is a shortcut for addition)

For stability of the algorithms the weighting (multiplication) is normalised to the [0, 1] domain for any given density computation. That shuts out option 2 leaving you with option 1.
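
A minimal sketch of these two routes, using a hand-written Gaussian KDE rather than the statsmodels code that plotnine calls. The function name gaussian_kde_fixed_bw, the fixed bandwidth, and normalising the weights to sum to 1 are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_fixed_bw(points, grid, weights=None, bw=0.5):
    """Weighted Gaussian KDE with a fixed bandwidth (illustrative only)."""
    points = np.asarray(points, dtype=float)
    if weights is None:
        weights = np.ones_like(points)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the density integrates to 1
    z = (grid[:, None] - points[None, :]) / bw
    kernels = np.exp(-0.5 * z**2) / (bw * np.sqrt(2 * np.pi))
    return kernels @ weights           # weighted sum of one kernel per point

grid = np.linspace(-3, 6, 200)
x = np.array([0.0, 1.0, 3.0])
w = np.array([1, 3, 2])

weighted = gaussian_kde_fixed_bw(x, grid, weights=w)       # option 2: weight
repeated = gaussian_kde_fixed_bw(np.repeat(x, w), grid)    # option 1: frequency
rescaled = gaussian_kde_fixed_bw(x, grid, weights=10 * w)  # weights scaled up

assert np.allclose(weighted, repeated)  # integer weights match repetition
assert np.allclose(weighted, rescaled)  # overall scale is lost on normalising
```

The second assertion is the relevant one here: once weights are normalised inside a single density computation, multiplying them all by a constant changes nothing, so a weight cannot make one group count for more than another. Real libraries usually pick the bandwidth from the data, so weighted and repeated inputs may not match exactly there.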

So maybe you can make it easier by creating a helper function using something like

def weight_to_frequency(df, wt, precision=3):
    ns = np.round((wt / sum(wt)) * 10**precision).astype(int)  # no. times to replicate
    idx = np.repeat(df.index, ns)                              # selection indices
    df = df.loc[idx].reset_index(drop=True)                    # replication
    return df

to come up with integer replication factors.
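
A usage sketch with fractional weights. The helper is restated so the block runs on its own, with one small change (the weights are converted to a NumPy array first); the data values are hypothetical.

```python
import numpy as np
import pandas as pd

def weight_to_frequency(df, wt, precision=3):
    wt = np.asarray(wt, dtype=float)
    # Scale weights so they sum to roughly 10**precision, then round to counts.
    ns = np.round((wt / wt.sum()) * 10**precision).astype(int)
    idx = np.repeat(df.index, ns)                # selection indices
    return df.loc[idx].reset_index(drop=True)    # replication

# Hypothetical data with fractional weights.
df = pd.DataFrame({"x": [0.1, 0.4, 0.9], "wt": [0.25, 0.5, 0.25]})
df2 = weight_to_frequency(df, df["wt"], precision=2)

print(len(df2))                   # → 100
print((df2["x"] == 0.4).sum())    # → 50
```

The expanded df2 can then be passed to pn.ggplot in place of the reindexed frame from snippet 2. Two things to keep in mind: a row whose scaled weight rounds to zero is dropped entirely, and the output has about 10**precision rows regardless of the input size, so precision trades accuracy against memory.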