Closed: Hoeze closed this issue 4 years ago.
I encountered this issue as well. Please see the example below:
```python
import pandas as pd
import plotnine as pn
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200,
                           n_features=1,
                           n_informative=1,
                           n_redundant=0,
                           n_clusters_per_class=1,
                           random_state=2)
df = pd.DataFrame({"x": X.T[0], "y": y})
df.y = df.y.astype("category")
df["wt"] = np.where(df["y"] == 1, 5, 1)

(pn.ggplot(df, pn.aes("x", fill="y")) +
 pn.geom_density(position="fill") +
 pn.theme_seaborn(style="whitegrid"))
```
Produces the following plot:
If we do:
```python
(pn.ggplot(df, pn.aes("x", fill="y", weight="wt")) +
 pn.geom_density(position="fill") +
 pn.theme_seaborn(style="whitegrid"))
```
or
```python
(pn.ggplot(df, pn.aes("x", fill="y")) +
 pn.geom_density(pn.aes(weight="wt"), position="fill") +
 pn.theme_seaborn(style="whitegrid"))
```
we get the same plot. However, if we do:
```python
df2 = df.reindex(df.index.repeat(df["wt"]))

(pn.ggplot(df2, pn.aes("x", "stat(count)", fill="y")) +
 pn.geom_density(position="fill") +
 pn.theme_seaborn(style="whitegrid"))
```
We get:
Which is the expected result.
@has2k1 is there a way to produce the last plot above using `weight`, or without repeating rows in a dataframe?
This issue has caught two bugs: one in plotnine and another in statsmodels. Conclusion: it seems `weight` is not a commonly used parameter across the ecosystem!
> @has2k1 is there a way to produce the last plot above using `weight`, or without repeating rows in a dataframe?

No.
Edit: I can work around the issue in statsmodels without submitting a PR over there.
@pkhokhlov These two code snippets do weighting differently (with the bug fixed).
1.

```python
(pn.ggplot(df, pn.aes("x", fill="y")) +
 pn.geom_density(pn.aes(weight="wt"), position="fill") +
 pn.theme_seaborn(style="whitegrid"))
```
2.

```python
df2 = df.reindex(df.index.repeat(df["wt"]))

(pn.ggplot(df2, pn.aes("x", "stat(count)", fill="y")) +
 pn.geom_density(position="fill") +
 pn.theme_seaborn(style="whitegrid"))
```
In snippet 1 the weight is normalised within each group, while in snippet 2 the weight is applied across the whole dataset (i.e. all groups). So in snippet 1 the weight within each group is constant across all items, a consequence of `df["wt"] = np.where(df["y"] == 1, 5, 1)` together with `fill="y"`.
@has2k1 understood, makes sense. Thank you for your work with the library and this fix. With the bug fixed, is there a way to produce snippet 2's result with weights across all observations without creating the new dataframe (eg if I have fractional weights)? If not, could you recommend some workarounds to produce a similar plot?
I do not think you can do that, because for a kernel density algorithm there are two ways to affect the contribution of any distinct value towards the final density: (1) replicating the value (frequency), or (2) multiplying its contribution by a weight.
For the stability of the algorithms, the weighting (multiplication) is normalised to the [0, 1] domain for any given density computation. That shuts out option 2, leaving you with option 1.
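To see why the normalisation removes the between-group difference, here is a small sketch in plain NumPy (not plotnine's internal code). The weights from `df["wt"] = np.where(df["y"] == 1, 5, 1)` are constant within each group, so once each group's weights are rescaled to sum to 1, both groups end up with identical uniform weights:

```python
import numpy as np

# Per-group weights as assigned in the example above:
# group y == 1 gets weight 5, group y == 0 gets weight 1
# (three illustrative observations per group).
w_group1 = np.array([5.0, 5.0, 5.0])
w_group0 = np.array([1.0, 1.0, 1.0])

# Normalising within each group rescales the weights to sum to 1.
norm1 = w_group1 / w_group1.sum()
norm0 = w_group0 / w_group0.sum()
print(norm1)  # [0.33333333 0.33333333 0.33333333]
print(norm0)  # [0.33333333 0.33333333 0.33333333]
```

This is why snippet 1 cannot reproduce snippet 2: the 5-vs-1 ratio between the groups is lost as soon as each group's weights are normalised independently.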
So maybe you can make it easier by creating a helper function using something like

```python
def weight_to_frequency(df, wt, precision=3):
    # no. of times to replicate each row
    ns = np.round((wt / sum(wt)) * (10 ** precision)).astype(int)
    idx = np.repeat(df.index, ns)            # selection indices
    df = df.loc[idx].reset_index(drop=True)  # replication
    return df
```

to come up with integer replication factors.
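For instance, with fractional weights the helper turns them into integer replication counts. A small self-contained sketch (the helper definition is repeated here so the example runs on its own; the tiny DataFrame and its weights are made up for illustration):

```python
import numpy as np
import pandas as pd

def weight_to_frequency(df, wt, precision=3):
    # convert (possibly fractional) weights into integer replication counts
    ns = np.round((wt / sum(wt)) * (10 ** precision)).astype(int)
    idx = np.repeat(df.index, ns)
    return df.loc[idx].reset_index(drop=True)

df = pd.DataFrame({"x": [0.1, 0.2, 0.3], "w": [1.0, 2.0, 2.0]})
df2 = weight_to_frequency(df, df["w"], precision=1)
# weights 1:2:2 of total 5 -> replication counts 2, 4, 4
print(len(df2))  # 10
```

Note that a higher `precision` gives a more faithful approximation at the cost of more rows, and rows whose normalised weight rounds to zero replications are dropped entirely.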
Hi, I noticed that when running

```python
pn.ggplot(df, pn.aes(x="x", weight="w")) + pn.geom_density()
```

the weight is ignored. I am using plotnine version 0.6.0. I validated the difference by running

```python
df.reindex(df.index.repeat(df["w"]))
```

and plotting this without the weight argument.