vincentarelbundock / pymarginaleffects

GNU General Public License v3.0
47 stars 8 forks source link

Incorrect `predictions(..., newdata=, by=)` results due to padding #95

Closed RoelVerbelen closed 1 month ago

RoelVerbelen commented 3 months ago

The padding logic in lines has some undesired effects when using the by argument in predictions(). Reason being that the unpadding logic doesn't get applied in this scenario.

Illustrative example: notice how the average for cut == "Ideal" is off when using predictions(..., by="cut") on the new data.

Maybe the most elegant solution to resolve this bug is to avoid the padding and unpadding logic altogether based on the design matrix suggestion in here?

Happy to hear your thoughts @vincentarelbundock.

import pandas as pd
import patsy
import statsmodels.formula.api as smf
from marginaleffects import predictions

diamonds = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/ggplot2/diamonds.csv")

model = smf.ols("price ~ cut + clarity + color", data = diamonds).fit()

# New data to predict on (does not have all levels for clarity and color)
newdata = diamonds.iloc[0:20,:].copy()

# Expected averages
newdata["pred"] = model.predict(newdata)
newdata.groupby("cut", dropna=False).agg(pred=("pred", "mean")).reset_index()
    cut pred
0   Fair       3168.858681
1   Good       4846.841115
2   Ideal      5079.555111
3   Premium    4286.566011
4   Very Good  4529.955872

# Unexpected averages due to padding
predictions(model,  newdata=newdata, by="cut")
cut estimate    std_error   statistic   p_value s_value conf_low    conf_high
0   Fair    3168.858681 109.600257  28.912876   0.0 inf 2954.046126 3383.671237
1   Good    4846.841115 79.252595   61.156876   0.0 inf 4691.508884 5002.173347
2   Ideal   3901.606935 39.067764   99.867680   0.0 inf 3825.035525 3978.178345
3   Premium 4286.566011 48.503250   88.376883   0.0 inf 4191.501389 4381.630633
4   Very Good   4529.955872 49.580522   91.365634   0.0 inf 4432.779834 4627.131910
vincentarelbundock commented 2 months ago

Thanks for the report @RoelVerbelen !

This sounds like a fantastic idea.

Unfortunately, I'm overcommitted right now. But I'd be happy to review and merge.

Does @LamAdr have time to look into this?

LamAdr commented 2 months ago

In about 2-3 weeks, absolutely

RoelVerbelen commented 1 month ago

Confirming that this is resolved in the latest version of the code, thanks a lot!