RoelVerbelen closed this issue 2 months ago
Sharing some insights from investigating this further. All the examples fail on this line because out.shape[0] != newdata.shape[0],
but for two different reasons:
Examples 1 and 2 fail because when patsy.dmatrices()
creates the design matrix, it silently drops the rows containing NAs (the default argument is NA_action='drop',
see the patsy docs) in this line. A good solution would be to raise an error by setting NA_action='raise'
and catch it with an informative message saying that the data cannot contain missing values.
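A minimal standalone sketch (with made-up data, not the actual examples from this ticket) of the two behaviours:

```python
import numpy as np
import pandas as pd
import patsy

df = pd.DataFrame({"y": [1.0, 2.0, 3.0], "x": [1.0, np.nan, 3.0]})

# Default NA_action="drop": the NaN row is silently removed, so the
# design matrix no longer lines up with the input data.
y, X = patsy.dmatrices("y ~ x", df, NA_action="drop")
print(X.shape[0], df.shape[0])  # 2 vs 3

# NA_action="raise" fails loudly instead; the error can be caught and
# re-raised with a friendlier message.
try:
    patsy.dmatrices("y ~ x", df, NA_action="raise")
except patsy.PatsyError as err:
    print("data cannot contain missing values:", err)
```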
Example 3 fails because of the padding trick you are applying to make sure all dummy-encoded columns get created when calling patsy.dmatrices()
further down the track. In this line you consider all unique values of non-numeric columns, which can include NAs (nulls). Filtering them out at this stage would solve the error: uniqs = uniqs.drop_nulls().
A side note for your consideration: rather than relying on this padding trick, it might be easier/cleaner/safer to create the design matrix from the model's design info (which encodes all categories) instead of from the model formula:
```python
import pandas as pd
import patsy
import statsmodels.formula.api as smf

diamonds = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/ggplot2/diamonds.csv")
model = smf.ols("price ~ cut", data=diamonds).fit()
newdata = diamonds.iloc[0:1, :].copy()  # only one observed category level

# Set up the design matrix (for newdata with unobserved categories)
design_info = model.model.data.design_info
exog = patsy.dmatrix(design_info, newdata)  # shape (1, 5)
```
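A self-contained miniature of why the design-info route is safer (toy data of my own, not the diamonds example):

```python
import pandas as pd
import patsy

# Toy data: "g" is a plain string column, so slicing loses levels.
df = pd.DataFrame({"y": [1.0, 2.0, 3.0], "g": ["a", "b", "c"]})
y, X = patsy.dmatrices("y ~ g", df)
design_info = X.design_info  # remembers all three levels of g

new = df.iloc[0:1, :]  # only level "a" present

# Rebuilding from the formula only sees one level: shape (1, 1).
X_formula = patsy.dmatrix("g", new)

# Building from the stored design info keeps all levels: shape (1, 3).
X_info = patsy.dmatrix(design_info, new)
print(X_formula.shape, X_info.shape)
```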
Thanks a lot for the report and investigation @RoelVerbelen , I really appreciate it.
Both suggestions sound perfectly reasonable. I'm happy to try them out, but realistically, it won't be in the short term, since I'm overcommitted at the "real job" right now.
Of course, I'd be very happy to review a PR if you or someone else volunteers (ideally including a couple of simple tests).
Hi @LamAdr,
Thanks for addressing this by removing the padding altogether. I've re-tested the examples in this ticket using the latest version of the code from GitHub.
The first two examples now lead to a PatsyError: factor contains missing values,
which is informative and the right thing to do, and the third example now works.
Thanks again for incorporating the suggestion of using dmatrix
instead of padding! Looking forward to seeing the new version land on PyPI.
Thanks for testing, I really appreciate it!
(I think the current PyPI release includes the fix)
No worries, happy to help.
This fix actually just missed the cut for version 0.0.11, see the commit history.
ah good catch, thanks.
Should be out now in 0.0.12
Hey @vincentarelbundock,
It seems that fitting a statsmodels model, or predicting from one, using incomplete data leads to a
"ValueError: Something went wrong"
in marginaleffects.predictions()
at this line. I created some examples to demonstrate: