Missing values leading to ValueError: "Something went wrong" in predictions()

RoelVerbelen commented 5 months ago

Hey @vincentarelbundock,

It seems that fitting a statsmodels model on incomplete data, or predicting from one using incomplete data, leads to a "ValueError: Something went wrong" in marginaleffects.predictions() at this line.

I created some examples to demonstrate:

Example 1: model fitted on incomplete data, predicting on incomplete data
Example 2: model fitted on complete data, predicting on incomplete data
Example 3: model fitted on incomplete data, predicting on complete data

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from marginaleffects import predictions

diamonds = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/ggplot2/diamonds.csv")

# Example 1: model fitted on incomplete data, predicting on incomplete data

# Introduce missing values
diamonds['cut_ideal_null'] = diamonds['cut'].where(diamonds['cut'] != 'Ideal', None)

model = smf.ols("price ~ cut_ideal_null", data = diamonds).fit()

newdata = diamonds.iloc[0:20,:].copy()

# Works
model.predict(newdata)

# Fails: ValueError: Something went wrong
predictions(model, newdata=newdata)

# Example 2: model fitted on complete data, predicting on incomplete data

model = smf.ols("price ~ cut", data = diamonds).fit()

newdata = diamonds.iloc[0:20,:].copy()

# Introduce missing values
newdata['cut'] = newdata['cut'].where(newdata['cut'] != 'Ideal', None)

# Works
model.predict(newdata)

# Fails: ValueError: Something went wrong
predictions(model, newdata=newdata)

# Example 3: model fitted on incomplete data, predicting on complete data

diamonds['cut_ideal_null'] = diamonds['cut'].where(diamonds['cut'] != 'Ideal', None)

model = smf.ols("price ~ cut_ideal_null", data = diamonds).fit()

newdata = diamonds.iloc[0:20,:].copy()
newdata['cut_ideal_null'] = newdata['cut'].where(newdata['cut'] != 'Ideal', 'Premium')

# Works
model.predict(newdata)

# Fails: ValueError: Something went wrong
predictions(model, newdata=newdata)

RoelVerbelen commented 5 months ago

Sharing some insights from investigating this further. All three examples fail on this line because out.shape[0] != newdata.shape[0], but for two different reasons:

Examples 1 and 2 fail because when patsy.dmatrices() creates the design matrix, it silently drops the rows containing NAs (default argument NA_action='drop', see patsy docs) in this line. Setting NA_action='raise' and catching the resulting error with an informative message that the data cannot contain missing values would be a good solution.
Example 3 fails because of the padding trick you apply to make sure all dummy-encoded columns get created when patsy.dmatrices() is called later on. In this line you consider all unique values of non-numeric columns, which can include NAs (nulls). Filtering them out at this stage would solve the error: uniqs = uniqs.drop_nulls().
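
To illustrate the first point, here is a minimal sketch of the two NA_action settings in patsy. It uses a small hypothetical dict in place of the diamonds data and calls patsy.dmatrix() directly rather than going through marginaleffects, so it only shows the row-dropping behaviour, not the library's internals.

```python
import patsy

# Toy stand-in for newdata (hypothetical values); None marks a missing 'cut'.
newdata = {"cut": ["Ideal", None, "Good", "Premium"]}

# Default behaviour: rows with missing values are silently dropped,
# so the design matrix has fewer rows than the input data.
dropped = patsy.dmatrix("cut", newdata, NA_action="drop")
print(dropped.shape[0], len(newdata["cut"]))

# Stricter behaviour: patsy raises instead of dropping rows.
try:
    patsy.dmatrix("cut", newdata, NA_action="raise")
    error_message = None
except patsy.PatsyError as err:
    error_message = str(err)
print(error_message)
```

With the default setting the row counts diverge, which is the mismatch that ends up triggering the generic ValueError; with NA_action='raise' the problem surfaces at the point where the missing value is encountered.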

A side note for your consideration: rather than relying on this padding trick and the model formula, it might be easier, cleaner, and safer to create the design matrix from the model's design info, which encodes all categories:

import pandas as pd
import patsy
import statsmodels.formula.api as smf

diamonds = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/ggplot2/diamonds.csv")

model = smf.ols("price ~ cut", data = diamonds).fit()

newdata = diamonds.iloc[0:1,:].copy() # only one observed category level

# Set up design matrix (for newdata with unobserved categories)
design_info = model.model.data.design_info
exog = patsy.dmatrix(design_info, newdata) # shape (1, 5)
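
The difference between the two approaches can be seen without fitting a model. The following is a rough sketch that assumes toy dicts in place of the diamonds data (the names train and new are hypothetical) and takes the design info from a patsy design matrix rather than from a statsmodels model.

```python
import patsy

# Hypothetical training data containing all five levels of 'cut'.
train = {"cut": ["Fair", "Good", "Ideal", "Premium", "Very Good"]}
# Hypothetical new data in which only two of the levels are observed.
new = {"cut": ["Ideal", "Good"]}

# Design matrix for the training data; its design_info remembers all levels.
train_matrix = patsy.dmatrix("cut", train)

# Rebuilding from the formula alone only sees the levels present in `new`.
from_formula = patsy.dmatrix("cut", new)

# Reusing the stored design info reproduces the full set of dummy columns.
from_design = patsy.dmatrix(train_matrix.design_info, new)

print(from_formula.shape, from_design.shape)
```

The formula-based matrix has fewer columns than the model expects, which is the gap the padding trick was compensating for; the design-info-based matrix matches the training layout directly.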

vincentarelbundock commented 5 months ago

Thanks a lot for the report and investigation @RoelVerbelen , I really appreciate it.

Both suggestions sound perfectly reasonable. I'm happy to try them out, but realistically, it won't be in the short term, since I'm overcommitted at the "real job" right now.

Of course, I'd be very happy to review a PR if you or someone else volunteers (ideally, including a couple simple tests).

RoelVerbelen commented 1 month ago

Hi @LamAdr,

Thanks for addressing this by removing the padding altogether. I've tested the examples in this ticket again using the latest version of the code from GitHub.

The first two examples now lead to a PatsyError: factor contains missing values, which is informative and the right thing to do, and the third example now works.
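
For anyone who hits that PatsyError and simply wants predictions for the complete rows, one possible workaround (an assumption on my part, not something prescribed in this thread) is to drop incomplete rows on the model's variables before calling predictions(). A minimal pandas-only sketch with hypothetical toy values:

```python
import pandas as pd

# Hypothetical newdata with one missing value in the predictor 'cut'.
newdata = pd.DataFrame(
    {"price": [326, 327, 334], "cut": ["Ideal", None, "Good"]}
)

# Keep only the rows where the model's predictor is observed.
complete = newdata.dropna(subset=["cut"])
print(len(newdata), len(complete))

# `complete` could then be passed as newdata=complete to predictions().
```

The row index of complete still points back to the original rows, so the predictions can be aligned with newdata afterwards if needed.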

Thanks again for incorporating the suggestion of using dmatrix instead of padding! Looking forward to seeing the new version land on PyPI.

vincentarelbundock commented 1 month ago

Thanks for testing, I really appreciate it!

(I think the current PyPI release includes the fix)

RoelVerbelen commented 1 month ago

No worries, happy to help.

This fix just missed the cut for version 0.0.11 actually; see the commit history.

vincentarelbundock commented 1 month ago

ah good catch, thanks.

Should be out now in 0.0.12

vincentarelbundock / pymarginaleffects

Missing values leading to ValueError: "Something went wrong" in predictions() #83