vincentarelbundock / pymarginaleffects

GNU General Public License v3.0
58 stars 10 forks source link

Specifying model without `statsmodels.formulas.api` seems to not work in both `pandas` and `polars` #140

Open artiom-matvei opened 2 weeks ago

artiom-matvei commented 2 weeks ago

I think it would be useful to support specifying a model without statsmodels.formulas.api

Pandas example

import pandas as pd
import statsmodels.api as sm
import marginaleffects as me

# Load and prepare the data
penguins = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv")
penguins_clean = penguins[['island', 'bill_length_mm', 'flipper_length_mm']].dropna()
penguins_clean['island'] = pd.Categorical(penguins_clean['island'], categories=["Biscoe", "Dream", "Torgersen"])

# Prepare predictors and target
X = penguins_clean[[ 'bill_length_mm', 'flipper_length_mm']]
X = sm.add_constant(X)
y = penguins_clean['island'].cat.codes

# Fit the model
model_py = sm.MNLogit(y, X).fit()

# Extract and display coefficients
coef_py = model_py.params
print("Coefficients from Python model:")
print(coef_py)

me.predictions(model_py)

Currently it throws the error below, probably meaning that get_modeldata() could be generalized:

File c:\\...\\pymarginaleffects\\marginaleffects\\model_statsmodels.py:17, in ModelStatsmodels.get_modeldata(self)
     16 def get_modeldata(self):
---> 17     df = self.model.model.data.frame
     18     if not isinstance(df, pl.DataFrame):
     19         df = pl.from_pandas(df)

AttributeError: 'PandasData' object has no attribute 'frame'"

Polars example

Changing to polars does not fix it, it seems like get_modeldata() could be improved. See example below and error thrown:

import polars as pl
import statsmodels.api as sm
import marginaleffects as me

# Load and prepare the data
penguins = pl.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv")
penguins_clean = penguins.select(['island', 'bill_length_mm', 'flipper_length_mm']).drop_nulls()

# Step 3: Define island categories and create a mapping
island_categories = ["Biscoe", "Dream", "Torgersen"]
island_mapping = {island: code for code, island in enumerate(island_categories)}

# Step 4: Map 'island' to integer codes and create a new column 'island_code'
penguins_clean = penguins_clean.with_columns(
    pl.col('island').replace_strict(island_mapping).alias('island_code')
)

# Step 5: Prepare predictor variables
X = penguins_clean.select(['bill_length_mm', 'flipper_length_mm'])

# Step 6: Add a constant term for the intercept
X = X.with_columns(pl.lit(1.0).alias('const'))
X = X.select(['const', 'bill_length_mm', 'flipper_length_mm'])

# Step 7: Extract the target variable
y = penguins_clean['island']

# Step 8: Convert Polars DataFrames to NumPy arrays
X_np = X.to_numpy()
y_np = y.to_numpy()

# Step 9: Fit the multinomial logistic regression model
model_py = sm.MNLogit(y_np, X_np).fit()

# Step 10: Extract and display coefficients
coef_py = model_py.params
print("Coefficients from Python model:")
print(coef_py)

me.predictions(model_py)

Error thrown:

File c:\\...\\pymarginaleffects\\marginaleffects\\model_statsmodels.py:17, in ModelStatsmodels.get_modeldata(self)
     16 def get_modeldata(self):
---> 17     df = self.model.model.data.frame
     18     if not isinstance(df, pl.DataFrame):
     19         df = pl.from_pandas(df)

AttributeError: 'ModelData' object has no attribute 'frame'"
vincentarelbundock commented 2 weeks ago

This is not possible because all our contrast building functions work on data frames, not on arrays. For example, imagine you have a numeric predictor and a string predictor. In a data frame, that is two columns. In a numpy array, we must one hot encode the string variable, so the array has many more columns.

This is why in the scikit learn proposal, there must be a pipeline object to convert a single newdata into two separate X and y: https://github.com/vincentarelbundock/pymarginaleffects/issues/35