matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

Potential Bug / different defaults for Intercept / Reference Levels when using `Formula.get_model_matrix()` with categoricals #181

s3alfisc closed this issue 3 months ago

s3alfisc commented 3 months ago

Hi @matthewwardrop, I am wondering if the following behavior is a bug or not:

`model_matrix()` and `Formula.get_model_matrix()` handle the intercept and reference levels for categoricals differently: `model_matrix()` includes an intercept and drops a reference level by default, while `Formula.get_model_matrix()` does neither.

from formulaic import Formula, model_matrix
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]})

_, rhs = model_matrix("a ~ C(b)", data=df)
rhs
#    Intercept  C(b)[T.2]  C(b)[T.3]
# 0        1.0          0          0
# 1        1.0          1          0
# 2        1.0          0          1

Formula(fml1="a ~ C(b)").get_model_matrix(df)
#    .rhs:
#           C(b)[T.1]  C(b)[T.2]  C(b)[T.3]
#        0          1          0          0
#        1          0          1          0
#        2          0          0          1

If this is the intended behavior, is it possible to reproduce the `model_matrix()` defaults with `Formula.get_model_matrix()`?

s3alfisc commented 3 months ago

Thought about this for a bit and concluded it is likely intended behavior =)

matthewwardrop commented 3 months ago

Hi @s3alfisc,

Just wanted to clarify a few things here:

Hope that helps.