matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

Is it possible to force the `Formula` class to not expand categorical variables? #178

Closed by s3alfisc 8 months ago

s3alfisc commented 8 months ago

Hi @matthewwardrop, I just realized all the power given to me by the Formula class - it's really super neat and will save me quite a few lines of code! For my use case, I'd like to force Formula.get_model_matrix() to not expand categorical variables for one particular component of the formula, "fixef". Is it possible to achieve this easily?

Here is a quick example:

from formulaic import Formula
import pyfixest as pf

data = pf.get_data()
Formula(lhs = "Y", fixef = "f1 + f2").get_model_matrix(data)

If f1 is of type pd.Categorical, .get_model_matrix() applies the standard one-hot encoding. But I'd like fixef to be returned as a non-encoded data frame, i.e. the output should look as if f1 and f2 were plain numeric columns:

.fixef:
           f1    f2
    1     6.0  21.0
    3     1.0  10.0
    4    19.0  20.0
    5    13.0   3.0
    6     2.0  16.0
    ..    ...   ...
    995  14.0  23.0
    996  19.0  17.0
    997   3.0   5.0
    998  18.0  20.0
    999   4.0  19.0

Is this possible to achieve?

Best, Alex

s3alfisc commented 8 months ago

In R, one could reach the desired outcome by wrapping the factor / categorical in as.numeric():

data(mtcars)
mtcars[, "hp"] = as.factor(mtcars[, "hp"])
sapply(mtcars, class)
# mpg       cyl      disp        hp      drat        wt 
# "numeric" "numeric" "numeric"  "factor" "numeric" "numeric" 
# qsec        vs        am      gear      carb 
# "numeric" "numeric" "numeric" "numeric" "numeric" 

model.matrix(mpg ~ hp, mtcars) |> dim() # [1] 32 22
model.matrix(mpg ~ as.numeric(hp), mtcars) |> dim() # [1] 32  2
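
One caveat when carrying this idiom over to Python: in R, as.numeric() applied to a factor returns the integer level codes rather than the original values, whereas converting a pandas categorical to a numeric dtype recovers the values themselves. A minimal pandas sketch of the distinction, using made-up toy values rather than the mtcars data:

```python
import pandas as pd

# Toy horsepower values stored as a categorical (made up for illustration).
hp = pd.Series([110, 93, 110, 175], dtype="category")

# Integer level codes: the analogue of R's as.numeric(<factor>).
# Note that pandas codes are 0-based while R's are 1-based.
codes = hp.cat.codes

# The underlying values, recovered by casting the categorical to float.
values = hp.astype(float)

print(codes.tolist())
print(values.tolist())
```

Which of the two is wanted depends on the use case: codes are enough to identify groups, while the cast keeps the original scale.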

s3alfisc commented 8 months ago

Ah, of course it works with contexts:

from formulaic import Formula
import pyfixest as pf
import pandas as pd

def to_numeric(x):
    return pd.to_numeric(x)

data = pf.get_data()
data["f1"] = data["f1"].astype("category")
Formula(lhs = "Y", fixef = "to_numeric(f1) + f2").get_model_matrix(
    data, context = {"to_numeric": to_numeric}
)

.fixef:
         to_numeric(f1)    f2
    1               6.0  21.0
    3               1.0  10.0
    4              19.0  20.0
    5              13.0   3.0
    6               2.0  16.0
    ..              ...   ...
    995            14.0  23.0
    996            19.0  17.0
    997             3.0   5.0
    998            18.0  20.0
    999             4.0  19.0

Cool! =)

matthewwardrop commented 5 months ago

A bit late, and a bit data-specific, but you could also use: Formula(lhs = "Y", fixef = "f1.to_numeric() + f2").

I wonder whether we should add something like O(), N(), and raw() for explicit ordinal, numerical, and passthrough encodings respectively?