matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License
345 stars 25 forks source link

Dropping Indices via "+0" or "-1" and reference levels for categoricals #161

Closed s3alfisc closed 1 year ago

s3alfisc commented 1 year ago

Hi Matthew,

I am trying to implement some tricks for reference levels of categorical variables that I need for event study estimators. I've observed the following behavior, and I wonder if it is desired, or if it might be a bug =)

First I load some data:

import pandas as pd
from formulaic import model_matrix
df_het = pd.read_csv("https://raw.githubusercontent.com/s3alfisc/pyfixest/master/pyfixest/experimental/data/df_het.csv")
df_het.head()

# unit  state   group   unit_fe g   year    year_fe treat   rel_year    rel_year_binned error   te  te_dynamic  dep_var
# 0 1   33  Group 2 7.043016    2010    1990    0.066159    False   -20.0   -6  -0.086466   0   0.0 7.022709
# 1 1   33  Group 2 7.043016    2010    1991    -0.030980   False   -19.0   -6  0.766593    0   0.0 7.778628
# 2 1   33  Group 2 7.043016    2010    1992    -0.119607   False   -18.0   -6  1.512968    0   0.0 8.436377
# 3 1   33  Group 2 7.043016    2010    1993    0.126321    False   -17.0   -6  0.021870    0   0.0 7.191207
# 4 1   33  Group 2 7.043016    2010    1994    -0.106921   False   -16.0   -6  -0.017603   0   0.0 6.918492

I now create a model matrix via formulaic and drop state 1 / set it as the reference level:

# Intercept, reference state 1 is dropped
_, X = model_matrix("dep_var ~ C(state, contr.treatment(base=1))", df_het)
X.columns
#Index(['Intercept', 'C(state, contr.treatment(base=1))[T.2]',
#       'C(state, contr.treatment(base=1))[T.3]',
#       'C(state, contr.treatment(base=1))[T.4]',
#       'C(state, contr.treatment(base=1))[T.5]',

As expected, the model matrix has an intercept & state 1 is "dropped".

If I now add a "+0" or "-1" to the formula, I'd expect from R's fixest that the Intercept is dropped on top of the reference state 1.

But this does not seem to happen - only the intercept is dropped, and the contr.treatment argument appears to be overturned.

# adding -1 removes the intercept, but the reference state "1" is no longer dropped
_, X = model_matrix("dep_var ~ -1 + C(state, contr.treatment(base=1))", df_het)
X.columns
#Index(['C(state, contr.treatment(base=1))[T.1]',
#       'C(state, contr.treatment(base=1))[T.2]',
#       'C(state, contr.treatment(base=1))[T.3]',
#       'C(state, contr.treatment(base=1))[T.4]',

In R's fixest - which I understand is not at all formulaic's main role model - we get the same model matrix as above minus the column for the reference state 1.

library(data.table)
library(fixest)
df_het <- fread("https://raw.githubusercontent.com/s3alfisc/pyfixest/master/pyfixest/experimental/data/df_het.csv")

es <- feols(dep_var ~ - 1 + i(state, ref = 1) , df_het)
fixest:::model.matrix.fixest(es, type = "rhs") |> colnames()
# [1] "state::2"  "state::3"  "state::4"  "state::5" 
# [5] "state::6"  "state::7"  "state::8"  "state::9" 
# [9] "state::10" "state::11" "state::12" "state::13"
# [13] "state::14" "state::15" "state::16" "state::17"
# [17] "state::18" "state::19" "state::20" "state::21"
# [21] "state::22" "state::23" "state::24" "state::25"
# [25] "state::26" "state::27" "state::28" "state::29"
# [29] "state::30" "state::31" "state::32" "state::33"
# [33] "state::34" "state::35" "state::36" "state::37"
# [37] "state::38" "state::39" "state::40"

So, is this intended behavior? =)

Best, Alex

s3alfisc commented 1 year ago

Ok, I did a little bit of googling, and I am - after almost 10 years of running regressions in R - astonished to learn that formulaic's C(, contr.treatment) syntax goes back to base R!

df_het$state <- as.factor(df_het$state)
n = length(unique(df_het$state))
lm( dep_var ~ -1 + C(state, contr.treatment(n = n,base=1)), data=df_het ) |> coefficients()
# C(state, contr.treatment(n = n, base = 1))1 
# 0.7395872 
# C(state, contr.treatment(n = n, base = 1))2 
# 1.1637809 
# C(state, contr.treatment(n = n, base = 1))3 
# 0.7334002 

So I suppose everything is as it should be and I will close this issue. Though if you happen to know a good way to specify in the formula how to drop an Intercept "ex post" - please let me know! =)