matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License
313 stars 21 forks source link

Potential bug in Interacting variables via `:` syntax for categorical variables #182

Closed s3alfisc closed 3 months ago

s3alfisc commented 3 months ago

Hi Matthew (@matthewwardrop),

First of all, thanks so much for answering all my questions. And please apologize when I sometimes do not respond immediately. I know that I should do better!

Now I am wondering if I have encountered a bug with the : operator.

Based on the formulaic docs, the : operator "Adds a new term that corresponds to the interaction of its operands (i.e. their elementwise product)."

But when I interact categorical variables, the variables themselves are added. Is that desired behavior?

Here is a reproducible example:

%load_ext autoreload
%autoreload 2
%reload_ext watermark
%watermark --iversions

# formulaic: 1.0.1
# pandas   : 1.5.3
# pyfixest : 0.18.0b2

import pyfixest as pf
import pandas as pd
from formulaic import model_matrix

data = pf.get_data(N = 200)
data.dtypes
# all float64

Simply interacting to floats creates the interaction and nothing else:

_, X1 = model_matrix("Y ~ f1:f2", data=data)
X1.columns
# Index(['Intercept', 'f1:f2'], dtype='object')

But converting the floats to categoricals leads to inclusion of "direct" effects:

_, X1 = model_matrix("Y ~ C(f1):C(f2)", data=data)
# X1.columns
# Index(['Intercept', 'C(f2)[T.1.0]', 'C(f2)[T.2.0]', 'C(f2)[T.3.0]',
#        'C(f2)[T.4.0]', 'C(f2)[T.5.0]', 'C(f2)[T.6.0]', 'C(f2)[T.7.0]',
#        'C(f2)[T.8.0]', 'C(f2)[T.9.0]',
#        ...

The same occurs when converting the variables to categoricals outside of model_matrix:

data["f1"] = pd.Categorical(data["f1"])
data["f2"] = pd.Categorical(data["f2"])

_, X1 = model_matrix("Y ~ f1:f2", data=data)
X1.columns
#Index(['Intercept', 'f2[T.1.0]', 'f2[T.2.0]', 'f2[T.3.0]', 'f2[T.4.0]',
#       'f2[T.5.0]', 'f2[T.6.0]', 'f2[T.7.0]', 'f2[T.8.0]', 'f2[T.9.0]',
#        ...
#       'f1[T.3.0]:f2[T.12.0]', 'f1[T.4.0]:f2[T.12.0]', 'f1[T.5.0]:f2[T.12.0]',

It seems to have nothing to do with NaN values:

_, X1 = model_matrix("Y ~ f1:f2", data=data.dropna())
X1.columns
# Index(['Intercept', 'f2[T.1]', 'f2[T.2]', 'f2[T.3]', 'f2[T.4]', 'f2[T.5]',
#        'f2[T.6]', 'f2[T.7]', 'f2[T.8]', 'f2[T.9]',
#        ...

If one of the interacted variables if of a numeric type, everything works as expected:

import numpy as np
data = data.dropna()

data["f1"] = data["f1"].astype(np.int64)
data["f2"] = data["f2"].astype(np.int64)

_, X1 = model_matrix("Y ~ C(f1):f2", data=data)
X1.columns
# Index(['Intercept', 'C(f1)[T.0]:f2', 'C(f1)[T.1]:f2', 'C(f1)[T.2]:f2',
#       'C(f1)[T.3]:f2', 'C(f1)[T.4]:f2', 'C(f1)[T.5]:f2', 'C(f1)[T.6]:f2',
#        'C(f1)[T.7]:f2', 'C(f1)[T.8]:f2', 'C(f1)[T.9]:f2', 'C(f1)[T.10]:f2',
#       'C(f1)[T.11]:f2', 'C(f1)[T.12]:f2'],

For reference, inR, interacting to factors does as described in the formulaic docs:

library(reticulate)
data = py$data
data$f1 = as.factor(data$f1)
data$f2 = as.factor(data$f2)

model.matrix(Y ~ f1:f2, data = data) |> colnames()
#  [1] "(Intercept)" "f10:f20"     "f11:f20"     "f12:f20"    
#  [5] "f13:f20"     "f14:f20"     "f15:f20"     "f16:f20"    
#  [9] "f17:f20"     "f18:f20"     "f19:f20"     "f110:f20"   
# [13] "f111:f20"    "f112:f20"    "f1NaN:f20"   "f10:f21" 

Best, Alex

matthewwardrop commented 3 months ago

Hi Alex! Thanks for reaching out again.

This is also not a bug, and is documented abstractly here: https://matthewwardrop.github.io/formulaic/guides/grammar/#behaviours-and-conventions image The patsy documentation related to this is here: https://patsy.readthedocs.io/en/latest/formulas.html#redundancy-and-categorical-factors.

In short, though, we want to ensure that the matrices generated are (at least structurally) full rank. Since categorical encodings span the intercept, including the full Cartesian product of values would span the intercept column, leading to non-full-rank model matrices. Instead, we give you full set of features that span the requested Cartesian product.

If you omit the intecept, you will see that the Cartesian product is more like you expect. image

You can also disable this behaviour by passing ensure_full_rank=False to the model_matrix or get_model_matrix functions, for example: image At the cost of guaranteeing structurally that the model matrix will not be full rank. In (e.g.) linear regression, this will cause X.T @ X to not be invertible and introduce a gauge degree of freedom in your model.

R is inconsistent with this kind of behaviour. It tries to reduce the rank of categorical variables on their own, but gives up for higher order interactions. We just do it properly.

s3alfisc commented 3 months ago

Ah, ok, this makes perfect sense! Thank you for your swift response =) My incorrect assumption was that formulaic mimics R's model.matrix(), but obviously it is based on patsy. I will have to think through if I want to set ensure_full_rank = False as a default for pyfixest (for perfect compatibility with fixest / R), or if I should keep with the formulaic defaults. Both fixest and pyfixest include a function to detect and drop multicollinear variables, so I (can) in principle ensure that the design matrix is of full rank even after calling formulaic.get_model_matrix(..., ensure_full_rank = False). I should also spend some time to read through the patsy docs!

As a side point, I've worked quite a bit with formulaic in the past week, and I found myself several times thinking that working with formulaic is really quite a pleasant experience! Starting from a already high base, using some of the "more advanced" features of the package has clearly increased my appreciation of your work even further! =)

matthewwardrop commented 3 months ago

Thanks for the compliment!

It's worth noting, though, that even when you set ensure_full_rank=False it won't match R exactly in all cases... Since R does sometimes try to reduce rank of categorical encodings to maintain full rank... It just isn't as consistent in doing so. There is no option in Formulaic that perfectly matches R's inconsistent behaviour.