matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License
347 stars 25 forks source link

Intercept is not added after being removed #148

Closed tomicapretto closed 1 year ago

tomicapretto commented 1 year ago

In the following example I remove the intercept and then I add it. It turns out "it's never added". I'm not sure if this is a feature you would like to support, but if I mentally parse from left to right I would expect the final formula to contain the intercept.

import pandas as pd
import formulaic
from formulaic import model_matrix
print("version", formulaic.__version__)
df = pandas.DataFrame({"g": ["a", "a", "a", "b", "b"]})
print(model_matrix("0 + g + 1", df))
version 0.6.4
   g[T.a]  g[T.b]
0       1       0
1       1       0
2       1       0
3       0       1
4       0       1
matthewwardrop commented 1 year ago

Hi @tomicapretto ,

Thanks for taking the time to report this. However, this is actually expected behaviour due to the ordered nature of formulae and the automatic full-rank algorithm (where terms to the left take precedence over terms to the right in terms of materialization).

That is:

>>> print(model_matrix("0 + g + 1", df))
   g[T.a]  g[T.b]
0       1       0
1       1       0
2       1       0
3       0       1
4       0       1
>>> print(model_matrix("0 + 1 + g", df))
   Intercept  g[T.b]
0        1.0       0
1        1.0       0
2        1.0       0
3        1.0       1
4        1.0       1

These model matrices are equivalent, and the columns span the vector space (i.e. the model matrix is full rank). In both cases the intercept is spanned, but in the former case it is spanned by the categorical factors.

I'll close this one out for now, but let me know if have further questions!

matthewwardrop commented 1 year ago

Hmm... but this does, upon further reflection, not match the behaviour expected by users of R or patsy. It might be worth special casing the full rank algorithm to deal with the intercept. Point taken. I'll fix this!

tomicapretto commented 1 year ago

@matthewwardrop your explanation makes perfect sense (i.e. terms to the left take precedence over terms to the right). But as you said, it may be surprising to users coming from Patsy or R. I don't have a strong preference. But if you decide to continue with the current approach, it would be good to leave a note in the documentation explaining why it works the way it works in this specific case. I could open a PR :smile:

matthewwardrop commented 1 year ago

Thanks again for reporting this @tomicapretto . This should be resolved as of 0.6.5+.