Closed tomicapretto closed 1 year ago
Hi @tomicapretto ,
Thanks for taking the time to report this. However, this is actually expected behaviour due to the ordered nature of formulae and the automatic full-rank algorithm (where terms to the left take precedence over terms to the right in terms of materialization).
That is:
>>> print(model_matrix("0 + g + 1", df))
g[T.a] g[T.b]
0 1 0
1 1 0
2 1 0
3 0 1
4 0 1
>>> print(model_matrix("0 + 1 + g", df))
Intercept g[T.b]
0 1.0 0
1 1.0 0
2 1.0 0
3 1.0 1
4 1.0 1
These model matrices are equivalent, and the columns span the vector space (i.e. the model matrix is full rank). In both cases the intercept is spanned, but in the former case it is spanned by the categorical factors.
I'll close this one out for now, but let me know if have further questions!
Hmm... but this does, upon further reflection, not match the behaviour expected by users of R or patsy. It might be worth special casing the full rank algorithm to deal with the intercept. Point taken. I'll fix this!
@matthewwardrop your explanation makes perfect sense (i.e. terms to the left take precedence over terms to the right). But as you said, it may be surprising to users coming from Patsy or R. I don't have a strong preference. But if you decide to continue with the current approach, it would be good to leave a note in the documentation explaining why it works the way it works in this specific case. I could open a PR :smile:
Thanks again for reporting this @tomicapretto . This should be resolved as of 0.6.5+.
In the following example I remove the intercept and then I add it. It turns out "it's never added". I'm not sure if this is a feature you would like to support, but if I mentally parse from left to right I would expect the final formula to contain the intercept.