matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License
345 stars 25 forks source link

Add `include_intercept` to the `Formula` constructor? #71

Closed teucer closed 1 year ago

teucer commented 2 years ago

I think there is a discrepancy in the output of rhs:

  1. formula "y~x" -> rhs = "x+1"
  2. formula "y~x-1" -> rhs = "x"
  3. formula "y~x+0" -> rhs = "x"

I would have expected to get rhs = "x-1" or rhs = "x+0" for the second and third case.

matthewwardrop commented 2 years ago

Hi @teucer ! Actually this is behaving as expected. In Wilkinson formulae, -1 and +0 both have the effect of removing the intercept. Is there something I'm missing?

teucer commented 2 years ago

We want to be able to say model_matrix(rhs). But this would add back the intercept for 2nd and 3rd cases.

Maybe model_matrix should handle it:

  1. y~x
  2. ~x
  3. x

There could be a difference between 2 and 3. Right now they are handled the same way.

matthewwardrop commented 2 years ago

Hmmmm... I'm a bit confused as to what you are trying to achieve here.

There was a bug in formulaic.model_matrix related to ingesting pre-processed formula that is now fixed, and will go out in the next point release, but basically you should be able to do this:

>>> import pandas
>>> from formulaic import Formula, model_matrix
>>> f = Formula("y ~ x - 1")
>>> f.rhs
x
>>> model_matrix(f.rhs, pandas.DataFrame({"x": [1,2,3]}))
    x
0   1
1   2
2   3

Why are you re-processing string representations of the right hand side?

teucer commented 2 years ago

The issue is that in the current version

>>> model_matrix(f.rhs, pandas.DataFrame({"x": [1,2,3]}))
# -> FormulaInvalidError

>>> model_matrix("x", pandas.DataFrame({"x": [1,2,3]}))
        Intercept  x
0         1.0  1
1         1.0  2
2         1.0  3

Given the bug, we were trying to do with the string.

The more fundamental questions are, should there be a difference:

  1. between Formula("~x") and Formula("x")
  2. if f = Formula("y ~ x - 1"), between model_matrix("x", df) and model_matrix(f.rhs, df)

My opinion:

  1. Yes -> the first one is with intercept, the second one without
  2. No -> see point 1.
matthewwardrop commented 2 years ago

Hi again @teucer ,

The bug is now fixed, and I can roll out a new version for you if this is blocking you; otherwise I'll wait a few more days so it can also include some categorical encoding enhancements.

Your proposal is not unreasonable, but for consistency with expectations set by patsy and R, I'm going to make the executive choice not to treat ~x and x differently.

However, formulaic does (under the hood) support suppression of the addition of the intercept. For example, you could do:

>>> from formulaic import Formula
>>> from formulaic.parser import FormulaParser
>>> Formula(FormulaParser().get_terms("x", include_intercept=False))
x

It might be sensible to just add the include_intercept option in the Formula constructor, if that would be helpful to you.

teucer commented 2 years ago

Having include_intercept in the constructor would be really helpful.

matthewwardrop commented 1 year ago

Hi again!

As we move toward 1.0.0 I'm going back through all the issues and deciding things one way or another. Here, I'm deciding not to include this in the Formula constructor; however, since this issue was created it was moved to the parser constructor. So you can now do:

from formulaic import Formula
from formulaic.parser import DefaultFormulaParser
Formula("x", _parser=DefaultFormulaParser(include_intercept=False))

Due to modularity, and conflation with the Structured nature of Formula, this is no longer something I think would be a good idea.

Thanks again for sharing your thoughts and helping make Formulaic great!

I'll close this one out for now.