matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

Make a `Q` operator that behaves like patsy's Q. #115

Status: Closed (rchui closed 1 year ago)

rchui commented 2 years ago

The formula grammars for patsy and formulaic are close but not 1:1. It would be great if terms like Q('...') could automatically be converted into `...`, etc. This would ease migration from patsy to formulaic for users with existing formulae.

matthewwardrop commented 2 years ago

Hi again @rchui ,

Thanks for thinking through the obstacles that one might face when migrating to formulaic from patsy; this is not an unreasonable request. I'm also midway through improving the documentation and adding an explicit guide on migrating from patsy. I'm definitely not a big fan of the Q('col name') syntax, though (and once added, I'd need to support it forever!), so I want to be thoughtful in how I approach this.

Technically, we'd implement this as a stateful transform, and surface it just as patsy does (as a function that's available in the formula namespace). Thus, a formula like: Q('wacky name!') would generate a term labelled Q('wacky name!'), and the Q transform would just select out the appropriate column from the data/context. No actual conversion to the new syntax would take place, we'd just support both.
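To make the mechanism described here concrete, a minimal sketch in plain Python follows. It is not formulaic's implementation: `make_q` and `evaluate_term` are hypothetical names, and the data is assumed to be a simple mapping from column name to values. It only illustrates that the term keeps its original label while `Q` selects the column from the data.

```python
def make_q(data):
    """Return a hypothetical Q helper bound to `data` (column name -> values)."""
    def Q(name):
        try:
            return data[name]
        except KeyError:
            raise NameError(f"no column named {name!r}") from None
    return Q


def evaluate_term(term, data):
    """Evaluate one term string with Q exposed in its namespace.

    Returns (label, values). The label is the term text itself, so
    "Q('wacky name!')" stays labelled exactly that; nothing is rewritten
    into another syntax.
    """
    # Columns whose names are valid identifiers can be referenced directly.
    namespace = {k: v for k, v in data.items() if k.isidentifier()}
    # Q is just another name available to the term (assumption: terms are
    # evaluated as Python expressions, as in this toy evaluator).
    namespace["Q"] = make_q(data)
    return term, eval(term, {"__builtins__": {}}, namespace)


data = {"wacky name!": [1, 2, 3], "x": [4, 5, 6]}
label, values = evaluate_term("Q('wacky name!')", data)
```

Here `label` is the unchanged string `Q('wacky name!')` and `values` is the column stored under `"wacky name!"`, which matches the "support both, convert nothing" behaviour described above.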

I guess the question is how useful this would be, and whether it is worth supporting these older formula grammars forever. I think there are three different ways we could take this:

  1. We think the patsy syntax is better and/or at least good enough to add to the formula specification in perpetuity, and literally add the Q method (and perhaps some other things) to the formula grammar.
  2. We think it is worth adding but not at the cost of cluttering the current formula grammar. We would add a "plugin" pack of compatibility shims that we enable when explicitly told to support patsy syntax. There may be other things to do here, like wrapping the contrast specification. This could even extend as far as mirroring the old patsy dmatrix-like API... but this is likely not worth it.
  3. We do nothing and just understand there will be a period of transition. If you want the old patsy formula grammar, use patsy. When you are ready to migrate, use Formulaic's grammar.

What are your thoughts here?

rchui commented 2 years ago

At first glance, I'm not sure this is the approach that I would use. My inclination would be an API that extends the Formula class with a classmethod and has it apply the stateful transform under the covers, i.e.:

```python
import formulaic as fm

# Proposed API: `from_patsy` is the suggested classmethod, not an existing one.
fm.Formula.from_patsy(...).get_model_matrix(...)
```

This way it isn't in your "mainline" logic path, and it forces the user to be intentional about deriving a formulaic formula from a patsy formula. Using a from_* constructor is a well-used pattern in the Python space, and I think many would find it natural when coming from pandas, etc.
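As a rough illustration of the `from_*` idea, here is one way an explicit conversion step could look. This is a sketch under stated assumptions, not formulaic's API: `convert_patsy_quotes` and `FormulaSketch` are hypothetical, and instead of a stateful transform it uses a simple regex that rewrites `Q('...')` into backtick-quoted names. It does not handle escaped quotes inside the name.

```python
import re

# Matches Q('name') or Q("name") when Q is a standalone call
# (not part of a longer identifier or an attribute access).
_PATSY_Q = re.compile(r"""(?<![\w.])Q\(\s*(['"])(.*?)\1\s*\)""")


def convert_patsy_quotes(formula: str) -> str:
    """Rewrite patsy-style Q('col name') into backtick-quoted `col name`."""
    def _to_backticks(match):
        name = match.group(2)
        if "`" in name:
            # Backticks cannot be represented inside a backtick-quoted name here.
            raise ValueError(f"cannot convert column name {name!r}")
        return f"`{name}`"
    return _PATSY_Q.sub(_to_backticks, formula)


class FormulaSketch:
    """Hypothetical stand-in for a Formula class; it only stores the spec string."""

    def __init__(self, spec: str):
        self.spec = spec

    @classmethod
    def from_patsy(cls, spec: str) -> "FormulaSketch":
        # The conversion happens only when the caller explicitly asks for it.
        return cls(convert_patsy_quotes(spec))


converted = FormulaSketch.from_patsy("y ~ Q('wacky name!') + x").spec
```

With this sketch, `converted` is ``y ~ `wacky name!` + x``, and formulae without `Q(...)` pass through unchanged, so the compatibility logic stays out of the main parsing path.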

matthewwardrop commented 1 year ago

I've updated the linked PR and brought it in as a standard patsy-compat transform. Rather than bifurcating formulaic into two different formula languages, it now just has explicit patsy shims, which people will likely migrate from in time.

Thanks for reaching out about this! And let me know if this doesn't work for you once the PR is merged!