matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License
313 stars 21 forks source link

Is it possible to define custom operators? #177

Open s3alfisc opened 4 months ago

s3alfisc commented 4 months ago

Hi @matthewwardrop,

For pyfixest, @Wenzhi-Ding and I are currently discussing to add more syntactic formula sugar. For example, the original R package comes with a custom operator for interacting variables i() that slightly differs from C() as it allows to "drop" reference level columns from the model matrix, or operators for multiple estimation, which I have implemented in a very clunky and ad hoc way in pyfixest's FormulaParser. Eventually I'd like to revisit this part of the code (hopefully rather sooner than later as I am really not too proud) and am wondering if it is possible to easily integrate new "formula operators"? There are some hints in the docs and codebase that suggest that this might not be an impossible task 😄

As a more concrete example, would it be possible to e.g. introduce a new operator varlist that would evaluate

model_matrix("Y ~ varlist(X*))", data)

to Y ~ X1 + X2 + ... + Xk for all k variables in data that start with X by ourselves without "ad hoc" formula parsing on our end? Or would you recommend that we should stick with "ad hoc formula parsing"?

Please feel free to just tell me to take a closer look at the docs if appropriate =)

matthewwardrop commented 3 months ago

Hi @s3alfisc ,

Thanks for looping me in. Currently this is not directly possible in stock Formulaic :(. The closest you can do today is something like:

import pandas
import re

from formulaic import Formula
from formulaic.utils.stateful_transforms import stateful_transform

@stateful_transform
def varlist(pattern, _context=None):
    pattern = re.compile(pattern)
    return {
        variable: values
        for variable, values in _context.named_layers.get("data", {}).items()
        if pattern.match(variable)
    }

Formula("varlist('X.*')").get_model_matrix(pandas.DataFrame({"X1": [1,2,3], "X2": [1,2,3]}), context={"varlist": varlist})

which would output: image

This is equivalent to the additive terms you demonstrate above (with ugly naming), but would not work so well for interactions, since the varlist('X.*') when multiplied by itself would collapse, and even if you aliased varlist so you had two versions it, the materialized product would include duplicate X* features and cross-products like X*:X*.

Thinking through how this could be improved by additions to Formulaic: we are limited by the fact that the formula parser intentionally has no awareness of the dataset for which model matrices will be generated later on. So what we would need is support for rewriting formulae during materialization. Since we evaluate all of the factors prior to substituting them, we could for example return a new nested Formula as the output of a transform; which we then expand and evaluate the factors recursively until things resolve. To avoid the collapsing issue we would need to add a new syntax like y ~ !varlist('X*') * !varlist('X*'), where the ! indicates that the factors should not be collapsed during formula parsing. These nested formulae would be "expanded" into the parent formulae, and so naming would be a lot cleaner. The generated model matrices and specs would have no idea that the expansion happened (we would just hard-code the new formula into the specs). So... that would work, and wouldn't be too hard to implement... but does increase the complexity. Next question.... is it worth it? Would it be helpful?

matthewwardrop commented 3 months ago

A slightly less general variant of the above is to add specific syntax for this kind of operation. Something like:

y ~ {:X.*}

Where we leverage the existing Python code quoting and special case "Python" snippets that start with ! in much the way we describe above, but specifically for this expansion purpose. This is somewhat thematically aligned with #175 , where it is desired that we add the . operator that expands to all unused features in the data.