matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License
313 stars 21 forks source link

Proposal: Add `commutative` to `Operator` #36

Open tomicapretto opened 2 years ago

tomicapretto commented 2 years ago

34 Adds support for random effects via the pipe operator. While going through the example in the PR I noticed that operations that are written as x2|m were then printed as m:x2 because formulaic sorts the factors within a term. Apart from the : that should be a | (this is not a major issue), order here does matter, so printing x2|m is not the same as printing m|x2.

This issue suggests that the Operator class could have a commutative property that indicates whether you can change the order of the operators or not.

But maybe there's a better approach? Let's have this space for discussion.

matthewwardrop commented 2 years ago

Hi @tomicapretto! I thought about this a bit over the weekend, and here is my proposal. Let me know what you think.

Rather than adding a commutative option to Operator objects, I propose to instead allow Term instances to override how resulting columns are named. This would allow you to subclass Term to say RandomEffectTerm. Random effect operators would then generated RandomEffectTerm instances instead of Term instances, but honour the same API. The result might look a bit like this (untested code):

data = pandas.DataFrame({'a': [1,2,3], 'b': ['a','b','c']})

class RandomEffectTerm(Term):
    def __init__(group_factors, group):
        self.group_factors = set(group_factors)
        self.group = group

    @property
    def factors(self):
         return self.group_factors | {self.group}

    def get_column_name(self, features):
         # features would look something like:
         # {'a': 'a', 'b': 'b[T.a]'}
         return ":".join(sorted(v for k, v in features.values if k != self.group.expr)) + "|" + features[self.group.expr]

model_matrix("(a|b)") -> columns of {"a|b[T.a]", "a|b[T.b]", ...}

If this looks good to you, I'll code it up when I get a chance. The last piece would then be to formalise how to surface the different segments of the model matrix (or perhaps how to store multiple of them), in order to get the "fixed effect" and "random effect" parts... but we can deal with that as a separate thread.