matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

Add support for multi-stage formulas. #24

Open matthewwardrop opened 4 years ago

matthewwardrop commented 4 years ago

In some of my work I am interested in exploring two-stage least-squares regression on sparse data, and thus in making Formulaic able to handle it nicely.

My plan is to allow formulas of the form:

y ~ a + [b + c ~ z1 + z2] | a + [e + f ~ z1 + z2] | d + [b + c ~ z1 + z2] | d + [e + f ~ z1 + z2]

In my proposed grammar, this would also be equivalent to:

y ~ (a|d) + [b + c | e + f ~ z1 + z2]

Using multipart syntax in the rhs of nested formulas would be forbidden.
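To make the claimed equivalence concrete, here is a minimal sketch of how the compact form expands into the four explicit parts. It works on plain strings only; `expand_parts` is a hypothetical helper for illustration and is not part of Formulaic.

```python
from itertools import product


def expand_parts(exog_parts, nested_lhs_parts, instruments):
    """Cross every exogenous part with every nested left-hand-side part."""
    return [
        f"{exog} + [{endog} ~ {instruments}]"
        for exog, endog in product(exog_parts, nested_lhs_parts)
    ]


# Compact form: y ~ (a|d) + [b + c | e + f ~ z1 + z2]
parts = expand_parts(["a", "d"], ["b + c", "e + f"], "z1 + z2")
expanded = "y ~ " + " | ".join(parts)
print(expanded)
```

The order of the expanded parts (all nested parts for `a` first, then all for `d`) follows the order written out in the proposal.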

The API for accessing the various pieces of this Formula is as yet not fully fleshed out, and naming has not been properly considered, but would be something like:

f = Formula('y ~ (a|d) + [b + c | e + f ~ z1 + z2]')
f.formula_for(rhs_part=0, stage=0)  # b + c ~ z1 + z2
f.formula_for(rhs_part=0, stage=1)  # y ~ a + b + c
f.formula_for(rhs_part=1, stage=0)  # e + f ~ z1 + z2
f.formula_for(rhs_part=1, stage=1)  # y ~ a + e + f
f.formula_for(rhs_part=0) # y ~ a + [b + c ~ z1 + z2]

f = Formula('y ~ x + z')
f.formula_for() # y ~ x + z

On a multipart formula like this one, calls to get_model_matrix will need to specify the part and stage for which the model matrix should be generated. If there is only one part or stage, this will not be necessary. Formulaic explicitly will not attempt to do any modeling with this, and will expect users of the library to do any memoisation that is required for two-stage least-squares to work when pumping new data sets through a pre-trained model.
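As a rough illustration of the memoisation left to downstream users, the sketch below fits both stages once on toy data and then reuses the stored coefficients for new data. It assumes a single instrument and a single endogenous regressor, uses only the standard library, and `fit_line` and `predict` are hypothetical helpers rather than Formulaic functions.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def predict(coef, x):
    """Apply stored (intercept, slope) coefficients to new values."""
    intercept, slope = coef
    return [intercept + slope * a for a in x]


# Toy training data: z is the instrument, b the endogenous regressor.
z = [0.0, 1.0, 2.0, 3.0]
b = [1.0, 3.0, 5.0, 7.0]
y = [2.0, 8.0, 14.0, 20.0]

stage0 = fit_line(z, b)                   # stage 0: b ~ z
stage1 = fit_line(predict(stage0, z), y)  # stage 1: y ~ fitted b

# New data reuses the memoised coefficients instead of refitting.
z_new = [4.0]
y_new = predict(stage1, predict(stage0, z_new))
```

The point of the design is that the stage 0 coefficients have to be kept alongside the stage 1 model, since new data must pass through both.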

I'm especially keen to know what @bashtage thinks about this, given that this is something he has explored a lot more in linearmodels.

bashtage commented 4 years ago

What is the intention of the first formula? What is exogenous and what is endogenous? Clearly the Z are instruments.

matthewwardrop commented 2 years ago

Returning to this after several years :sweat: .

Multi-part formulas are already implemented as of v0.3.0: y ~ a | b | c does the right thing.

@bashtage : If I were to take this further, I'd look to implement something like: y ~ 1 + x1 + x2 + x3 + [ x4 + x5 ~ z1 + z2 + z3], exactly as you have done here. The results would be made available on the Structured instance as something like:

.lhs
    y
.rhs
    1 + x1 + x2 + x3 + IV[x4] + IV[x5]
    .iv_x4:
        z1 + z2 + z3
    .iv_x5:
        z1 + z2 + z3

This is within reach of the parser now, but I'd love your take on this (given that you have much more experience in this space).
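A minimal sketch of the structure described above, built from the formula string with a regular expression. This is only a stand-in for what a parser might produce: `split_iv_formula` and the dictionary layout are assumptions for illustration, not Formulaic's `Structured` API.

```python
import re


def split_iv_formula(formula):
    """Split 'y ~ exog + [endog ~ instruments]' into a flat dictionary."""
    lhs, rhs = (side.strip() for side in formula.split("~", 1))
    match = re.search(r"\[\s*(.*?)\s*~\s*(.*?)\s*\]", rhs)
    if match is None:
        return {"lhs": lhs, "rhs": rhs}
    endog = [term.strip() for term in match.group(1).split("+")]
    instruments = match.group(2)
    # Everything before the bracket is the exogenous part of the rhs.
    exog = rhs[: match.start()].strip().rstrip("+").strip()
    result = {
        "lhs": lhs,
        "rhs": " + ".join([exog] + [f"IV[{name}]" for name in endog]),
    }
    for name in endog:
        result[f"iv_{name}"] = instruments
    return result


parsed = split_iv_formula("y ~ 1 + x1 + x2 + x3 + [ x4 + x5 ~ z1 + z2 + z3]")
print(parsed)
```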

bashtage commented 2 years ago

An advanced syntax would be great. I have a few current uses.

  1. IV like you have above.
  2. Absorbing regression where high dimensional fixed effects are absorbed. Something like y ~ x + [eff1 + eff2 + eff3] where eff# are categorical variables usually that are then encoded to sparse arrays.
  3. Systems of equations. I currently use a dictionary. These models have multiple equations, something like y1 ~ x + z, y2 ~ x + w. Not sure if something like this would make sense to have as a syntax.

matthewwardrop commented 2 years ago

Nice. I don't yet know how much it makes sense to always have these advanced operators in place (versus having a family of parsers that extend some common set), but I'll definitely be working toward making the parser capable of generating formulae for these kinds of situations.

For further clarity: On 2. Absorbing regression is just your usual fixed-effects regression, right? Where you demean the data based on a set of covariates prior to modelling, perhaps using another regression? What would you want output in that case? Something like:

.lhs
    y_residuals
    .fixed_effects
         eff1 + eff2 + eff3
.rhs
    x

On 3. Would a Structured instance of a tuple of formulas work? That could be implemented trivially today (either in formulaic or downstream by adding the , operator):

[0]
    .lhs
        y1
    .rhs
        x + z
[1]
    ...
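A minimal sketch of the tuple-of-formulas idea, using plain strings. `split_system` is a hypothetical helper and not an existing Formulaic operator; the naive comma split would also break on terms that contain commas, such as function calls with several arguments.

```python
def split_system(spec):
    """Split 'y1 ~ ..., y2 ~ ...' into a tuple of lhs/rhs dictionaries."""
    equations = []
    for equation in spec.split(","):
        lhs, rhs = (side.strip() for side in equation.split("~", 1))
        equations.append({"lhs": lhs, "rhs": rhs})
    return tuple(equations)


system = split_system("y1 ~ x + z, y2 ~ x + w")
print(system[0])
```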
bashtage commented 2 years ago

I haven't really thought about it. I could imagine that formulas could be nested. For example,

y ~ 1 + x + [w ~ z]

could be something like

.lhs
   y
.rhs
   1 + x + [w ~ z]

and when you access .rhs it would be [1{Term}, x{Term}, [w ~ z]{Formula}] so that one could handle nested formulas with some recursion, e.g.

for term_or_fmla in formula.rhs.terms:
    if isinstance(term_or_fmla, Term):
        ...  # Do something with the plain term.
    else:
        ...  # Handle the nested formula, probably using recursion.

Maybe too complicated.
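The recursion suggested above could look roughly like the following. The `Term` and `Formula` classes here are hypothetical stand-ins defined only for this sketch; they are not Formulaic's own classes.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Term:
    name: str


@dataclass
class Formula:
    lhs: List["Node"]
    rhs: List["Node"]


Node = Union[Term, Formula]


def collect_names(formula, depth=0):
    """Return (depth, name) pairs, recursing into nested formulas."""
    found = []
    for node in formula.lhs + formula.rhs:
        if isinstance(node, Term):
            found.append((depth, node.name))
        else:
            found.extend(collect_names(node, depth + 1))
    return found


# y ~ 1 + x + [w ~ z]
nested = Formula(lhs=[Term("w")], rhs=[Term("z")])
outer = Formula(lhs=[Term("y")], rhs=[Term("1"), Term("x"), nested])
names = collect_names(outer)
print(names)
```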

GuiMarthe commented 6 months ago

Just to add to this style of syntax, mlogit uses something similar for multinomial choice models. Not saying it should be implemented here, but it is another use case for the | syntax. In that literature, y ~ x | z | w is the notation used for the anatomy of utility functions and translates to

choice ~ alternative vars. with generic coefficients |
                individual vars. with specific coefficients |
                alternative vars. with specific coefficients
matthewwardrop commented 6 months ago

@GuiMarthe This is actually already implemented in Formulaic (leaving the interpretation to the calling library).

The wrapping library would then just need to validate that the formula has the expected structure (it could also, if desired, disable the intercept additions in the formula parser).
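A minimal sketch of the kind of structural check a wrapping library might run. It works on the raw formula string so that it stays self-contained; a real wrapper would more likely inspect the parsed formula. `validate_parts` is a hypothetical helper, not part of Formulaic or mlogit.

```python
def validate_parts(formula, expected_parts=3):
    """Check that a formula has a lhs and the expected number of rhs parts."""
    lhs, sep, rhs = formula.partition("~")
    if not sep or not lhs.strip():
        raise ValueError("expected a formula of the form 'choice ~ ...'")
    parts = [part.strip() for part in rhs.split("|")]
    if len(parts) != expected_parts:
        raise ValueError(
            f"expected {expected_parts} right-hand-side parts, got {len(parts)}"
        )
    return lhs.strip(), parts


choice, utility_parts = validate_parts("y ~ x | z | w")
print(choice, utility_parts)
```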