pydata / patsy

Describing statistical models in Python using symbolic formulas
Other
956 stars 104 forks source link

Deferring Evaluation of Terms #149

Open bashtage opened 5 years ago

bashtage commented 5 years ago

I have written a function called AbsorbingLS that can absorb a large number (millions) of categorical variables or categorical interactions. It is implemented using a Frisch-Waugh-Lovell step where the categoricals are handled using scipy sparse matrices. I would like to add a formula interface. Suppose I have a function A() that indicates that a variable should be absorbed, is there any place to intervene in the formula parsing for a formula that looks like y ~ 1 + x + A(cat) + A(cat*x)?

I can use another syntax. In an instrumental variable regression I use the syntax y ~ 1 + x1 + x2 + [x3 ~ z1 + z2] which is used to determine the configuration of the 2 required regressions. This works fine since it is easy to parse the [] and then it is a couple of standard calls. This approach doesn't obviously work here since I must avoid creating any arrays. I could use a similar structure here, so something like y ~ 1 + x + {cat +cat*x} ({} for simplicity in parsing) which I would then need to find a good way to turn cat +cat*x into usable terms (w/o populating dense arrays).

Any suggestions on how I could write a formula where I could intercept it using something like the pseudocode

1. Patsy  parses to terms
2. I remove and terms that are absorbed, which are too large to express as dense arrays
3. Patsy parses the non-absorbed terms, which have reasonable sizes as dense matrices

Any suggestions are appreciated.

njsmith commented 5 years ago

patsy has an extensible parser – see infix_parser.py and parse_formula.py. It generates an abstract syntax tree, and then there's an interpreter that executes the AST to generate term lists – see desc.py.

Both parts are extensible in principle – being able to support use cases like yours was included in the original design. parse_formula is driven by an operator table, and it even takes an argument to allow extra operators to be added. Evaluator is driven by a table too, and has some pieces included with the eye towards allowing syntax extensibility. And then you could take out the term lists + your special representations, and do whatever you want with them (e.g. run the term lists through patsy's matrix builder, while you handle the other parts in whatever way you want).

The parser currently uses the shunting-yard algorithm which is specialized for easily parsing infix arithmetic with precedence and parentheses, so it's easiest to extend by adding new operators, but it could potentially be extended to handle other things like your special [ grouping, or magic function calls like A(...).

But... none of this has ever actually been used, so it probably doesn't quite work. And unfortunately patsy has been essentially unmaintained for years now, so I'm not sure how you'd get there from here. If you have cash to burn I guess you could hire me...

bashtage commented 5 years ago

Thanks. I'll see if I can figure out anything. Assuming I can't (which is reasonable), I can always use a half-way mixed interface with formula for the dense part and a list for the sparse.