Open bashtage opened 5 years ago
patsy has an extensible parser – see infix_parser.py
and parse_formula.py
. It generates an abstract syntax tree, and then there's an interpreter that executes the AST to generate term lists – see desc.py
.
Both parts are extensible in principle – being able to support use cases like yours was included in the original design. parse_formula
is driven by an operator table, and it even takes an argument to allow extra operators to be added. Evaluator
is driven by a table too, and has some pieces included with the eye towards allowing syntax extensibility. And then you could take out the term lists + your special representations, and do whatever you want with them (e.g. run the term lists through patsy's matrix builder, while you handle the other parts in whatever way you want).
The parser currently uses the shunting-yard algorithm which is specialized for easily parsing infix arithmetic with precedence and parentheses, so it's easiest to extend by adding new operators, but it could potentially be extended to handle other things like your special [
grouping, or magic function calls like A(...)
.
But... none of this has ever actually been used, so it probably doesn't quite work. And unfortunately patsy has been essentially unmaintained for years now, so I'm not sure how you'd get there from here. If you have cash to burn I guess you could hire me...
Thanks. I'll see if I can figure out anything. Assuming I can't (which is reasonable), I can always use a half-way mixed interface with formula for the dense part and a list for the sparse.
I have written a function called
AbsorbingLS
that can absorb a large number (millions) of categorical variables or categorical interactions. It is implemented using a Frisch-Waugh-Lovell step where the categoricals are handled using scipy sparse matrices. I would like to add a formula interface. Suppose I have a functionA()
that indicates that a variable should be absorbed, is there any place to intervene in the formula parsing for a formula that looks likey ~ 1 + x + A(cat) + A(cat*x)
?I can use another syntax. In an instrumental variable regression I use the syntax
y ~ 1 + x1 + x2 + [x3 ~ z1 + z2]
which is used to determine the configuration of the 2 required regressions. This works fine since it is easy to parse the[]
and then it is a couple of standard calls. This approach doesn't obviously work here since I must avoid creating any arrays. I could use a similar structure here, so something likey ~ 1 + x + {cat +cat*x}
({}
for simplicity in parsing) which I would then need to find a good way to turncat +cat*x
into usable terms (w/o populating dense arrays).Any suggestions on how I could write a formula where I could intercept it using something like the pseudocode
Any suggestions are appreciated.