matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

Add required variables to the `Formula` class #179

Open timpiperseek opened 4 months ago

timpiperseek commented 4 months ago

I would like to be able to do something like the following. Apologies, I am struggling to articulate what I want, but effectively it is this.

Say I have the following formula: apps ~ prior_apps + I(prior_apps^2) + factor + I(prior_apps:factor)

I am wondering if it is possible to extract the rhs terms from the formula. By terms I mean ['prior_apps', 'factor'].

I have tried doing the following.

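import formulaic.parser
formula_str = "apps ~ prior_apps + I(prior_apps^2) + factor + I(prior_apps:factor)"
formula_parser = formulaic.parser.DefaultFormulaParser()
# get_tokens yields the individual tokens of the formula string
tokens = list(formula_parser.get_tokens(formula_str))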

but that gets me the individual parts of the string and not the terms.

I feel like it should be possible?

matthewwardrop commented 3 months ago

Hi @timpiperseek ,

Does something like the following work?

from formulaic import Formula
f = Formula("apps ~ prior_apps + I(prior_apps**2) + factor + prior_apps:factor")
set(
    factor
    for term in f.rhs
    for factor in term.factors
)
# This would output all the factors: {1, I(prior_apps ** 2), factor, prior_apps}

(Note that interaction terms should not be enclosed in "I(...)": the contents of "I(...)" are treated as a Python function call, and "prior_apps:factor" is not a valid Python expression.)

If you need to, you could parse the AST represented by the non-lookup factors (e.g. I(prior_apps ** 2)) to extract the variables used; prior_apps here.
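As a rough illustration of what parsing that AST could look like, here is a minimal sketch that uses only Python's standard-library ast module. It is not formulaic's own implementation. The helper name names_in_expression is hypothetical, and it assumes the factor's expression is available as a plain Python source string. It also treats the base name of an attribute call (e.g. "np" in "np.log(x)") as a variable, which may not be what you want.

```python
import ast
from typing import Set


def names_in_expression(expr: str) -> Set[str]:
    """Return the names read as values in a Python expression string.

    Hypothetical helper for illustration only; not part of formulaic.
    """
    tree = ast.parse(expr, mode="eval")
    # Name nodes that are the direct target of a call, e.g. `I` in `I(x ** 2)`.
    call_targets = {
        id(node.func)
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    # Every other Name node is assumed to refer to a data variable.
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and id(node) not in call_targets
    }


print(sorted(names_in_expression("I(prior_apps ** 2)")))  # ['prior_apps']
```

Applied to each non-lookup factor on the right-hand side, something like this would reduce I(prior_apps ** 2) to prior_apps.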

If you are actually just looking for the terms, you can do: list(f.rhs) == [1, prior_apps, I(prior_apps ** 2), factor, prior_apps:factor].

Does that help?

timpiperseek commented 3 months ago

Yeah, that is really close to what I am after.

What do you mean by:

> If you need to, you could parse the AST represented by the non-lookup factors (e.g. I(prior_apps ** 2)) to extract the variables used; prior_apps here.

I ask because ideally it would also identify that prior_apps**2 is the same underlying metric as prior_apps.

matthewwardrop commented 3 months ago

Ah... Using some internal utility functions you can do:

from formulaic import Formula
from formulaic.utils.variables import get_expression_variables
f = Formula("apps ~ prior_apps + I(prior_apps**2) + factor + prior_apps:factor")
set(
    variable
    for term in f.rhs
    for factor in term.factors
    for variable in get_expression_variables(factor.expr, {})
    if "value" in variable.roles
)
# Outputs: {'factor', 'prior_apps'}

Note that get_expression_variables parses the AST of the Python expression; formulaic uses it internally to keep track of which variables were used when generating the model matrix.

timpiperseek commented 3 months ago

Oh that is absolutely awesome, thank you.

matthewwardrop commented 3 months ago

I'll consider adding this directly to the `Formula` class as something like .required_variables.

mayer79 commented 1 week ago

This would indeed be very handy, thx.