matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License
346 stars 25 forks source link

LinearConstraint parser still fails on Categorical variables #215

Open bashtage opened 5 hours ago

bashtage commented 5 hours ago

On main I'm still seeing failures in the code below. Having this work correctly on terms like C(var)[T.value] is practically an important case since C() is common when [T.value] is present. The tests only look at cases like A[T.value].

import formulaic

constraints = "C(agecat)[T.4] = C(agecat)[T.5]"
variable_names = [
    "Intercept",
    "C(agecat)[T.2]",
    "C(agecat)[T.3]",
    "C(agecat)[T.4]",
    "C(agecat)[T.5]",
    "logpyears",
    "smokes",
]

lc = formulaic.utils.constraints.LinearConstraints.from_spec(
    constraints, variable_names=variable_names
)

Exception is

---------------------------------------------------------------------------
FormulaSyntaxError                        Traceback (most recent call last)
Cell In[2], line 14
      3 constraints = "C(agecat)[T.4] = C(agecat)[T.5]"
      4 variable_names = [
      5     "Intercept",
      6     "C(agecat)[T.2]",
   (...)
     11     "smokes",
     12 ]
---> 14 lc = formulaic.utils.constraints.LinearConstraints.from_spec(
     15     constraints, variable_names=variable_names
     16 )

File C:\git\formulaic\formulaic\utils\constraints.py:102, in LinearConstraints.from_spec(cls, spec, variable_names)
     98     spec = ",".join(spec)
     99 if isinstance(spec, str):
    100     matrix, values = LinearConstraintParser(
    101         variable_names=variable_names
--> 102     ).get_matrix(spec)
    103     return cls(matrix, values, variable_names)
    104 matrices, constants = [], []

File C:\git\formulaic\formulaic\utils\constraints.py:277, in LinearConstraintParser.get_matrix(self, formula)
    263 def get_matrix(
    264     self, formula: str
    265 ) -> Tuple["numpy.typing.ArrayLike", "numpy.typing.ArrayLike"]:
    266     """
    267     Build the constraint matrix and constraint values vector associated with
    268     the parsed string.
   (...)
    275         A tuple of the contraint matrix and constraint values respectively.
    276     """
--> 277     constraints = self.get_terms(formula)
    278     if not constraints:
    279         return numpy.empty((0, len(self.variable_names))), numpy.array([])

File C:\git\formulaic\formulaic\utils\constraints.py:255, in LinearConstraintParser.get_terms(self, formula)
    246 def get_terms(
    247     self, formula: str
    248 ) -> Union[None, List[ScaledFactor], Tuple[List[ScaledFactor], ...]]:
    249     """
    250     Build the `ScaledFactor` instances for a constraint formula string.
    251
    252     Args:
    253         formula: The constraint formula for which to build terms.
    254     """
--> 255     ast = self.get_ast(formula)
    256     if not ast:
    257         return None

File C:\git\formulaic\formulaic\utils\constraints.py:240, in LinearConstraintParser.get_ast(self, formula)
    230 def get_ast(self, formula: str) -> Optional[ASTNode]:
    231     """
    232     Assemble an abstract syntax tree for the nominated `formula` string.
    233
   (...)
    236             generated.
    237     """
    238     return cast(
    239         Optional[ASTNode],
--> 240         tokens_to_ast(
    241             self.get_tokens(formula),
    242             operator_resolver=self.operator_resolver,
    243         ),
    244     )

File C:\git\formulaic\formulaic\parser\algos\tokens_to_ast.py:137, in tokens_to_ast(tokens, operator_resolver)
    135 if output_queue:
    136     if len(output_queue) > 1:
--> 137         raise exc_for_missing_operator(output_queue[0], output_queue[1])
    138     return output_queue[0]
    140 return None

FormulaSyntaxError: Missing operator between `C(agecat)` and `T.4`.

⧛C(agecat)[T.4⧚] = C(agecat)[T.5]
bashtage commented 5 hours ago

Sorry to report that the previous patch didn't quite get to the full set of issues.