matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

ENH: Add 'grouped' _ordering method #212

Open bashtage opened 15 hours ago

bashtage commented 15 hours ago

Add ordering method that sorts by degree but not by variable name
Add test for method

bashtage commented 15 hours ago

I couldn't match patsy in all cases. I think this should get it. I am 100% not wed to the name, and happy to change it to anything that makes sense. I sort of liked "sort-groups", but don't really like hyphens. Another thought was "degree-only", but same problem. With the existing methods I can match patsy in both of the examples below, but not using the same _ordering.

This is the problem that I had with the existing ordering methods:

formula = "TOTEMP ~ GNPDEFL + GNP + UNEMP + ARMED + POP + YEAR"
patsy
Variable names
['Intercept', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

formulaic, _ordering: sort
Variable names
['Intercept', 'ARMED', 'GNP', 'GNPDEFL', 'POP', 'UNEMP', 'YEAR']

formulaic, _ordering: degree
Variable names
['Intercept', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

formulaic, _ordering: none
Variable names
['Intercept', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

formula = "deaths ~ logpyears + smokes + C(agecat)"

patsy
Variable names
['Intercept', 'C(agecat)[T.2]', 'C(agecat)[T.3]', 'C(agecat)[T.4]', 'C(agecat)[T.5]', 'logpyears', 'smokes']

formulaic, _ordering: none
Variable names
['Intercept', 'logpyears', 'smokes', 'C(agecat)[T.2]', 'C(agecat)[T.3]', 'C(agecat)[T.4]', 'C(agecat)[T.5]']

formulaic, _ordering: sort
Variable names
['Intercept', 'C(agecat)[T.2]', 'C(agecat)[T.3]', 'C(agecat)[T.4]', 'C(agecat)[T.5]', 'logpyears', 'smokes']

formulaic, _ordering: degree
Variable names
['Intercept', 'logpyears', 'smokes', 'C(agecat)[T.2]', 'C(agecat)[T.3]', 'C(agecat)[T.4]', 'C(agecat)[T.5]']

bashtage commented 14 hours ago

Converted to a draft as this doesn't solve my problem. I need to look a bit deeper as to how patsy orders variables, especially with respect to categoricals.

bashtage commented 9 hours ago

I've looked into this a bit today and it seems that it isn't really possible to achieve patsy's ordering in the current structure, where _ordering is part of the formula. Patsy knows what type each variable is when it decides the order. This is how it can reliably order the intercept first, then categoricals (including interactions, in degree order), then continuous variables (including dummy-variable interactions, again in degree order).
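
To make the rule concrete, here is a minimal sketch of the ordering as described above. It is not patsy's or formulaic's actual code: the function name `patsy_like_order`, the representation of a term as a tuple of factor names, and the explicit `categorical` set are all assumptions for illustration. It only follows the simplified description in this comment (intercept, then categorical-only terms by degree, then terms involving a continuous variable by degree), and relies on Python's stable sort to keep formula order among ties.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical representation: a term is a tuple of factor names,
# and the empty tuple () stands for the intercept.
Term = Tuple[str, ...]


def patsy_like_order(terms: Iterable[Term], categorical: Set[str]) -> List[Term]:
    """Order terms: categorical-only first (the intercept has degree 0),
    then terms with any continuous factor; each group sorted by degree."""

    def key(term: Term) -> Tuple[bool, int]:
        has_continuous = any(factor not in categorical for factor in term)
        return (has_continuous, len(term))

    # sorted() is stable, so terms with equal keys keep their formula order.
    return sorted(terms, key=key)


# The second example from this thread, in the order written in the formula.
terms = [(), ("logpyears",), ("smokes",), ("C(agecat)",)]
ordered = patsy_like_order(terms, categorical={"C(agecat)"})
print(ordered)
```

The key point is that `categorical` has to be known before sorting, which is exactly the information that is not available when _ordering is applied at the formula level.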

Any ideas on how we could try to address this, even if it was in statsmodels? Could we reorder variables in a rendered model somehow, and then re-render if the order changes?
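
A rough sketch of what the post-render reordering could look like, purely as an illustration. Everything here is hypothetical: the helper `reorder_columns`, the input of `(column name, term factors)` pairs, and the `categorical` set are assumptions, and nothing below calls formulaic or statsmodels. It computes a column permutation and reports whether the order changed, which is the signal that would trigger a re-render.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical input: each rendered column paired with the factors of its term.
Column = Tuple[str, Tuple[str, ...]]


def reorder_columns(
    columns: Sequence[Column], categorical: Set[str]
) -> Tuple[List[int], bool]:
    """Return (permutation of column indices, whether the order changed)."""

    def key(index: int) -> Tuple[bool, int]:
        _, factors = columns[index]
        has_continuous = any(f not in categorical for f in factors)
        return (has_continuous, len(factors))

    # Stable sort: columns of the same term stay adjacent and in order.
    order = sorted(range(len(columns)), key=key)
    changed = order != list(range(len(columns)))
    return order, changed


# Columns as rendered with _ordering: none in the example above.
rendered = [
    ("Intercept", ()),
    ("logpyears", ("logpyears",)),
    ("smokes", ("smokes",)),
    ("C(agecat)[T.2]", ("C(agecat)",)),
    ("C(agecat)[T.3]", ("C(agecat)",)),
    ("C(agecat)[T.4]", ("C(agecat)",)),
    ("C(agecat)[T.5]", ("C(agecat)",)),
]
order, changed = reorder_columns(rendered, categorical={"C(agecat)"})
names = [rendered[i][0] for i in order]
print(names, changed)
```

For this input the reordered names agree with the patsy listing earlier in the thread, but the sketch depends on knowing which factors are categorical after rendering.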