Add support for tracking variable usage in formulae.

matthewwardrop commented 1 year ago

As per #32 and #60, it is sometimes useful to be able to look up which variables were used by the formula and its terms. This can be used to slice columns, or to determine data requirements for subsequent model evaluations.

This patchset adds support for tracking this information during model matrix materialization, and exposes it as attributes of the ModelSpec. For example:

>>> import pandas
>>> from formulaic import Formula
>>> df = pandas.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
>>> mm = Formula("a + b + bs(b) + C(a, contr.treatment)").get_model_matrix(df)
>>> mm.term_variables
OrderedDict([(1, {}),
             (a, {'a': 'data'}),
             (b, {'b': 'data'}),
             (bs(b), {'bs()': 'transforms', 'b': 'data'}),
             (C(a, contr.treatment),
              {'C()': 'transforms',
               'a': 'data',
               'contr.treatment': 'transforms'})])
>>> mm.variable_terms
defaultdict(set,
            {'a': {C(a, contr.treatment), a},
             'b': {b, bs(b)},
             'bs()': {bs(b)},
             'C()': {C(a, contr.treatment)},
             'contr.treatment': {C(a, contr.treatment)}})
>>> mm.variable_indices
{'a': [1, 6, 7],
 'b': [2, 3, 4, 5],
 'bs()': [3, 4, 5],
 'C()': [6, 7],
 'contr.treatment': [6, 7]}
>>> mm.required_data_variables
{'a', 'b'}
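
For readers unfamiliar with how these attributes relate to one another, the following is a minimal sketch (not the code in this patchset) of how `variable_terms` and `required_data_variables` could be derived from a `term_variables`-style mapping. Terms are represented as plain strings here, and the helper names `invert_term_variables` and `collect_data_variables` are hypothetical, introduced only for illustration.

```python
from collections import OrderedDict, defaultdict

# Hypothetical stand-in for the term -> {variable: source} mapping shown above.
# Real terms are formulaic objects; plain strings are used here for illustration.
term_variables = OrderedDict([
    ("1", {}),
    ("a", {"a": "data"}),
    ("b", {"b": "data"}),
    ("bs(b)", {"bs()": "transforms", "b": "data"}),
    ("C(a, contr.treatment)", {
        "C()": "transforms",
        "a": "data",
        "contr.treatment": "transforms",
    }),
])


def invert_term_variables(term_variables):
    """Build a variable -> {terms} mapping by inverting term -> variables."""
    variable_terms = defaultdict(set)
    for term, variables in term_variables.items():
        for variable in variables:
            variable_terms[variable].add(term)
    return variable_terms


def collect_data_variables(term_variables):
    """Collect every variable whose recorded source is the input data."""
    return {
        variable
        for variables in term_variables.values()
        for variable, source in variables.items()
        if source == "data"
    }


variable_terms = invert_term_variables(term_variables)
required = collect_data_variables(term_variables)
```

Under these assumptions, `variable_terms["b"]` contains both `"b"` and `"bs(b)"`, and `required` contains only the variables sourced from the data.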

closes: #32 closes: #60

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (c9daec8) 100.00% compared to head (b8dbe4e) 100.00%.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##              main      #145   +/-   ##
==========================================
  Coverage   100.00%   100.00%
==========================================
  Files           48        49     +1
  Lines         2595      2705   +110
==========================================
+ Hits          2595      2705   +110
```

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `100.00% <100.00%> (ø)` | |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| formulaic/materializers/base.py | `100.00% <100.00%> (ø)` | |
| formulaic/materializers/types/evaluated\_factor.py | `100.00% <100.00%> (ø)` | |
| formulaic/materializers/types/scoped\_term.py | `100.00% <100.00%> (ø)` | |
| formulaic/model\_spec.py | `100.00% <100.00%> (ø)` | |
| formulaic/utils/layered\_mapping.py | `100.00% <100.00%> (ø)` | |
| formulaic/utils/stateful\_transforms.py | `100.00% <100.00%> (ø)` | |
| formulaic/utils/variables.py | `100.00% <100.00%> (ø)` | |


matthewwardrop commented 1 year ago

This patch has now been refined and needs only some unit tests before it can be merged. I'm not expecting big changes from here, unless I get feedback from folks.

>>> import pandas
>>> from formulaic import Formula
>>> df = pandas.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
>>> ms = Formula("a + b + a:bs(b) + C(a, contr.treatment)").get_model_matrix(df).model_spec
>>> ms.term_variables
OrderedDict([(1, set()),
             (a, {'a'}),
             (b, {'b'}),
             (C(a, contr.treatment), {'C', 'a', 'contr.treatment'}),
             (a:bs(b), {'a', 'b', 'bs'})])
>>> ms.variable_terms
{'a': {C(a, contr.treatment), a, a:bs(b)},
 'b': {b, a:bs(b)},
 'C': {C(a, contr.treatment)},
 'contr.treatment': {C(a, contr.treatment)},
 'bs': {a:bs(b)}}
>>> ms.variable_indices
{'a': [1, 3, 4, 5, 6, 7],
 'b': [2, 5, 6, 7],
 'C': [3, 4],
 'contr.treatment': [3, 4],
 'bs': [5, 6, 7]}
>>> ms.variables
{'C', 'a', 'b', 'bs', 'contr.treatment'}
>>> ms.variables_by_source
{'data': {'a', 'b'}, 'transforms': {'C', 'bs', 'contr.treatment'}}
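
As a rough illustration of the two use cases mentioned in the description (slicing columns and checking data requirements), here is a minimal sketch that works on plain Python structures shaped like the output above. It does not call formulaic itself; the helper names `select_columns` and `missing_data_variables` are hypothetical, and the model matrix is mocked as a list of rows.

```python
# Values copied from the example output above.
variable_indices = {
    "a": [1, 3, 4, 5, 6, 7],
    "b": [2, 5, 6, 7],
    "C": [3, 4],
    "contr.treatment": [3, 4],
    "bs": [5, 6, 7],
}
variables_by_source = {
    "data": {"a", "b"},
    "transforms": {"C", "bs", "contr.treatment"},
}


def select_columns(rows, variable, variable_indices):
    """Keep only the columns of each row that depend on `variable`."""
    indices = variable_indices[variable]
    return [[row[i] for i in indices] for row in rows]


def missing_data_variables(available_columns, variables_by_source):
    """Return the data variables the model needs but the input lacks."""
    return variables_by_source.get("data", set()) - set(available_columns)


# A mocked 2 x 8 model matrix (assumption: one intercept column plus the
# seven columns indexed above).
rows = [list(range(8)), list(range(10, 18))]

b_columns = select_columns(rows, "b", variable_indices)
missing = missing_data_variables(["a"], variables_by_source)
```

Here `b_columns` holds columns 2, 5, 6 and 7 of each row, and `missing` reports that a dataset containing only `a` would still need `b` before the model could be evaluated.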

matthewwardrop / formulaic

Add support for tracking variable usage in formulae. #145
