washingtonpost / elex-live-model

a model to generate estimates of the number of outstanding votes on an election night based on the current results of the race

Updates to Featurizer #69

Closed lennybronner closed 1 year ago

lennybronner commented 1 year ago

Description

To get the upcoming bootstrap model working, I had to change how the Featurizer works, and those changes need to stay compatible with our nonparametric and gaussian models. This PR covers only the necessary changes to the Featurizer, the changes in the BaseElectionModel needed to work with the new Featurizer, and the corresponding unit test changes.

Beyond centering and scaling the features and adding an intercept, the core problem the Featurizer needed to deal with was generating fixed effects. Specifically, fixed effect values in the fitting data might not appear in the holdout data, and fixed effect values in the holdout data might not appear in the fitting data.
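To illustrate the mismatch, here is a minimal sketch with toy data. It is not the Featurizer code: the DataFrames and variable names are hypothetical, and it only assumes pandas is available. Encoding each split on its own produces different column sets:

```python
import pandas as pd

# Hypothetical toy data: "postal_code" plays the role of a fixed effect column.
fitting = pd.DataFrame({"postal_code": ["AZ", "AZ", "CA"]})
holdout = pd.DataFrame({"postal_code": ["CA", "NV"]})

# One-hot encoding each split separately yields different columns.
fitting_fe = pd.get_dummies(fitting, columns=["postal_code"])
holdout_fe = pd.get_dummies(holdout, columns=["postal_code"])

# Columns that exist in one split but not the other.
only_in_fitting = sorted(set(fitting_fe.columns) - set(holdout_fe.columns))
only_in_holdout = sorted(set(holdout_fe.columns) - set(fitting_fe.columns))

print(only_in_fitting)  # ['postal_code_AZ']
print(only_in_holdout)  # ['postal_code_NV']
```

A model fit on the fitting columns cannot be applied directly to the holdout columns, which is why the two sets have to be reconciled somewhere.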

In the past we manually added and subtracted the columns. We now generate the fixed effects for all units and differentiate between expanded fixed effects (all fixed effects that haven't been dropped to avoid multicollinearity) and active fixed effects (expanded fixed effects that appear in the fitting data).
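A rough sketch of the expanded vs. active distinction, using toy data and pandas. This is an illustration under assumptions, not the actual Featurizer implementation: the `is_fitting` flag, the variable names, and the choice to drop the first category are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data; "is_fitting" marks units used for fitting.
units = pd.DataFrame(
    {
        "postal_code": ["AZ", "AZ", "CA", "CA", "NV"],
        "is_fitting": [True, True, True, False, False],
    }
)

# Generate the fixed effects once for all units. Dropping the first
# category (an assumption here) avoids multicollinearity with an intercept.
dummies = pd.get_dummies(units["postal_code"], prefix="postal_code", drop_first=True)
expanded_fixed_effects = list(dummies.columns)

# Active fixed effects: expanded ones that appear in at least one fitting unit.
fitting_dummies = dummies[units["is_fitting"]]
active_fixed_effects = [c for c in expanded_fixed_effects if fitting_dummies[c].any()]

# Both splits are then restricted to the same active columns.
x_fitting = dummies.loc[units["is_fitting"], active_fixed_effects]
x_holdout = dummies.loc[~units["is_fitting"], active_fixed_effects]

print(expanded_fixed_effects)  # ['postal_code_CA', 'postal_code_NV']
print(active_fixed_effects)    # ['postal_code_CA']
```

In this sketch the NV unit only shows up in the holdout data, so its fixed effect is expanded but not active. Its row in `x_holdout` is all zeros, meaning it is treated like the dropped baseline category.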

Jira Ticket

Test Steps

These commands should still work as expected:

elexmodel 2020-11-03_USA_G --office_id=P --estimands=dem --geographic_unit_type=county --pi_method=nonparametric --percent_reporting=20 --aggregates=postal_code --fixed_effects=postal_code
elexmodel 2020-11-03_USA_G --office_id=P --estimands=dem --geographic_unit_type=county --pi_method=nonparametric --percent_reporting=20 --aggregates=postal_code --fixed_effects=postal_code --fixed_effects=county_classification

Also run tox for the unit tests.

lennybronner commented 1 year ago

Mostly looks good. I'm a bit hesitant about the FE approach but don't have any other ideas about how to solve the problem.

What part are you hesitant about?