washingtonpost / elex-live-model

a model to generate estimates of the number of outstanding votes on an election night based on the current results of the race

Fix/elex 1981 fixed effect bug #27

Closed lennybronner closed 1 year ago

lennybronner commented 1 year ago

Description

Fixes a fixed effect bug in which the fixed effect category of the unit that came first in the preprocessed data needed to have at least one reporting unit so that the covariate matrix was invertible.

On election night in November we couldn't add a fixed effect before Alabama had at least one reporting county. This was because we used pd.get_dummies(..., drop_first=True), where drop_first=True was supposed to make sure that the matrix was invertible. However, it drops the first column regardless of whether that fixed effect has a reporting unit. So before Alabama was reporting, and while using postal_code fixed effects, the column we were dropping was all zeroes among the reporting units. Dropping it did nothing to remove the linear dependence among the remaining fixed effect columns of the reporting units.
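The failure mode can be reproduced on toy data. This is a minimal sketch, not the model's actual code: the `units` frame, its `reporting` flag, and the explicit intercept column are assumptions made for illustration. It builds dummies over all units with drop_first=True, keeps only the reporting rows, and checks the rank of the resulting design matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Alabama sorts first but has no reporting units yet.
units = pd.DataFrame(
    {
        "postal_code": ["AL", "AL", "GA", "GA", "TX", "TX"],
        "reporting": [False, False, True, True, True, True],
    }
)

# drop_first=True removes the AL column, whether or not AL is reporting.
dummies = pd.get_dummies(units["postal_code"], drop_first=True, dtype=float)

# Keep only the reporting rows and prepend an intercept (assumed here).
reporting_dummies = dummies.loc[units["reporting"]]
X = np.column_stack([np.ones(len(reporting_dummies)), reporting_dummies.to_numpy()])

# For every reporting row, GA + TX equals the intercept, so the columns
# are linearly dependent and X'X cannot be inverted.
n_columns = X.shape[1]
rank = np.linalg.matrix_rank(X)
print(n_columns, rank)
```

The rank comes out lower than the number of columns because the dropped AL column carried no information for the reporting rows.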

We've changed how we generate fixed effects so that we now do it when getting the reporting units. This makes sure that we only generate fixed effects for the categories that we need (i.e. only for states where at least one unit has already reported). To make sure that the nonreporting dataframe has the same fixed effects, we manually add the missing categories when generating the nonreporting dataframe.
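One way to picture the new approach is the sketch below. The helper name `build_fixed_effects`, the toy dataframes, and the choice to keep drop_first=True on the reporting units are all assumptions for illustration; the repository's own implementation may differ.

```python
import pandas as pd


def build_fixed_effects(reporting, nonreporting, column):
    """Hypothetical helper: derive fixed effects from the reporting units only."""
    # Only categories present among reporting units get a column, so the
    # dropped first category is guaranteed to have at least one reporting unit.
    rep_fe = pd.get_dummies(reporting[column], prefix=column, drop_first=True, dtype=float)

    # The nonreporting units get dummies for their own categories.
    non_fe = pd.get_dummies(nonreporting[column], prefix=column, dtype=float)

    # Manually add any reporting fixed effect that is missing here,
    # e.g. for a state whose units have all reported already.
    for name in rep_fe.columns:
        if name not in non_fe.columns:
            non_fe[name] = 0.0

    return reporting.join(rep_fe), nonreporting.join(non_fe), list(rep_fe.columns)


# Toy data: AL has no reporting units, FL and GA have no nonreporting units.
reporting = pd.DataFrame({"postal_code": ["FL", "GA", "GA", "TX"]})
nonreporting = pd.DataFrame({"postal_code": ["AL", "AL", "TX"]})

rep, non, features = build_fixed_effects(reporting, nonreporting, "postal_code")
print(features)
print(sorted(non.columns))
```

In this toy case the nonreporting frame ends up with a zero-filled column for GA and an extra column for AL that the reporting frame does not have.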

The nonreporting dataframe can have additional categories (fixed effects that don't appear in the reporting set), but we don't need to worry about those since we select out the features (incl. fixed effects) that we need when fitting and predicting with the model.
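Why the extra categories are harmless can be seen in a small sketch. Ordinary least squares via NumPy stands in for the model's actual regression here, and the column names, the `residual` target, and the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Features come from the reporting set only (GA is the dropped baseline here).
features = ["intercept", "postal_code_TX"]

reporting = pd.DataFrame(
    {
        "intercept": [1.0, 1.0, 1.0, 1.0],
        "postal_code_TX": [0.0, 0.0, 1.0, 1.0],
        "residual": [0.10, 0.12, -0.05, -0.03],  # made-up target values
    }
)

# The nonreporting frame carries an extra AL column that is never selected.
nonreporting = pd.DataFrame(
    {
        "intercept": [1.0, 1.0, 1.0],
        "postal_code_AL": [1.0, 1.0, 0.0],
        "postal_code_TX": [0.0, 0.0, 1.0],
    }
)

# Fit on the selected features only (least squares as a stand-in).
coef, *_ = np.linalg.lstsq(
    reporting[features].to_numpy(), reporting["residual"].to_numpy(), rcond=None
)

# Predict with the same selected features; the AL column is ignored, so
# AL units receive the baseline (intercept-only) prediction.
predictions = nonreporting[features].to_numpy() @ coef
print(predictions)
```

Because both fitting and predicting index the dataframes by the same feature list, any additional fixed effect columns in the nonreporting dataframe never reach the model.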

Jira Ticket

elex-1981

Test Steps

To replicate the bug, change line 94 in cli.py to

data_handler.shuffle(upweight={"postal_code": {"AL": 0.01}})

and run. This should generate live data in which no Alabama county is reporting, which crashes the model without this fix. The crash no longer happens on the new branch.

Also updated some tests to reflect this change.