Description
Fixes a fixed effect bug: the fixed effect category of the unit that came first in the preprocessed data needed to have at least one reporting unit for the covariate matrix to be invertible.
On election night in November we couldn't add a fixed effect before Alabama had at least one reporting county. This was because we used pd.get_dummies(..., drop_first=True), where drop_first=True was supposed to make sure that the matrix was invertible. However, it drops the first column regardless of whether that fixed effect has a reporting unit. So before Alabama was reporting, and while using postal_code fixed effects, the column we were dropping was all zeroes among the reporting units, which did nothing to remove the linear dependence among the remaining fixed effect columns.

We've changed how we generate fixed effects so that we now do it when getting the reporting units. This makes sure that we only generate fixed effects for the categories that we need (i.e. only for states where at least one unit has already reported). To make sure that the nonreporting dataframe has the same fixed effects, we manually add the missing categories when generating the nonreporting dataframe.
The nonreporting dataframe can have additional categories (fixed effects that don't appear in the reporting set), but we don't need to worry about those since we select out the features (incl. fixed effects) that we need when fitting and predicting with the model.
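The approach described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the code in this branch: the dataframe names, the x covariate, and the choice to keep drop_first=True on the reporting units are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data: GA, NY and PA have reporting units, AL does not.
reporting = pd.DataFrame(
    {"postal_code": ["GA", "GA", "NY", "PA"], "x": [1.0, 2.0, 3.0, 4.0]}
)
nonreporting = pd.DataFrame({"postal_code": ["AL", "NY"], "x": [5.0, 6.0]})

# Generate fixed effects from the reporting units only, so every dummy
# column (including the dropped one) has at least one reporting unit.
reporting_fe = pd.get_dummies(
    reporting, columns=["postal_code"], drop_first=True, dtype=float
)

# The nonreporting units get their own dummies, which may include extra
# categories (here AL) and may miss some reporting categories (here PA).
nonreporting_fe = pd.get_dummies(nonreporting, columns=["postal_code"], dtype=float)

# Manually add the categories that only appear in the reporting set.
for col in reporting_fe.columns.difference(nonreporting_fe.columns):
    nonreporting_fe[col] = 0.0

# Select the same features for fitting and predicting; the extra AL column
# in the nonreporting dataframe is simply never selected.
features = list(reporting_fe.columns)
X_fit = reporting_fe[features]
X_predict = nonreporting_fe[features]
```

Selecting columns by the reporting feature list is what makes the extra nonreporting categories harmless: they exist in the dataframe but never reach the model.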
Jira Ticket
elex-1981
Test Steps
To replicate the bug, change line 94 in cli.py to and run. This should generate live data where no Alabama county is reporting, which should crash the model. This no longer happens in the new branch.
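The underlying rank problem can also be checked in isolation, without the CLI. The following is a minimal sketch on toy data; it assumes the design matrix includes an intercept column, and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: AL sorts first but has no reporting units yet.
all_units = pd.DataFrame(
    {
        "postal_code": ["AL", "AL", "GA", "GA", "NY", "NY"],
        "reporting": [False, False, True, True, True, True],
    }
)
is_reporting = all_units["reporting"]

# Old behaviour: dummies are built on all units, so drop_first removes AL,
# a column that is all zeroes among the reporting units.
dummies_old = pd.get_dummies(all_units["postal_code"], drop_first=True, dtype=float)
X_old = dummies_old.loc[is_reporting].copy()
X_old.insert(0, "intercept", 1.0)  # assumed intercept column
rank_old = np.linalg.matrix_rank(X_old.to_numpy())

# New behaviour: dummies are built on reporting units only, so the dropped
# column (GA) belongs to a category that actually has reporting units.
dummies_new = pd.get_dummies(
    all_units.loc[is_reporting, "postal_code"], drop_first=True, dtype=float
)
X_new = dummies_new.copy()
X_new.insert(0, "intercept", 1.0)
rank_new = np.linalg.matrix_rank(X_new.to_numpy())

print(X_old.shape[1], rank_old)  # more columns than rank: not invertible
print(X_new.shape[1], rank_new)  # columns equal rank: full column rank
```

In the old matrix the intercept equals the sum of the GA and NY columns for every reporting unit, which is the linear dependence that dropping the empty AL column failed to remove.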
Also, updated some tests to reflect this change.