washingtonpost / elex-live-model

a model to generate estimates of the number of outstanding votes on an election night based on the current results of the race

ELEX-2305-dara-agg-model-DATA-SCIENCE-EXPERIMENTAL #55

Closed daragold closed 1 year ago

daragold commented 1 year ago

Description

This is the first version of the aggregate model. It builds on the current model's state predictions to get point estimates and confidence intervals for a national aggregate vote (e.g. electoral votes). Note that it doesn't matter whether the underlying model uses the nonparametric or Gaussian version, because this agg approach just takes each state's confidence intervals no matter which method was used to make them.

The rough overview of the agg model is:

  1. After the normal model is run, every state ends up with a point prediction for its votes (e.g. 1000 Dem votes and 235 GOP votes predicted for PA) AND confidence intervals (CI) around those numbers (e.g. 900-1100 Dem votes and 230-300 GOP votes in PA)
  2. The agg model wants to translate these per-state numbers into one national estimate (e.g. Dem total 200 electoral votes and GOP 335 electoral votes) with CIs around those too (e.g. Dem 190-210 electoral votes and GOP 330-350 electoral votes)
  3. To do this we run a number of sampling trials (1000 by default). In each trial we do a random draw across all the states' individual CIs. So the result of a trial is an estimate of the Dem and GOP vote in each state, and from there we can get the total electoral votes for Dem and GOP in that trial
  4. We then combine the results of the trials into a final nationwide estimate with its own CI. This aggregate CI depends on how much variability we saw across trials. So a ton of fluctuation across trials means larger CI at agg level.
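The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this branch: the function name `run_trials`, the input layout, and the choice to draw uniformly within each state's CI are all assumptions made for the example, and the electoral-vote values are placeholders.

```python
import random


def run_trials(state_cis, electoral_votes, num_trials=1000, seed=None):
    """Return one (dem_ev, gop_ev) pair per trial.

    state_cis: {state: {"dem": (lower, upper), "gop": (lower, upper)}}
    electoral_votes: {state: electoral votes awarded to the state winner}
    Both layouts are hypothetical and chosen only for this sketch.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(num_trials):
        dem_ev = 0
        gop_ev = 0
        for state, cis in state_cis.items():
            # Assumption: draw uniformly inside each party's interval.
            dem_votes = rng.uniform(*cis["dem"])
            gop_votes = rng.uniform(*cis["gop"])
            if dem_votes > gop_votes:
                dem_ev += electoral_votes[state]
            else:
                gop_ev += electoral_votes[state]
        results.append((dem_ev, gop_ev))
    return results


# Placeholder inputs: "PA" reuses the intervals from step 1, "XX" is made up.
example_cis = {
    "PA": {"dem": (900, 1100), "gop": (230, 300)},
    "XX": {"dem": (100, 200), "gop": (150, 250)},
}
example_ev = {"PA": 19, "XX": 10}
trials = run_trials(example_cis, example_ev, num_trials=1000, seed=1)
```

Each entry of `trials` is one simulated national outcome; the spread of those outcomes across trials is what drives the width of the aggregate CI in step 4.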

To run the agg model, set three parameters, which are all in run.py of the testbed:

  1. agg_model_preds = True
  2. ci_method = 'percentile' (other options are 'normal_dist_mean' and 't_dist_mean')
  3. num_observations = 1

'ci_method' determines how confidence intervals are computed around the electoral-vote point estimates. Because this agg model conducts random trials - in each trial a number of electoral votes are predicted - the ci_method is needed to say how CI bounds are determined from the trial results.
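As a rough illustration of how trial results could be turned into CI bounds, the sketch below shows one plausible reading of two of the option names. The exact formulas used in this branch are not reproduced here; the function names, the `level` argument, and the formulas themselves are assumptions. A 't_dist_mean' variant would presumably swap the normal quantile for a Student-t quantile, which the standard library does not provide, so it is left out.

```python
import math
import statistics


def percentile_ci(trial_totals, level=0.9):
    """Bounds taken directly from the empirical distribution of trial totals."""
    ordered = sorted(trial_totals)
    last = len(ordered) - 1
    lower = ordered[round((1 - level) / 2 * last)]
    upper = ordered[round((1 + level) / 2 * last)]
    return lower, upper


def normal_mean_ci(trial_totals, level=0.9):
    """Normal-approximation interval around the mean of the trial totals.

    Assumed form: mean +/- z * (sample std / sqrt(n)). Needs at least 2 trials.
    """
    mean = statistics.fmean(trial_totals)
    std_err = statistics.stdev(trial_totals) / math.sqrt(len(trial_totals))
    z = statistics.NormalDist().inv_cdf((1 + level) / 2)
    return mean - z * std_err, mean + z * std_err
```

Under this reading the two approaches answer different questions: the percentile bounds describe the spread of individual trial outcomes, while the interval around the mean shrinks as the number of trials grows.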

num_observations is the number of random draws per trial. Right now it is set to one, with a default of 1000 trials. This means one (joint) sample is drawn across the states in each of the 1000 trials.

When the agg model is enabled, it is invoked in client.py around line 340 in this branch. That call goes to the function 'get_national_summary_votes_estimate', which lives in BaseElectionModel.py. Almost all the functions it uses are in the same file.

Note: The second Jira ticket listed below points out two new agg model features that should be sanity-checked. They deal with incorporating states that have already been called, and with aggregate votes already accrued before an election (for example, in a Senate race, all the seats Dems hold that are NOT up for election in a given cycle).
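For sanity-checking, it may help to picture how these two features could enter a single trial. The sketch below is hypothetical and is not the code in this branch: the function name, the `called` and `baseline` arguments, and the uniform draw within each CI are assumptions, and all numbers are made up.

```python
import random


def trial_with_known_results(state_cis, unit_votes, called, baseline, rng):
    """Run one trial that respects called states and pre-election totals.

    called: {state: "dem" or "gop"} for races that have already been called
    baseline: {"dem": n, "gop": m} votes held before the election,
              e.g. Senate seats not up in this cycle
    """
    totals = dict(baseline)  # copy so the caller's baseline is not mutated
    for state, cis in state_cis.items():
        if state in called:
            # A called state is fixed, so nothing is sampled for it.
            winner = called[state]
        else:
            dem_votes = rng.uniform(*cis["dem"])
            gop_votes = rng.uniform(*cis["gop"])
            winner = "dem" if dem_votes > gop_votes else "gop"
        totals[winner] += unit_votes[state]
    return totals


# Made-up Senate-style example: each race is worth one seat.
races = {
    "A": {"dem": (900, 1100), "gop": (230, 300)},
    "B": {"dem": (100, 120), "gop": (400, 500)},
}
seats = {"A": 1, "B": 1}
holdover = {"dem": 36, "gop": 29}
outcome = trial_with_known_results(races, seats, {"A": "gop"}, holdover, random.Random(0))
```

One check that follows from this framing: if every state has been called, all trials should return the same totals, so the aggregate CI should collapse to a single value.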

Jira Ticket

https://arcpublishing.atlassian.net/jira/software/c/projects/ELEX/boards/1026?modal=detail&selectedIssue=ELEX-2305

https://arcpublishing.atlassian.net/jira/software/c/projects/ELEX/boards/1026?modal=detail&selectedIssue=ELEX-2453

Test Steps