washingtonpost / elex-live-model

a model to generate estimates of the number of outstanding votes on an election night based on the current results of the race

ELEX-2418-find-optimal-lambda #56

Open rishasurana opened 1 year ago

rishasurana commented 1 year ago

Description

We can pass a regularization constant, lambda_, to our model, but on election night we don't know what the optimal lambda_ is. Add a feature to the client that takes options for lambda_, the features, and the fixed effects. It should read in the data and use k-fold cross validation to find the optimal lambda_, i.e. the one with the lowest average MAPE on the k hold-out sets. The function should probably return the lambda_ and the average MAPE.

AC:

  1. A function that takes a list of possible lambda_ values, the features and fixed effects we are using, and any other inputs that are necessary, and runs k-fold cross validation to pick the best lambda_ value to use
  2. Unit tests

Jira Ticket

https://arcpublishing.atlassian.net/browse/ELEX-2418
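
For reference, a minimal sketch of the kind of helper described above, assuming a NumPy feature matrix and a generic fit/predict pair (the signature of find_optimal_lambda here is illustrative, not the repo's actual API):

import numpy as np


def mape(y_true, y_pred):
    # mean absolute percentage error on one hold-out set
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))


def find_optimal_lambda(X, y, lambdas, fit, predict, k=3, seed=0):
    """Return the best lambda_ and its average MAPE from k-fold cross validation."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    avg_mapes = []
    for lambda_ in lambdas:
        fold_mapes = []
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            model = fit(X[train_idx], y[train_idx], lambda_)   # train on the other k-1 folds
            fold_mapes.append(mape(y[test_idx], predict(model, X[test_idx])))
        avg_mapes.append(np.mean(fold_mapes))
    best = int(np.argmin(avg_mapes))  # ties resolve to the first lambda in the list
    return lambdas[best], avg_mapes[best]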

rishasurana commented 1 year ago

Ok, so close. But I also just realized there is something weird happening with the CLI. The parameter names don't match, I think (and you will also need to update the default in cli.py).

Changed it such that it can be called like this: --lambda='[0.01, 0.05, 0.90, 0.56]', but the internal variable name is still lambda_. The other alternative is using the click.option() multiple=True parameter (as used with some of the other params), which would require us to pass in values like this: --lambda 0.01 --lambda 0.05 --lambda 0.90 --lambda 0.56
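
For comparison, the two click patterns look roughly like this (a sketch only, not the actual cli.py; the option declarations and command names are assumptions):

import json

import click


# Option A: accept a list literal as one string and parse it ourselves,
# e.g. --lambda='[0.01, 0.05, 0.90, 0.56]'
@click.command()
@click.option("--lambda", "lambda_", default="[0.1]", type=str)
def run_string(lambda_):
    click.echo(json.loads(lambda_))


# Option B: multiple=True, repeating the flag once per value,
# e.g. --lambda 0.01 --lambda 0.05 --lambda 0.90 --lambda 0.56
@click.command()
@click.option("--lambda", "lambda_", multiple=True, type=float)
def run_multiple(lambda_):
    click.echo(list(lambda_))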

rishasurana commented 1 year ago

If I run the code with postal code fixed effects I get a NaN error:

elexmodel 2020-11-03_USA_G --office_id=P --estimands=dem --geographic_unit_type=county --pi_method=nonparametric --percent_reporting=30 --aggregates=postal_code --aggregates=unit --fixed_effects=postal_code --lambda='[1, 10, 100, 1000]'

When getting unit predictions for each fold (nonreporting_units_features = self.featurizer.featurize_heldout_data(nonreporting_units)), there are a number of NaN values (1-2% of the total df) in nonreporting_units_features after calling featurize_heldout_data on the test set of each fold. I can't seem to replicate this issue for other fixed-effect/aggregate combinations, and it works fine when not splitting the data into test/train sets. Each time I run this command, certain states have NaN values and cause errors more often than others (HI, DE, NJ). Which states cause errors varies from run to run because the fold data is now randomly selected.

Example runs (columns that have NaN values, total NaN values in the dataframe):

  1. ['postal_code_HI'], 303
  2. ['postal_code_HI', 'postal_code_NJ', 'postal_code_NV'], 927
  3. ['postal_code_DE', 'postal_code_NJ', 'postal_code_WY'], 930
  4. ['postal_code_HI', 'postal_code_WA'], 602
  5. ['postal_code_DE', 'postal_code_MD'], 632
  6. ['postal_code_DE'], 281

Using fillna(0) here in the Featurizer class, right after new_heldout_data = new_heldout_data.join(missing_expanded_fixed_effects_df), resolves this issue: certain states seem to be missing from some of the smaller test folds, so when we use .join the unmatched rows are filled with NaN.

Switching to use .merge doesn't work here because there are no common columns to merge on.
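
A toy reproduction of that .join behavior (dummy column names and indices, not the featurizer's real data):

import pandas as pd

# A hold-out fold that is missing one of the fixed-effect columns seen in training.
new_heldout_data = pd.DataFrame({"postal_code_DE": [1, 0]}, index=[10, 11])
missing_expanded_fixed_effects_df = pd.DataFrame({"postal_code_HI": [0]}, index=[10])

joined = new_heldout_data.join(missing_expanded_fixed_effects_df)
print(joined)            # row 11 has no matching index, so postal_code_HI becomes NaN
print(joined.fillna(0))  # the proposed fix: treat the missing fixed effect as 0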

rishasurana commented 1 year ago

It appears that lambda has little to no effect on our model predictions... for most cases

Testing 2020 data:

command: elexmodel 2020-11-03_USA_G --office_id=P --estimands=dem --geographic_unit_type=county --pi_method=gaussian --percent_reporting=50 --aggregates=postal_code --aggregates=unit --lambda='[18347589365782638975893745893653.0, 2.0, 3345683987523849082439027492748237490278420384928394829089032742355.0, 5.0, 9.0, 15.0, 25.0, 37.0, 483475289735820748302742374823742342.0, 1345325206.0, 99.0]'

output:

More than one lambda has the lowest average MAPE of: 0.067
[1.834758936578264e+31, 2.0, 3.3456839875238492e+66, 5.0, 9.0, 15.0, 25.0, 37.0, 4.834752897358207e+35, 1345325206.0, 99.0]

I've tested large and small lambda values and that still doesn't make a difference in the predictions. For example, let's consider the unit predictions for fold 1:

Lambda 1.834758936578264e+31: [421, 2420.0]
Lambda 2.0: [421, 2420.0]
Lambda 3.3456839875238492e+66: [421, 2420.0]
Lambda 5.0: [421, 2420.0]
Lambda 9.0: [421, 2420.0]
Lambda 15.0: [421, 2420.0]
Lambda 25.0: [421, 2420.0]
Lambda 37.0: [421, 2420.0]
Lambda 4.834752897358207e+35: [421, 2420.0]
Lambda 1345325206.0: [421, 2420.0]
Lambda 99.0: [421, 2420.0]

The same pattern holds for the unit predictions of fold 2 and fold 3 (fold 2's predictions never equal fold 3's, but fold 2 with lambda 1 equals fold 2 with lambda 2). A given fold's train/test split consistently produces the same unit predictions across lambda values, so the fold MAPE values calculated are the same for each lambda value.

I've removed default values and verified that the lambda values are correctly being passed through to the QR when called in find_optimal_lambda() and are being added to the loss function.

Loss function without lambda:

Sum([6.23963173e-04 1.53951848e-05 8.20251779e-04 ... 7.51431638e-06
 1.22245098e-04 1.22794926e-05] @ (Promote(0.5, (1517,)) @ abs([0.37310912 0.125      0.15607195 ... 0.53658537 0.13643178 0.35820896] + -[[1.]
 [1.]
 [1.]
 ...
 [1.]
 [1.]
 [1.]] @ var370) + Promote(0.0, (1517,)) @ ([0.37310912 0.125      0.15607195 ... 0.53658537 0.13643178 0.35820896] + -[[1.]
 [1.]
 [1.]
 ...
 [1.]
 [1.]
 [1.]] @ var370)), None, False)

Loss function with lambda:

Sum([6.23963173e-04 1.53951848e-05 8.20251779e-04 ... 7.51431638e-06
 1.22245098e-04 1.22794926e-05] @ (Promote(0.5, (1517,)) @ abs([0.37310912 0.125      0.15607195 ... 0.53658537 0.13643178 0.35820896] + -[[1.]
 [1.]
 [1.]
 ...
 [1.]
 [1.]
 [1.]] @ var370) + Promote(0.0, (1517,)) @ ([0.37310912 0.125      0.15607195 ... 0.53658537 0.13643178 0.35820896] + -[[1.]
 [1.]
 [1.]
 ...
 [1.]
 [1.]
 [1.]] @ var370)), None, False) + 1.6786870797807898e+26 @ power(Pnorm(var370[1], 2), 2.0)
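
For readability, those printed expressions correspond roughly to a weighted pinball loss at tau = 0.5 plus an L2 penalty scaled by lambda_. A hedged cvxpy sketch of that shape (placeholder data; exactly which coefficients are penalized is an assumption):

import cvxpy as cp
import numpy as np

n = 1517                                                 # matches the (1517,) shapes above
X = np.hstack([np.ones((n, 1)), np.random.rand(n, 1)])   # intercept column plus one feature
y = np.random.rand(n)                                    # placeholder normalized results
w = np.random.rand(n)                                    # placeholder unit weights
tau, lambda_ = 0.5, 2.0

beta = cp.Variable(X.shape[1])                           # plays the role of var370
residuals = y - X @ beta

# Weighted pinball loss; at tau = 0.5 the second term gets a 0.0 multiplier,
# which matches the Promote(0.5, ...) and Promote(0.0, ...) factors in the printout.
pinball = cp.sum(cp.multiply(w, 0.5 * cp.abs(residuals) + (tau - 0.5) * residuals))

# The extra term in the second printout: lambda_ times power(Pnorm(var370[1], 2), 2.0),
# i.e. a squared L2 penalty that appears to skip the intercept coefficient.
objective = cp.Minimize(pinball + lambda_ * cp.norm(beta[1:], 2) ** 2)
cp.Problem(objective).solve()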

From what I've read, some reasons this could be happening are that the dataset is already well-behaved, the features are not highly correlated, or the model is not that complex.

  1. Adding features to the command also doesn't seem to help this case:

command: elexmodel 2020-11-03_USA_G --office_id=P --estimands=dem --geographic_unit_type=county --pi_method=gaussian --percent_reporting=50 --aggregates=postal_code --aggregates=unit --lambda='[12.0, 2.0, 33.0, 5.0, 9.0, 15.0, 25.0, 37.0, 234, 1000, 99.0]' --features=ethnicity_european --features=ethnicity_east_and_south_asian --features=ethnicity_hispanic_and_portuguese --features=ethnicity_likely_african_american

output:

More than one lambda has the lowest average MAPE of: 0.067
[12.0, 2.0, 33.0, 5.0, 9.0, 15.0, 25.0, 37.0, 234, 1000, 99.0]
  2. Modifying the previous 2020-11-03_USA_G command, which had lots of features, to include some lambda values closer to zero does nothing:

command: elexmodel 2020-11-03_USA_G --office_id=P --estimands=dem --geographic_unit_type=county --pi_method=gaussian --percent_reporting=50 --aggregates=postal_code --aggregates=unit --lambda='[12.0, 2.0, 33.0, 5.0, 9.0, 15.0, 25.0, 37.0, 234, 1000, 0, 0.000006]' --features=ethnicity_european --features=ethnicity_east_and_south_asian --features=ethnicity_hispanic_and_portuguese --features=ethnicity_likely_african_american

output:

More than one lambda has the lowest average MAPE of: 0.073
[12.0, 2.0, 33.0, 5.0, 9.0, 15.0, 25.0, 37.0, 234, 1000]

So far the only test that seems to work as intended is this integration test:

command: elexmodel 2017-11-07_VA_G --estimands=dem --office_id=G --geographic_unit_type=precinct --aggregates=county_classification --aggregates=postal_code --fixed_effects=county_classification --percent_reporting 10 --features=ethnicity_european --features=ethnicity_hispanic_and_portuguese --lambda='[50, 0.5, 10, 100, 437837, 2, 0]'

output:

avg_MAPE values for each lambda: [0.143 0.137 0.143 0.143 0.143 0.143 0.097]
best: 0

Here, for a large dataset, we are able to produce more variation in terms of average MAPE (3 distinct values); 5 of the 7 given lambdas have the same MAPE. The fact that this test works as intended means this is less likely to be a logic issue in the code and more likely to be something else. The rest of the integration tests only produce 1 distinct MAPE (all the same).

  3. Running a similar command for the 2020 elections results in a similar output:

command: elexmodel 2020-11-03_USA_G --estimands=dem --office_id=P --geographic_unit_type=county --aggregates=county_classification --aggregates=postal_code --fixed_effects=county_classification --percent_reporting 10 --features=ethnicity_european --features=ethnicity_hispanic_and_portuguese --lambda='[50, 0.5, 10, 100, 437837, 2, 0.02, 0]'

output:

avg_MAPE values for each lambda: [0.083 0.077 0.083 0.083 0.073 0.087 0.07 ]
best: 0
  4. Added 0.02 as a potential lambda value:

command: elexmodel 2020-11-03_USA_G --estimands=dem --office_id=P --geographic_unit_type=county --aggregates=county_classification --aggregates=postal_code --fixed_effects=county_classification --percent_reporting 10 --features=ethnicity_european --features=ethnicity_hispanic_and_portuguese --lambda='[50, 0.5, 10, 100, 437837, 2, 0.02, 0]'

output:

More than one lambda has the lowest average MAPE of: 0.07
[0.5, 0.02, 0]
avg_MAPE values for each lambda: [0.077 0.07  0.077 0.077 0.073 0.08  0.07  0.07 ]
best: 0.5

The unit tests: the test datasets have a similar issue as before, generally producing at most 2 unique MAPEs by the end of each run.

import pandas as pd

from elexmodel.models import BaseElectionModel  # import path assumed from the elexmodel package layout


def test_find_optimal_lambda_tens():
    """
    Test/view computing lambda
    """
    lambda_ = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
    model_settings = {"features": ["b"]}
    model = BaseElectionModel.BaseElectionModel(model_settings)
    df_X = pd.DataFrame(
        {
            "residuals_a": [1, 2, 3, 4],
            "total_voters_a": [4, 2, 9, 5],
            "last_election_results_a": [5, 1, 4, 2],
            "results_a": [5, 2, 8, 4],
            "results_b": [3, 6, 2, 1],
            "baseline_a": [9, 2, 4, 5],
            "baseline_b": [9, 2, 4, 5],
            "a": [2, 2, 3, 7],
            "b": [2, 3, 4, 5],
        }
    )
    new_lambda, avg_MAPE = model.find_optimal_lambda(df_X, lambda_, "a")
    assert new_lambda in [0.01, 10.0]  # more than one lambda value can have the same average MAPE

Here, the results oscillate between

  1. More than one lambda has the lowest average MAPE of: 1.0 [0.01, 0.1]
  2. More than one lambda has the lowest average MAPE of: 0.417 [10.0, 100.0, 1000.0]

So the best chosen lambda for this test is either 0.01 or 10.0 (since it chooses the first given lambda with the lowest average MAPE). This happens for the other 2 unit tests as well. The oscillation is due to the fact that we have now randomized the reporting units; removing that randomization gives us only the first result consistently.
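
If the selection is done with something like np.argmin over the average MAPEs (an assumption about the implementation), ties resolve to the first candidate:

import numpy as np

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
avg_mapes = [1.0, 1.0, 1.0, 0.417, 0.417, 0.417]   # illustrative values shaped like result 2 above
best = lambdas[int(np.argmin(avg_mapes))]          # first index of the minimum -> 10.0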

For such a small dataset, there is slightly more variation (2 distinct MAPEs on average for a set of 5 lambdas). For a large dataset such as 2020-11-03_USA_G or 2017-11-07_VA_G, depending on the inputs, there are only 1-3 distinct MAPEs given 5-10 lambdas.

My conclusion is that regularizing our model does not seem to impact the predictions in most cases, but I'm not sure whether that conclusion is valid.

jchaskell commented 1 year ago

Lenny should definitely weigh in, but a couple of thoughts: