openelections / openelections-data-ca

Pre-processed election results for California elections
MIT License

Fix/clean data to pass tests: 2020 General state Precinct file #245

carbonphyber closed 2 years ago

carbonphyber commented 2 years ago

Remove the mail-in and early voting counts, which were mistakenly just copies of the total vote count.

Errors fixed looked like:

======================================================================
  FAIL: test_vote_method_totals (data_tests.test_data.VoteBreakdownTotalsTest) [2020/20201103__ca__general__precinct.csv] (group='2020')
  ----------------------------------------------------------------------
  Traceback (most recent call last):
    File "/home/runner/work/openelections-data-ca/openelections-data-ca/data_tests/data_tests/test_data.py", line 158, in test_vote_method_totals
      self._assertTrue(data_test.passed, f"{self} [{short_path}]", short_message, full_message)
    File "/home/runner/work/openelections-data-ca/openelections-data-ca/data_tests/data_tests/test_data.py", line 59, in _assertTrue
      self.assertTrue(result, short_message)
  AssertionError: False is not true : There are 615 rows where the sum of ['early_voting', 'election_day', 'provisional'] is greater than 'votes':

    Headers: ['county', 'precinct', 'office', 'district', 'candidate', 'party', 'votes', 'early_voting', 'election_day', 'provisional']:
    Row 197801: ['Kern', '11155', 'Registered Voters', '', '', '', '1261', '1261', '1261', '']
    Row 197802: ['Kern', '11190', 'Registered Voters', '', '', '', '1610', '1610', '1610', '']
    Row 197803: ['Kern', '11195', 'Registered Voters', '', '', '', '1366', '1366', '1366', '']
    Row 197804: ['Kern', '11530', 'Registered Voters', '', '', '', '929', '929', '929', '']
    Row 197805: ['Kern', '11540', 'Registered Voters', '', '', '', '1054', '1054', '1054', '']
    Row 197806: ['Kern', '11546', 'Registered Voters', '', '', '', '1123', '1123', '1123', '']
    Row 197807: ['Kern', '11548', 'Registered Voters', '', '', '', '1615', '1615', '1615', '']
    Row 197808: ['Kern', '11550', 'Registered Voters', '', '', '', '1408', '1408', '1408', '']
    Row 197809: ['Kern', '11552', 'Registered Voters', '', '', '', '1346', '1346', '1346', '']
    Row 197810: ['Kern', '11554', 'Registered Voters', '', '', '', '1660', '1660', '1660', '']
    [Truncated to 10 examples]
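
For readers unfamiliar with the check, the failure message says the vote-method columns add up to more than `votes`. The repository's actual test code is not shown here, so the following is only a minimal sketch of that kind of check. The helper names (`to_int`, `rows_exceeding_total`), the treatment of empty cells as zero, and the line numbering are assumptions for illustration.

```python
import csv
import io

# Column names taken from the "Headers" line in the failure output.
METHOD_COLUMNS = ["early_voting", "election_day", "provisional"]


def to_int(value):
    """Assumption: an empty cell counts as zero."""
    return int(value) if value.strip() else 0


def rows_exceeding_total(csv_text):
    """Return (line number, row) pairs where the method columns sum to more than 'votes'."""
    flagged = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so the first data row is line 2 (hypothetical numbering).
    for line_no, row in enumerate(reader, start=2):
        method_sum = sum(to_int(row[col]) for col in METHOD_COLUMNS)
        if method_sum > to_int(row["votes"]):
            flagged.append((line_no, row))
    return flagged


SAMPLE = (
    "county,precinct,office,district,candidate,party,"
    "votes,early_voting,election_day,provisional\n"
    "Kern,11155,Registered Voters,,,,1261,1261,1261,\n"  # duplicated counts: 2522 > 1261
    "Kern,11155,Registered Voters,,,,1261,,,\n"          # cleaned row: 0 <= 1261
)

print([line_no for line_no, _ in rows_exceeding_total(SAMPLE)])  # prints [2]
```

Only the first sample row is flagged, because its `early_voting` and `election_day` cells each repeat the total.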

In this example, I used a VSCode Find-Replace RegEx to search for lines matching:

Kern,([0-9]+),Registered Voters,,,,([0-9]+),\2,\2,

And replaced them with:

Kern,$1,Registered Voters,,,,$2,,,

The RegEx pattern looks for Kern County records that have identical values in the votes, early_voting, and election_day columns, then empties the early_voting and election_day columns.
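
The same find-and-replace could probably be scripted rather than run in an editor. The following is a rough Python equivalent, not what was used for this change: the function name `clear_duplicated_counts` and the sample rows are hypothetical, and the replacement uses Python's `\1`/`\2` syntax where VSCode uses `$1`/`$2`.

```python
import re

# Same pattern as the VSCode search: \2 is a backreference, so the match
# requires early_voting and election_day to repeat the votes value exactly.
PATTERN = re.compile(r"Kern,([0-9]+),Registered Voters,,,,([0-9]+),\2,\2,")

# Keep the precinct (group 1) and votes (group 2); empty the two copied columns.
REPLACEMENT = r"Kern,\1,Registered Voters,,,,\2,,,"


def clear_duplicated_counts(text):
    """Return (cleaned text, number of rows changed)."""
    return PATTERN.subn(REPLACEMENT, text)


SAMPLE = (
    "Kern,11155,Registered Voters,,,,1261,1261,1261,\n"  # duplicated counts: rewritten
    "Kern,11190,Registered Voters,,,,1610,800,810,\n"    # made-up distinct counts: untouched
)

cleaned, changed = clear_duplicated_counts(SAMPLE)
print(changed)  # prints 1
print(cleaned, end="")
```

Because the pattern is anchored by the literal commas on both sides of each field, a row is rewritten only when all three values match in full, so rows with genuinely different counts are left alone.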