openelections / openelections-data-or

Pre-processed results for Oregon elections
MIT License

Verification script #130

Closed nk9 closed 7 years ago

nk9 commented 7 years ago

We're generating lots of CSV files, but there are some common inconsistencies and mistakes. I'm thinking it would be useful to have a script people could run on newly completed CSV files to make sure they follow the OE format.

Things I thought could be verified:

  1. Right number of columns, in the right order
  2. Votes don't contain commas
  3. All numbers are integers (no decimals)
  4. No "X" for zero votes
  5. Consistent capitalization/spelling of "Write-in", "Under Votes", "Over Votes", "Total"
  6. County matches file name, is consistent on every line
  7. Party included for "Write-in," etc lines in primaries, not for generals/specials
  8. Office names properly normalised
  9. Only appropriate fields empty (i.e. only party [if a general] or district)

Has something like this already been created? Other ideas for sanity checks which could be run?
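
A minimal sketch of how a few of the checks above (column layout, commas in votes, integer votes, consistent county, required fields) might look in Python. The column order in `EXPECTED_HEADER` and the function name `check_rows` are assumptions made for illustration; they are not taken from the verifier that was eventually written.

```python
import csv
import io

# Assumed column order; the thread does not spell out the OE header.
EXPECTED_HEADER = ["county", "precinct", "office", "district",
                   "party", "candidate", "votes"]

# Per check 9, only party and district may be left empty.
REQUIRED_FIELDS = ("precinct", "office", "candidate")


def check_rows(csv_text, expected_county):
    """Return a list of (line_number, message) problems found in csv_text."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        # Later checks rely on column positions, so stop here.
        return [(1, f"unexpected header: {header}")]

    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append((line_no, f"expected {len(EXPECTED_HEADER)} "
                                      f"columns, got {len(row)}"))
            continue
        record = dict(zip(EXPECTED_HEADER, row))
        votes = record["votes"]
        if "," in votes:
            problems.append((line_no, "votes contain a comma"))
        elif not (votes.isascii() and votes.isdigit()):
            # Catches decimals, "X", and empty values alike.
            problems.append((line_no, f"votes are not a whole number: {votes!r}"))
        if record["county"] != expected_county:
            problems.append((line_no, f"county {record['county']!r} does not "
                                      f"match {expected_county!r}"))
        for field in REQUIRED_FIELDS:
            if not record[field].strip():
                problems.append((line_no, f"{field} is empty"))
    return problems
```

Returning a list of problems rather than raising on the first one lets a contributor fix a whole file in one pass.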

dwillis commented 7 years ago

That's a super good idea. We have, further down our processing pipeline, the concept of validations, but it would be better I think to have simple checks like this at this stage. The order of the columns doesn't necessarily matter too much, but # 2, 3, 5, 6, 7, 8 and 9 are solid checks. In Mississippi, we have been using 'X' to represent precincts that are not a part of specific legislative districts.

nk9 commented 7 years ago

OK, great. I'll take a stab at it. What are your thoughts on precincts? There are a bunch of formats currently:

0001 0001 1 Full Name 0001 1 PREC #1 PREC 1 1 Full Name FULL NAME 1 Full Name 1 FULL NAME 1 - Full Name

Should we be enforcing any kind of consistency here?

nk9 commented 7 years ago

"In Mississippi, we have been using 'X' to represent precincts that are not a part of specific legislative districts."

That's in the precinct field, though, right? Am I still right to say that this should have "0" for the votes?

{'votes': 'X', 'district': '27', 'candidate': 'PHILIBEN, ANNE N', 'office': 'State Senate', 'county': 'Wasco', 'party': 'DEM', 'precinct': 'PREC 19'}

dwillis commented 7 years ago

No, in Mississippi we do have "X" in the votes column for precincts that aren't part of the legislative district. Not in the precinct column. https://github.com/openelections/openelections-data-ms/blob/master/2015/20151103__ms__general__clay__precinct.csv#L368

dwillis commented 7 years ago

Ran the verifier script (thanks!) and yeah, the pseudo-candidate names are definitely an issue in this repo (and likely others). We have a couple of options:

  1. Clean these up in this repo.
  2. Clean them up as we load them into our processing pipeline, which we've done for other states.

I'd kinda prefer 1.

nk9 commented 7 years ago

Considering that changing them in bulk is as easy as making a sed/perl one-liner, let's just do that.
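
For anyone who prefers Python to sed/perl, the bulk clean-up could be sketched roughly as follows. The canonical spellings come from the checklist at the top of this issue; the variants being folded (plurals, different capitalization or hyphenation) and the name `normalize_candidate` are assumptions about what shows up in the county files.

```python
import re

# Canonical spellings from the checklist; the keys are letters-only,
# lowercased forms, and the plural/singular variants are guesses.
CANONICAL = {
    "writein": "Write-in",
    "writeins": "Write-in",
    "undervote": "Under Votes",
    "undervotes": "Under Votes",
    "overvote": "Over Votes",
    "overvotes": "Over Votes",
    "total": "Total",
    "totals": "Total",
}


def normalize_candidate(name):
    """Map pseudo-candidate variants to one spelling; leave real names alone."""
    key = re.sub(r"[^a-z]", "", name.lower())
    return CANONICAL.get(key, name)
```

Because the lookup key strips everything but letters, "WRITE-IN", "Write In" and "write-ins" all collapse to the same entry, while a real candidate name passes through unchanged.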

nk9 commented 7 years ago

I see the "X for excluded precincts" being used in the Wasco 2002 primary as well. Can't we just remove these lines entirely? The only reason they are there is to pad out the precinct × candidate grid. Other counties, which provide data by precinct, simply don't list contests which didn't appear on the ballot in a given precinct. Or they list only the precincts relevant to a given contest. Like this from Lane County's 2008 primary:

   DEM State Representative 7th District

                               G. Nick  Donald             OVER  UNDER
                              McKibbin  Nordin  WRITE-IN  VOTES  VOTES
                              --------  ------  --------  -----  -----
0005 100007 Blue River             169     201         2      0    195
0007 100009 Camas                   38      48         1      0     72
0017 100097 Latham                 142     267         1      0    214
0019 100102 Lowell                 140     149         6      1    133
0022 100107 Mosby                  227     421         9      0    361
0024 100112 Pleasant Hill 2         59      60         1      0     80
0026 100119 Salmon Creek            82      76         3      0     78
0035 100300 Cottage Grove          353     590         1      0    371
0041 100800 Oakridge               145     226         3      0    145
                      TOTALS      1355    2038        27      1   1649

nk9 commented 7 years ago

  7. Party included for "Write-in," etc lines in primaries, not for generals/specials

And leave it to Wasco to throw us another curve ball: their abstract xlsx seems to merge the Dem and Rep under/over votes in the 2000 primary. Fortunately not in any of the other elections! Don't think I'm going to accommodate this quirk in the verifier.

nk9 commented 7 years ago

@dwillis I have a question about filename formatting. I see these:

20120131__or__special__general__clatsop__house__1__precinct.csv
20111108__or__special__primary__clatsop__house__1__precinct.csv

Which implies that special is just a qualifier for the standard primary/general dichotomy. But then there's also this:

20120131__or__special__washington__house__1__precinct.csv

Where a special is assumed to be a general. Which approach is correct? Or can it really be either?

Also, it would be helpful to get an answer to the "X for district" follow-up I asked on 29 Nov.
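
To make the filename question concrete, here is a rough sketch of a parser that treats "special" as an optional qualifier in front of primary/general, as the first two filenames above suggest. The function name `parse_filename` and the returned keys are hypothetical, and treating the qualifier form as the required one is an assumption.

```python
ELECTION_TYPES = {"primary", "general"}


def parse_filename(filename):
    """Split an OE-style results filename into its double-underscore parts."""
    if not filename.endswith(".csv"):
        raise ValueError(f"not a CSV file: {filename}")
    parts = filename[:-len(".csv")].split("__")
    if len(parts) < 4:
        raise ValueError(f"too few parts in {filename}")
    date, state, *rest = parts
    if not (len(date) == 8 and date.isdigit()):
        raise ValueError(f"bad date {date!r} in {filename}")

    # Assumption: "special" only qualifies the primary/general part.
    special = rest[0] == "special"
    if special:
        rest = rest[1:]
    if not rest or rest[0] not in ELECTION_TYPES:
        raise ValueError(f"expected primary/general in {filename}")
    if len(rest) < 2:
        raise ValueError(f"missing county in {filename}")
    return {
        "date": date,
        "state": state,
        "special": special,
        "type": rest[0],
        "county": rest[1],
        "rest": rest[2:],  # e.g. office, district, "precinct"
    }
```

Under that assumption the two Clatsop filenames parse cleanly, while the Washington one raises a `ValueError` because "washington" sits where primary/general is expected. The parsed `county` is lowercase, so comparing it with the county column in the CSV would need to be case-insensitive.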

dwillis commented 7 years ago

@nk9: yeah, that 2012 Washington special should be renamed to include general. Good catch. In terms of the X for district, you can ditch those lines.
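
Ditching the "X" lines could be sketched as below. This is only an illustration: the function name `drop_excluded_rows` and the `votes` column name are assumptions, and it applies to this repo only, since Mississippi keeps "X" in the votes column on purpose.

```python
import csv
import io


def drop_excluded_rows(csv_text, votes_column="votes"):
    """Return (cleaned_csv_text, dropped_count) with 'X' vote rows removed.

    Assumes csv_text is non-empty and starts with a header row.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    dropped = 0
    for row in reader:
        # A short row yields None for the missing column, hence the `or ""`.
        if (row.get(votes_column) or "").strip().upper() == "X":
            dropped += 1
            continue
        writer.writerow(row)
    return out.getvalue(), dropped
```

Returning the number of dropped rows makes it easy to confirm that only the padding lines went away before committing the cleaned file.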