kuriwaki / cvr_harvard-mit_scripts

6 stars 1 forks source link

Issues with duplicate records #308

Closed kuriwaki closed 2 months ago

kuriwaki commented 2 months ago

In the data to be released, there are four five counties which include cvr_ids each with two records for President. There should be just one.

These are

There may be other duplicates which affect congressional races (I only looked at president). From this sample, it looks like all of these duplicates are for fringe Presidential candidates.

suppressPackageStartupMessages(
  {
    library(tidyverse)
    library(arrow)
  }
)

ds <- open_dataset("release")

# Two rows per President
duplicated <- ds  |> 
  filter(office == "US PRESIDENT") |>  
  count(state, county_name, cvr_id) |> 
  filter(n >= 2)

## examples
ds |> 
  filter(office == "US PRESIDENT") |> 
  semi_join(duplicated) |> 
  collect() |> 
  select(county_name, cvr_id, precinct, candidate) |> 
  slice_sample(n = 2, by = c(cvr_id, county_name)) |> 
  arrange(cvr_id)
#> # A tibble: 1,466 × 4
#>    county_name cvr_id precinct             candidate                 
#>    <chr>        <int> <chr>                <chr>                     
#>  1 BROWN           73 Village of Howard W5 BRIAN T CARROLL           
#>  2 BROWN           73 Village of Howard W5 BRIAN T CARROLL           
#>  3 SAN JUAN       107 Precinct 11          KANYE WEST                
#>  4 SAN JUAN       107 Precinct 11          KANYE WEST                
#>  5 SAN JUAN       350 Precinct 2MC         JOE MCHUGH ELIZABETH STORM
#>  6 SAN JUAN       350 Precinct 2MC         JOE MCHUGH ELIZABETH STORM
#>  7 BROWN          368 Village of Howard W5 BRIAN T CARROLL           
#>  8 BROWN          368 Village of Howard W5 BRIAN T CARROLL           
#>  9 SAN JUAN       480 Precinct 5BD         JOE MCHUGH ELIZABETH STORM
#> 10 SAN JUAN       480 Precinct 5BD         JOE MCHUGH ELIZABETH STORM
#> # ℹ 1,456 more rows

# which counties
duplicated |> 
  collect() |> 
  count(county_name)
#> # A tibble: 5 × 2
#>   county_name     n
#>   <chr>       <int>
#> 1 BROWN         206
#> 2 KENOSHA       120
#> 3 SAN DIEGO      12
#> 4 SAN JUAN       64
#> 5 WAUKESHA      331

# how many voters in how many precincts?
ds |> 
  filter(office == "US PRESIDENT") |> 
  semi_join(duplicated) |> 
  summarize(
    n_records_dup = n(),
    n_precincts = n_distinct(precinct),
    .by = county_name
    ) |> 
  collect()
#> # A tibble: 5 × 3
#>   county_name n_records_dup n_precincts
#>   <chr>               <int>       <int>
#> 1 SAN DIEGO             120           2
#> 2 KENOSHA               240          60
#> 3 BROWN                 412          73
#> 4 WAUKESHA              662         130
#> 5 SAN JUAN              128           9

# how many precincts are there to begin with
ds |> 
  filter(county_name %in% c("BROWN", "WAUKESHA", "KENOSHA"),
         state == "WISCONSIN") |> 
  summarize(.by = county_name, n_distinct(precinct)) |> 
  collect()
#> # A tibble: 3 × 2
#>   county_name `n_distinct(precinct)`
#>   <chr>                        <int>
#> 1 BROWN                           96
#> 2 KENOSHA                        120
#> 3 WAUKESHA                       190

Created on 2024-06-27 with reprex v2.1.0

mreece13 commented 2 months ago

All good catches, there was some inconsistent assignment of the parties in our manual classification of new parties (part of pass2). I've caught these for President and about 20 in other down-ballot contests. They should be ready for the next build.

kuriwaki commented 2 months ago

This is not fixed in San Juan, UT and that is producing duplicates. Should I make an adhoc fix at the end?

mreece13 commented 2 months ago

Just fixed it for San Juan in the medsl/ data, can remove the ad-hoc fix if you want.

kuriwaki commented 2 months ago

Thanks. President is now all deduplicated! However, I did check the other offices now. I found the duplicates in the following. Going to reopen this issue, if these are small enough to address in this round.

I checked one county (Charlevoix), and it also seemed like fringe party voters getting two due to two spellings of a party -- e.g. NLP and NATURAL LAW)

US SENATE:

US HOUSE

STATE SENATE

STATE HOUSE

mreece13 commented 2 months ago

Can you update this list with the new MEDSL data? It should've been recently updated.

kuriwaki commented 2 months ago

Senates are now all fixed! A few remaining ones in the us/state house, which I researched and updated in the list above.

Lee, GA is a county I did not catch yesterday but should have. Lake, FL is also a county I did not catch yesterday but should have. I can fix downstream too. Orange, FL is from yesterday and looks a typical fringe candidate case, maybe you missed one of them Seneca, OH is from yesterday and the state house records are complicated. Not sure. We could just not release it

(more details above, I also verified these issues in the original medsl/ version)

If rerunning scripts take some time, one of us can fix it downstream as in #329 too.

mreece13 commented 2 months ago

I believe I have resolved all of the remaining issues, and they should all be present in medsl/ in the Dropbox.

kuriwaki commented 2 months ago

Ohio Seneca was still giving me duplicates, but I fixed it downstream in the referenced commit above. I'll let you review the new PR, and if it looks good, I can merge.

mreece13 commented 2 months ago

You sure there are still duplicates in Seneca, OH? I'm not getting any. Their election results also show that both 87 and 88 are districts in the county.

library(tidyverse)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:lubridate':
#> 
#>     duration
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

open_dataset("~/Dropbox (MIT)/Research/CVR_parquet/medsl/") |> 
  filter(state == "OHIO", office == "STATE HOUSE", county_name == "SENECA") |> 
  count(cvr_id, district) |> 
  filter(n>1) |> 
  collect()
#> # A tibble: 0 × 3
#> # ℹ 3 variables: cvr_id <int>, district <chr>, n <int>

Created on 2024-07-08 with reprex v2.1.0

kuriwaki commented 2 months ago

If you remove the district variable in your count(), you will see the duplicates. And yes, that's the tricky thing about this county -- the county includes both districts, but it can't be that a single cvr_id voted in two districts. One of them must be bug, but it's not immediately obvious which one. In the above, I explain my reasoning why I would opt to remove the 87s.