Closed kuriwaki closed 2 months ago
All good catches. There was some inconsistent assignment of the parties in our manual classification of new parties (part of pass2). I've caught these for President and about 20 in other down-ballot contests. They should be ready for the next build.
This is not fixed in San Juan, UT, and that is producing duplicates. Should I make an ad hoc fix at the end?
Just fixed it for San Juan in the medsl/ data, can remove the ad hoc fix if you want.
Thanks. President is now all deduplicated! However, I did check the other offices now. I found the duplicates in the following. Going to reopen this issue, if these are small enough to address in this round.
I checked one county (Charlevoix), and it also seemed like fringe party voters getting two records due to two spellings of a party (e.g. NLP and NATURAL LAW).
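A minimal sketch of how the two-spellings case could be collapsed, assuming toy data rather than the real medsl/ files. The `candidate` and `party` column names, the `party_lookup` vector, and the values in `toy` are all hypothetical and only there for illustration; the actual fix lives in the upstream party classification.

```r
library(dplyr)

# Hypothetical toy data: cvr_id 1 has the same vote recorded under two party spellings
toy <- tibble(
  cvr_id    = c(1L, 1L, 2L),
  office    = "US PRESIDENT",
  candidate = c("CANDIDATE A", "CANDIDATE A", "CANDIDATE B"),
  party     = c("NLP", "NATURAL LAW", "DEM")
)

# Hypothetical lookup from variant spelling to the spelling we keep
party_lookup <- c("NLP" = "NATURAL LAW")

deduped <- toy |>
  # Use the standardized spelling where the lookup has one, else keep the original
  mutate(party = coalesce(unname(party_lookup[party]), party)) |>
  # Rows that differed only in party spelling are now identical and collapse to one
  distinct(cvr_id, office, candidate, party)

deduped
```

This only collapses rows that become identical after the recode, so a cvr_id with two genuinely different candidates would still show up as a duplicate.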
US SENATE:
US HOUSE
cvr_ids c(100000, 200000, 300000, 400000) are each triplets for the district, causing the same issue for state house. They look like header rows that are not actual votes. You or I can simply delete these 4 cvr_ids from the dataset.
STATE SENATE
STATE HOUSE
TIFFIN, have two values for state house: 087 and 088. According to their detailed report, there should be NO district 87 in TIFFIN precincts. So I think we should delete only the records that are cvr_id duplicates and have district 87, but I'm not sure.
Can you update this list with the new MEDSL data? It should've been recently updated.
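The TIFFIN rule proposed above (drop a record only when its cvr_id is duplicated and the record has district 87) could be sketched as follows. This runs on hypothetical toy data, not the real files, and whether 87 is the right district to drop is still an open question in this thread.

```r
library(dplyr)

# Hypothetical toy data: cvr_id 1 is recorded in both districts,
# cvr_id 2 appears once in 087, cvr_id 3 appears once in 088
toy <- tibble(
  cvr_id   = c(1L, 1L, 2L, 3L),
  district = c("087", "088", "087", "088")
)

cleaned <- toy |>
  group_by(cvr_id) |>
  # Drop a row only if its cvr_id occurs more than once AND the row is district 087
  filter(!(n() > 1 & district == "087")) |>
  ungroup()

cleaned
```

A cvr_id that appears only once in district 087 is left alone, so the rule touches the duplicated records and nothing else.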
Senates are now all fixed! A few remaining ones in the us/state house, which I researched and updated in the list above.
Lee, GA is a county I did not catch yesterday but should have.
Lake, FL is also a county I did not catch yesterday but should have. I can fix downstream too.
Orange, FL is from yesterday and looks like a typical fringe candidate case; maybe you missed one of them.
Seneca, OH is from yesterday and the state house records are complicated. Not sure. We could just not release it.
(more details above, I also verified these issues in the original medsl/ version)
If rerunning scripts takes some time, one of us can fix it downstream as in #329 too.
I believe I have resolved all of the remaining issues, and they should all be present in medsl/ in the Dropbox.
Ohio Seneca was still giving me duplicates, but I fixed it downstream in the referenced commit above. I'll let you review the new PR, and if it looks good, I can merge.
You sure there are still duplicates in Seneca, OH? I'm not getting any. Their election results also show that both 87 and 88 are districts in the county.
library(tidyverse)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:lubridate':
#>
#> duration
#> The following object is masked from 'package:utils':
#>
#> timestamp
open_dataset("~/Dropbox (MIT)/Research/CVR_parquet/medsl/") |>
  filter(state == "OHIO", office == "STATE HOUSE", county_name == "SENECA") |>
  count(cvr_id, district) |>
  filter(n > 1) |>
  collect()
#> # A tibble: 0 × 3
#> # ℹ 3 variables: cvr_id <int>, district <chr>, n <int>
Created on 2024-07-08 with reprex v2.1.0
If you remove the district variable in your count(), you will see the duplicates. And yes, that's the tricky thing about this county: the county includes both districts, but it can't be that a single cvr_id voted in two districts. One of them must be a bug, but it's not immediately obvious which one. In the above, I explain my reasoning why I would opt to remove the 87s.
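To illustrate why the two counts disagree, here is a small sketch on hypothetical toy data (not the real parquet files). Only the `cvr_id` and `district` column names come from the query above.

```r
library(dplyr)

# Hypothetical toy data: cvr_id 2 is recorded once in each district
toy <- tibble(
  cvr_id   = c(1L, 2L, 2L, 3L),
  district = c("088", "087", "088", "088")
)

# Counting by cvr_id AND district: every pair is unique, so nothing is flagged
by_pair <- toy |>
  count(cvr_id, district) |>
  filter(n > 1)

# Counting by cvr_id alone: the ballot recorded in two districts shows up
by_id <- toy |>
  count(cvr_id) |>
  filter(n > 1)

by_pair
by_id
```

Including district in the grouping hides exactly the case at issue here, because a ballot that appears once per district never repeats within a district.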
In the data to be released, there are five counties which include cvr_ids each with two records for President. There should be just one.
These are
There may be other duplicates which affect congressional races (I only looked at president). From this sample, it looks like all of these duplicates are for fringe Presidential candidates.
Created on 2024-06-27 with reprex v2.1.0