kuriwaki / cvr_harvard-mit_scripts


[CA] Sonoma #296

Closed aconevska closed 2 months ago

aconevska commented 2 months ago

Recommendation: Use Harvard data

The Sonoma county office reports 268317 total ballots cast for President in 2020. The cleaned Harvard data contains 267927 votes for President. I cannot identify exactly why Harvard is off by 390 votes, but my hunch is this: 780 votes for President in the raw Sonoma "cvr.csv" file are marked as "redacted for voter privacy" (and we of course do not count these), which is exactly twice the number of Presidential votes Harvard is missing. It's possible that half of these were also marked as overvotes, and so not listed in the county's official count, but we cannot know for sure because the Sonoma "cvr.csv" file does not distinguish overvotes (or undervotes, though we can manually identify those).
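As a quick sanity check on the arithmetic above, here is a minimal Python sketch. It only restates the three figures quoted in this comment; the variable names are made up for illustration, and nothing is read from the actual "cvr.csv" file.

```python
# Figures quoted in the comment above (variable names are illustrative only).
official_total = 268_317  # Sonoma county office: ballots cast for President, 2020
harvard_total = 267_927   # cleaned Harvard data: votes for President
redacted_rows = 780       # rows marked "redacted for voter privacy" in cvr.csv

# Size of the discrepancy between the county and the Harvard data.
gap = official_total - harvard_total
print(gap)

# Is the redacted count exactly twice the discrepancy?
print(redacted_rows == 2 * gap)
```

This confirms only that the numbers line up. It does not show that the redacted rows are the cause.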

Note that the cvr-status spreadsheet reports Harvard's total Presidential votes for Sonoma as 263409 but we have 267927 total and 261383 for Biden or Trump.

If the numbers for MEDSL Sonoma President counts are correct in the cvr-status spreadsheet, they are off by 20041 votes for President. The most obvious explanation is that they are not counting "Election Day" votes, which amounted to 15818 for President (15328 for Biden and Trump specifically). In the raw Sonoma "cvr.csv" file these are marked as countgroup == "Polling Place", of which Harvard has 15623 total for President. Note that this is 195 votes short of what Sonoma county officially reports for "Election Day" votes for President, which is half of 390, the total number Harvard is off by across all values of countgroup (or vote types). So half of our missing count are likely "Election Day" votes that were redacted for privacy. Indeed, if we look at the cross tab of presidential vote choice and the "countgroup" variable in the raw Sonoma data, 396 of the "redacted for voter privacy" observations were countgroup == "Polling Place".
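The "Election Day" comparison can be checked the same way. This is a rough sketch using only the figures quoted in this comment, with illustrative variable names. The 390 is the overall Harvard shortfall (268317 official minus 267927 Harvard).

```python
# Figures quoted in the comment above (variable names are illustrative only).
official_election_day = 15_818  # county's "Election Day" votes for President
harvard_polling_place = 15_623  # Harvard rows with countgroup == "Polling Place"
overall_gap = 268_317 - 267_927  # Harvard shortfall across all count groups

# Shortfall within the "Election Day" / "Polling Place" group alone.
election_day_gap = official_election_day - harvard_polling_place
print(election_day_gap)

# Is that shortfall exactly half of the overall one?
print(2 * election_day_gap == overall_gap)
```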

mreece13 commented 2 months ago

Pending a new build on the MEDSL side, but we are now very close. The missing votes are because Sonoma redacted some precincts for voter privacy. Running table(d$`President and Vice President Vote for 1 Joseph p Biden Dem`) on the raw file will show this clearly. I think that the Harvard team needs to re-run their fragmentation parser, since their votes for Biden do not match the raw total reported (without redactions). @kuriwaki
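For anyone not working in R, a tabulation like the one above can be sketched in Python with the standard library. This is only an illustration. The inline CSV is made-up toy data, the cvr_id column and the "1"/"0" cell values are assumptions about the layout, and only the Biden column name and the "redacted for voter privacy" label come from this thread.

```python
import csv
import io
from collections import Counter

# Column name as quoted in this thread; everything else below is toy data.
BIDEN_COL = "President and Vice President Vote for 1 Joseph p Biden Dem"

# Hypothetical stand-in for the raw cvr.csv (assumed layout, not the real file).
toy_csv = io.StringIO(
    f"cvr_id,{BIDEN_COL}\n"
    "1,1\n"
    "2,0\n"
    "3,redacted for voter privacy\n"
    "4,1\n"
    "5,redacted for voter privacy\n"
)

# Count every distinct value in the column, redaction label included.
counts = Counter(row[BIDEN_COL] for row in csv.DictReader(toy_csv))
for value, n in counts.most_common():
    print(f"{value!r}: {n}")
```

On the real file, the count next to the redaction label would be the number of ballots whose Biden cell was withheld.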

aconevska commented 2 months ago

@mreece13 Ok we can re-run for sure! But I'm not sure I follow why our miscount is likely from unrecovered fragments? Were the missing votes for MEDSL all due to the redacted precincts?

mreece13 commented 2 months ago

I'm not too sure? But if you read the raw cvr.csv before any of the fragmentation procedure is applied, the total for Biden should be 199,722. 780 voters are redacted. The status sheet right now shows Harvard reporting 199,000 votes, although I'm not sure when that was last updated. It's not the correct raw total, nor does it look like all the redacted voters were miscounted, so I'm not too sure what has gone wrong?

Sorry I can't be of more help, I'm not sure how y'all are exactly processing the fragmented files.

aconevska commented 2 months ago

Ok. So Jim's "CA_Sonoma_long.dta" processed file has the right total for Biden - 199,722. But after running Shiro's R scripts to compile all counties to parquet and recover fragments, "direct_to_parquet_loop.R" and "merge-party_snyder.R", I end up with 198,853 votes for Biden in Sonoma county. Tagging @kuriwaki because I believe he wrote all of that code, and I just ran it locally for California yesterday for the first time, since Jeff and Jim had updated several CA counties earlier this month.

So I think you're right that maybe it's happening within the fragmentation procedure. But it's not totally clear. When I manually cleaned Sonoma and posted the above issue earlier this week, I was 390 votes off for all presidential votes, which is different from the gap in the compiled parquet files now - 869 just for Biden.

I will keep looking into this later today. Apologies if it takes me some time.

kuriwaki commented 2 months ago

One thing that happens in the code you mention is that I de-duplicate rows that repeat the same cvr_id and column: https://github.com/kuriwaki/cvr_harvard-mit_scripts/blob/aa712045c66e1c6456872b58297a9418ddbdf73c/R/build-harvard/01_direct-to-parquet-loop.R#L84-L91 I can't think of a good reason why Jim's _long data should contain duplicates, but they are there in some counties.
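To illustrate the general idea (this is not the linked R code, just a hypothetical Python sketch), de-duplicating on a (cvr_id, column) key could look like the following. The function name, the candidate field, and the toy rows are all made up. Only the cvr_id and column names come from this thread.

```python
def drop_duplicate_keys(rows, key_fields=("cvr_id", "column")):
    """Keep the first row seen for each key; drop later rows with the same key."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            continue  # a row with this (cvr_id, column) pair was already kept
        seen.add(key)
        kept.append(row)
    return kept


# Toy long-format rows (hypothetical values).
rows = [
    {"cvr_id": 1, "column": 5, "candidate": "BIDEN"},
    {"cvr_id": 1, "column": 5, "candidate": "BIDEN"},  # repeats the key above
    {"cvr_id": 2, "column": 5, "candidate": "TRUMP"},
    {"cvr_id": 2, "column": 6, "candidate": "SMITH"},
]

deduped = drop_duplicate_keys(rows)
print(len(rows), len(deduped))
```

In a scheme like this, any row that shares a key with an earlier row is dropped, so comparing row counts before and after the step shows how many rows it removed for a given county.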

Another thing that the script does is, yes, fuse together fragmented ballots.

kuriwaki commented 2 months ago

I am closing issues that are only #fix-harvard (but MEDSL is correct) as "not planned" for now. The Harvard team should revisit them later by using the hashtag.