kuriwaki / cvr_harvard-mit_scripts


[CA] Contra Costa #300

Closed aconevska closed 2 months ago

aconevska commented 2 months ago

Recommendation: Use Harvard. (The MEDSL data might reach the same state as Harvard once the run with Mason's updated 'fragmentation' parser finishes, but see my thoughts on this below.)

The Contra Costa county office reports 581230 total votes for president. Harvard has 581322 total votes for president, 92 more than the county total. These 92 extra votes come from candidates that Contra Costa county does not include in its total: Brock Pierce (3), Joseph Kishore (2), Jesse Ventura (10), Brian Carroll (57), Mark Charles (14), and Write-in (6). (The county might require 100 or more votes before a candidate is added to the official tally?) (Also note that the cvr_status spreadsheet has 575173 votes for uspres_votes_h, which I can update. I think Jim may have made some changes to the Contra Costa file; otherwise I'm not sure why there is a discrepancy.)

So we basically do better than the county itself! We can make a note to drop these votes if we want to reflect official totals better, but it will be inconsequential.
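As a quick check on the arithmetic above, here is a minimal Python sketch that reconciles the two totals. The numbers are the ones quoted in this comment; the variable names are made up for illustration and do not come from the project's scripts.

```python
# Totals quoted in the comment above.
county_total = 581230
harvard_total = 581322

# Candidates present in the Harvard data but not in the county's total.
extra_votes = {
    "Brock Pierce": 3,
    "Joseph Kishore": 2,
    "Jesse Ventura": 10,
    "Brian Carroll": 57,
    "Mark Charles": 14,
    "Write-in": 6,
}

surplus = harvard_total - county_total   # Harvard minus county
unlisted = sum(extra_votes.values())     # votes the county leaves out

# Dropping the unlisted candidates puts Harvard on the county's basis.
harvard_official_basis = harvard_total - unlisted

print(surplus, unlisted, harvard_official_basis == county_total)
# prints: 92 92 True
```

The surplus is fully explained by the six unlisted lines, so dropping them would reproduce the county figure exactly.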

On MEDSL, fragmentation is very likely the primary issue, though I'm not certain. I think we recover roughly 150000 votes with our recovery method, and MEDSL is 51835 short, so I would expect them to end up close if fragmentation is their only issue. After Mason's updated parser has finished running, I can check this! @mreece13 perhaps just let me know when it's done?

mreece13 commented 2 months ago

Yes, I will let you know! Unfortunately my fragmentation code is very slow: Contra Costa has been running on the cluster for ~20 hours and still isn't done. It'll probably be a few days before everything trickles down through the pipeline.

mreece13 commented 2 months ago

Contra Costa now matches perfectly in the MEDSL data.

kuriwaki commented 2 months ago

Great, Mason. I will keep this closed, but Harvard is now 200 votes short, so I am adding the #fix-harvard tag. (cc @aconevska)

aconevska commented 2 months ago

@kuriwaki thank you! Do you mean we are 200 votes short after you re-ran "01_direct-to-parquet-loop.R"? Just looking for where to start, since it was perfect when I ran my checks just last week (aside from the third-party presidential candidate votes not counted by Contra Costa county).

kuriwaki commented 2 months ago

Re: "Do you mean we are 200 votes short after you re-ran "01_direct-to-parquet-loop.R"?"

Yes, Harvard is about 140 votes short for Biden, for example, after I run the whole set of scripts in 01_direct-to-parquet-loop.R. Is your perfect result in the original _long.dta? If so, then the issue might be how we are deduping. (https://github.com/kuriwaki/cvr_harvard-mit_scripts/issues/296#issuecomment-2197357404)

aconevska commented 2 months ago

> Is your perfect result in the original _long.dta? If so, then the issue might be how we are deduping.

Yes, exactly. The "STATA_long/CA_Contra_Costa_long.dta" file has the exact right count.

I can go over "01_direct-to-parquet-loop.R" to see whether the de-duping is the source. I can write a replicable example if I find any evidence.
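To make the hypothesis concrete, here is a minimal Python sketch of one way a de-duplication step could lose valid votes. It is only an illustration: the row layout, precinct labels, candidate strings, and the `dedupe` and `tally` helpers are all hypothetical, and the actual de-duping logic in "01_direct-to-parquet-loop.R" is not shown in this thread. The idea is that de-duping on a key that is too coarse (here, one that ignores precinct) treats distinct ballots with a reused `cvr_id` as duplicates.

```python
from collections import Counter

# Hypothetical long-format rows: (precinct, cvr_id, office, candidate).
rows = [
    ("P1", 1, "US PRESIDENT", "JOSEPH R BIDEN"),
    ("P1", 2, "US PRESIDENT", "DONALD J TRUMP"),
    ("P2", 1, "US PRESIDENT", "JOSEPH R BIDEN"),  # cvr_id reused in another precinct
    ("P2", 2, "US PRESIDENT", "JOSEPH R BIDEN"),
    ("P2", 2, "US PRESIDENT", "JOSEPH R BIDEN"),  # a true duplicate row
]

def dedupe(rows, key):
    """Keep the first row seen for each key, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept

def tally(rows):
    """Count votes per candidate."""
    return Counter(candidate for _, _, _, candidate in rows)

# Coarse key ignores precinct; full key uses the entire row.
coarse = dedupe(rows, key=lambda r: r[1:])
full = dedupe(rows, key=lambda r: r)

# Votes that survive the full-row dedupe but are lost with the coarse key.
lost = tally(full) - tally(coarse)
print(dict(lost))
# prints: {'JOSEPH R BIDEN': 1}
```

Comparing per-candidate tallies before and after each step, as above, is one way to check whether the shortfall appears at the de-duping stage or somewhere else in the pipeline.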