kuriwaki / cvr_harvard-mit_scripts


Empty cell values in raw spreadsheet #123

Open kuriwaki opened 6 months ago

kuriwaki commented 6 months ago

Dane County's (WI) raw CVR has the following values for President:

image

This is what both Jim's and Mason's final databases give after cleaning. Notice there is no red NA; all other values are the same.

image

The 1,137 "NAs" were dropped here, as if the office were not available on that ballot. However, that is implausible given that this is US President. In fact, the official county certification reports 1,146 write-in votes, of which 808 were "SCATTERING" and about 216 were for Hawkins, the Green Party candidate who failed to get on the ballot.

So it seems that in some cases these empty cell values should be "WRITE-IN", and there should be a corresponding vote entry in the long data. However, we don't have a good method to determine whether a given empty cell is a write-in or a contest that was not on the ballot (e.g. a split/fragmented paginated ballot).
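One rough way to triage these empty cells is sketched below. It is only a heuristic under stated assumptions, not the cleaning code used in this repo: it assumes a wide-format CVR with one row per ballot and one column per contest, and it assumes the other contests checked appear on the same ballot sheet as President. The column names, candidate labels, and the `classify_blank` helper are all hypothetical.

```python
import csv
import io

# Hypothetical wide-format CVR: one row per ballot, one column per contest.
# Column names and values are invented for illustration only.
RAW = """ballot_id,President,US House,State Senate
1,Candidate A,Candidate X,
2,,Candidate X,Candidate Y
3,Candidate B,,
4,,,
"""


def classify_blank(row, contest, other_contests):
    """Label the cell for `contest` on a single ballot row."""
    if row[contest].strip():
        return "recorded"
    # Assumption: the other contests share a sheet with `contest`, so a mark
    # in any of them suggests the contest was on this ballot. The blank is
    # then an undervote or a write-in that did not survive the export.
    if any(row[c].strip() for c in other_contests):
        return "possible write-in or undervote"
    # Nothing recorded anywhere: this may be a different page of a split
    # ballot, or a fully blank ballot. The heuristic cannot tell which.
    return "possibly not on this ballot sheet"


rows = list(csv.DictReader(io.StringIO(RAW)))
others = ["US House", "State Senate"]
labels = {r["ballot_id"]: classify_blank(r, "President", others) for r in rows}

for ballot_id, label in labels.items():
    print(ballot_id, label)
```

This cannot separate a true undervote from a lost write-in; it only narrows down which blanks are worth checking against the certified write-in totals.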

mreece13 commented 6 months ago

Hmm, this is going to be exceptionally hard to address. It does not appear consistently even within Wisconsin. I checked all of the counties, and it does not seem to occur in every county where there are NAs (i.e., some of the NAs are undervotes). Perhaps we can proceed on a case-by-case basis and write some code that can at least detect when this occurs in the President race. I can also think of a potential solution for the other contests, but it will likely not be a flawless system.

I am also a bit suspicious that the NAs don't exactly add up to the total reported write-in votes, but perhaps by re-adding the missing border precincts we would get to the correct number.
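For reference, the mismatch can be worked out from the Dane County figures quoted earlier in this thread. The snippet below is just that arithmetic; the variable names are made up, and the Hawkins count is the approximate "about 216" from above, so the residual is approximate too.

```python
# Figures quoted in this thread for Dane County, WI.
blank_president_cells = 1137   # "NA" values for President in the raw CVR
certified_write_ins = 1146     # write-ins in the official county certification
scattering = 808               # certified write-ins reported as "SCATTERING"
hawkins_approx = 216           # "about 216" per the thread, so approximate

# Certified write-ins that have no matching blank cell in the CVR.
gap = certified_write_ins - blank_president_cells

# Certified write-ins that are neither SCATTERING nor Hawkins (approximate).
other_write_ins_approx = certified_write_ins - scattering - hawkins_approx

# Blank cells as a share of certified write-ins.
share = blank_president_cells / certified_write_ins

print(gap, other_write_ins_approx, round(share, 3))
```

If some of the blank cells are actually undervotes, the number of certified write-ins missing from the CVR would be larger than this gap.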

kuriwaki commented 6 months ago

Agree all around. I might rank that ("case by case basis and write some code that can at least detect when this occurs in the President race") somewhat highly, because I do think some users of the data are going to be interested in analyzing third-party voters with it (given the attention to RFK in 2024, and as in the Herron and Lewis CVR article, "Did Ralph Nader Spoil a Gore Presidency?").

By the way, I think one way this kind of NA gets produced in the data is when the actual CVR is an Excel file with JPEG images for the write-ins. Here is an example from Bay, FL. This is what the Excel file looks like (note the "Mark Rogers" write-in for Congress):

image

In the CSV version, the cell is blank where the JPEG images are.

kuriwaki commented 4 months ago

Based on a few spot checks, I think this is specific to ES&S DS200 machines. I noted this in the paper.