Open dwillis opened 2 years ago
@dwillis I'd love to help but not sure where to start. I was looking at this file for Anderson County and not sure how to parse the first set of numbers. I'm sure others will do this quickly and/or better, but felt like I could contribute a few hours if it's helpful. Thanks for all the work you do!
@thefuturewasnow thanks! Yeah, those txt files can be tricky, so I might suggest something a bit easier like maybe Bee County, which basically involves reworking a CSV file into a slightly different format. You can see the result we want by looking at files I've converted for Austin, Bailey and Bandera counties in the 2022 folder.
I can get some of this, too. Is it okay if I add a `code` directory within the 2022 directory if I write a Python script?
@ssdatar thanks! We've got a `python-parsers` directory in the base directory of the repository, if that works?
Yep, I'll add it to the `python-parsers` directory.
@dwillis Sorry, quick question. I found some counties are in this format. Are these '000' separated?
I can see that in some counties, the last digits in the first column are the vote totals. Is that accurate? Just trying to understand the organization. If there's any documentation or a parser already in the repo that works with these kinds of files, please let me know. Thanks!
@ssdatar no worries! These are actually fixed-width files, not delimited by '000' or anything. We've got multiple parsers for them in the `python-parsers` directory, with names that begin with `asc`; which one applies depends on the number of vote-related columns in the file. I've been working my way through some of those.
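For anyone new to fixed-width files, here is a minimal Python sketch of the idea: each field occupies a fixed character range, so runs of zeros are most likely padding rather than separators. The field names and offsets in `FIELDS` are made up for illustration and do not describe any actual county file or the existing `asc` parsers.

```python
# Hypothetical layout: (field name, start offset, end offset).
# These offsets are assumptions for illustration only.
FIELDS = [
    ("precinct", 0, 4),
    ("contest_code", 4, 8),
    ("candidate_code", 8, 11),
    ("votes", 11, 17),
]


def parse_fixed_width_line(line, fields=FIELDS):
    """Slice one fixed-width record into a dict, stripping padding spaces."""
    record = {name: line[start:end].strip() for name, start, end in fields}
    # Vote totals are zero-padded in this made-up layout, so int() drops the padding.
    record["votes"] = int(record["votes"])
    return record


if __name__ == "__main__":
    sample = "0001" + "0012" + "003" + "000045"
    print(parse_fixed_width_line(sample))
```

Under this assumed layout the last six characters of the record are the vote total, which would match the observation that the trailing digits of the first column look like votes.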
@dwillis - I think that I've figured out the general parsing and am working on a short R program to do that. Is it possible that we may want to have an R-parsers directory as well as a python-parsers directory? I know that Python is more the standard but both languages are similar and in heavy use. In any case, I've run into a few issues. One is that there's a lack of standardization in the input file names. I have a list of the Texas counties so I could possibly cycle through that list and read the files that begin with the county name. Each county should have one with "DEMOCRATIC" in it and one with "REPUBLICAN" in it. That's assuming that the names follow that standard. But then I noticed that some of the counties are missing. Specifically, Harris and Maverick Counties are missing but there are likely others.
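A rough sketch of the check described above, written in Python to match the existing `python-parsers` directory: given a list of county names and a list of file names, report which party file was found for each county and which counties have no files at all. The file names in the example are hypothetical, and the code assumes that file names start with the county name and contain the party name, which the thread notes may not always hold.

```python
def find_party_files(counties, filenames, parties=("DEMOCRATIC", "REPUBLICAN")):
    """Map each county to the first matching file per party, or None if absent."""
    report = {}
    for county in counties:
        prefix = county.upper()
        # Assumption: a county's files begin with the county name.
        matches = [f for f in filenames if f.upper().startswith(prefix)]
        report[county] = {
            party: next((f for f in matches if party in f.upper()), None)
            for party in parties
        }
    return report


def missing_counties(report):
    """Return counties for which no party file was found at all."""
    return sorted(c for c, found in report.items() if not any(found.values()))


if __name__ == "__main__":
    counties = ["Anderson", "Harris"]
    # Hypothetical file names, for illustration only.
    files = ["ANDERSON_DEMOCRATIC.txt", "ANDERSON_REPUBLICAN.txt"]
    report = find_party_files(counties, files)
    print(report)
    print(missing_counties(report))
```

One caveat with plain prefix matching: a county whose name is a prefix of another (Lee and Leon, for example) would pick up the other county's files, so a real script would likely need a stricter match.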
All of this suggests that it would be best if I could download the entire directory so that I could run the program on it. Is there a simple way to do that? I tried to clone the Texas repository but it was too huge for the free space on my drive. However, I also realized that I might need to do that if I am going to check anything into the repository at some point. Do we have a document that describes how we should do these basic operations on this repository? Thanks.
Using Tabula, OCR or whatever method you can, parse precinct-level results for the following counties. Original sources are in the sources-tx repository.
The goal is to create a single CSV file for each county, with the following headers:
`county`, `precinct`, `office`, `district`, `party`, `candidate`, `votes`
If the county file also provides a breakdown of votes by method, include that using the following headers:
`early_voting`, `election_day`, `provisional`, `mail`
If there are other possible vote types, include them, using a lowercase version of the vote type with underscores instead of spaces for the column name.
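As a small illustration of the naming rule, a Python sketch might build the CSV header like this. The helper names (`vote_type_to_column`, `build_header`) and the example label "Limited Ballot" are hypothetical, not something taken from the repository.

```python
import re

# Required columns, in the order listed above.
BASE_HEADERS = ["county", "precinct", "office", "district", "party", "candidate", "votes"]


def vote_type_to_column(vote_type):
    """Lowercase a vote-type label and replace runs of whitespace with underscores."""
    return re.sub(r"\s+", "_", vote_type.strip().lower())


def build_header(vote_types):
    """Append one normalized column per vote type to the required headers."""
    return BASE_HEADERS + [vote_type_to_column(v) for v in vote_types]


if __name__ == "__main__":
    # "Limited Ballot" is a made-up vote type used only to show the rule.
    print(build_header(["Early Voting", "Election Day", "Limited Ballot"]))
```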
Include the following offices:
If a county provides precinct results for Write-in candidates, they should be grouped in a single row for each precinct and office with a `candidate` value of `Write-ins`.
If a county provides Under Votes or Over Votes, those should be recorded in the same way, with a single row per precinct and office with `Over Votes` and `Under Votes` as the `candidate` values.
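One way the write-in grouping could be sketched in Python is shown below. It assumes rows are dicts keyed by the headers above, and it takes a caller-supplied `is_write_in` predicate, because how a source file marks write-in candidates varies; both the function name and the predicate in the example are hypothetical. Party is included in the grouping key here as an assumption, so that rows from different primaries are not merged.

```python
from collections import defaultdict

GROUP_KEYS = ("county", "precinct", "office", "district", "party")


def group_write_ins(rows, is_write_in):
    """Collapse write-in rows into one 'Write-ins' row per precinct and office."""
    kept = []
    totals = defaultdict(int)
    for row in rows:
        if is_write_in(row):
            key = tuple(row.get(k, "") for k in GROUP_KEYS)
            totals[key] += int(row["votes"])
        else:
            kept.append(row)
    # Emit a single combined row for each group that had write-ins.
    for key, votes in totals.items():
        combined = dict(zip(GROUP_KEYS, key))
        combined["candidate"] = "Write-ins"
        combined["votes"] = votes
        kept.append(combined)
    return kept


if __name__ == "__main__":
    rows = [
        {"county": "Bee", "precinct": "1", "office": "Governor", "district": "",
         "party": "REP", "candidate": "Jane Doe", "votes": 10},
        {"county": "Bee", "precinct": "1", "office": "Governor", "district": "",
         "party": "REP", "candidate": "WRITE-IN A", "votes": 2},
        {"county": "Bee", "precinct": "1", "office": "Governor", "district": "",
         "party": "REP", "candidate": "WRITE-IN B", "votes": 3},
    ]
    # Hypothetical marker: candidate names that start with "WRITE-IN".
    for row in group_write_ins(rows, lambda r: r["candidate"].startswith("WRITE-IN")):
        print(row)
```

Over Votes and Under Votes usually arrive as a single line per contest already, so they typically only need the `candidate` value set rather than any summing.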