Closed: warwickmm closed this issue 3 years ago.
@warwickmm Because states and counties report results in different ways, precinct-results files often vary quite a bit. The basic format is county,precinct,office,district,party,candidate,votes. The order doesn't matter, but we try to aim for that order for readability. The column names should be lowercase; if they are not, that's a great fix to make. Removing the byte order mark is helpful, too.
We should also ditch empty columns and any columns that just hold row numbers (some of our folks use them when converting from PDFs).
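A minimal sketch of the cleanup described above, using only the Python standard library. The function name clean_rows is hypothetical, and two choices here are assumptions rather than project policy: only unnamed columns are dropped (a named column such as district can legitimately be empty), and a row-number column is taken to be an unnamed column counting up from 0 or 1.

```python
import csv
import io


def clean_rows(text):
    """Return cleaned rows (header first) parsed from raw CSV text."""
    # Drop a leading byte order mark, if present.
    text = text.lstrip("\ufeff")
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []

    # Lowercase the column names.
    header = [name.strip().lower() for name in rows[0]]
    body = rows[1:]

    keep = []
    for i, name in enumerate(header):
        values = [row[i].strip() if i < len(row) else "" for row in body]
        # Assumption: only unnamed columns are candidates for removal.
        is_empty = name == "" and all(v == "" for v in values)
        # Assumption: row numbers count up from 0 or from 1.
        is_row_number = name == "" and bool(values) and (
            values == [str(n) for n in range(len(values))]
            or values == [str(n) for n in range(1, len(values) + 1)]
        )
        if not (is_empty or is_row_number):
            keep.append(i)

    cleaned = [[header[i] for i in keep]]
    for row in body:
        cleaned.append([row[i] if i < len(row) else "" for i in keep])
    return cleaned
```

Called on the text of a file whose header is ,County,Precinct,Votes, (leading row-number column, trailing empty column), this would return rows with the header county,precinct,votes.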
Some state-specific repos have some verification code (Oregon is an example: https://github.com/openelections/openelections-data-or/blob/master/src/verifier.py), and it would be great to have a broader suite of verifiers to use across repositories or on any CSV file.
Thanks @dwillis. I'll work on adding a simple unit test to check for some of these inconsistencies. This way, we can at least do some minimal verification for new pull requests. We can later think about how to consolidate the verifier/test code so that it can be used across repositories. Where would such code live?
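As a rough idea of what such a unit test could look like, here is a sketch using unittest. It only checks that every file ending in precinct.csv has the basic columns, in any order and ignoring case. The DATA_DIR environment variable and the helper names read_header and missing_columns are assumptions made for illustration, not existing project code.

```python
import csv
import glob
import os
import unittest

# The basic columns named earlier in this thread.
REQUIRED_COLUMNS = {
    "county", "precinct", "office", "district", "party", "candidate", "votes",
}

# Hypothetical setting: where to look for data files (defaults to the current directory).
DATA_DIR = os.environ.get("DATA_DIR", ".")


def read_header(path):
    """Return the first row of a CSV file as a list of column names."""
    # The utf-8-sig codec skips a leading byte order mark if there is one.
    with open(path, newline="", encoding="utf-8-sig") as f:
        return next(csv.reader(f), [])


def missing_columns(header):
    """Return the required columns absent from header, sorted, ignoring case."""
    present = {name.strip().lower() for name in header}
    return sorted(REQUIRED_COLUMNS - present)


class PrecinctHeaderTest(unittest.TestCase):
    def test_required_columns_present(self):
        pattern = os.path.join(DATA_DIR, "**", "*precinct.csv")
        for path in glob.glob(pattern, recursive=True):
            # subTest reports each failing file separately instead of stopping at the first.
            with self.subTest(path=path):
                self.assertEqual(missing_columns(read_header(path)), [])
```

Saved as a test module, this could be run with python -m unittest. Stricter checks (lowercase names, no unnamed columns) could be added as separate test methods once the expected format is settled.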
While we're discussing tests, would there be any interest in moving from Travis to using GitHub workflows to run the tests? GitHub workflows seem to be the more powerful/flexible option now. Travis has also made some questionable decisions recently, including moving towards a paid model where open source projects are only given a limited number of credits before more need to be purchased/requested.
@warwickmm Thanks! That sounds like a good plan. I think down the line verifier code could live as its own package. And absolutely willing to move to GitHub workflows for running the tests.
> Travis has also made some questionable decisions recently, including moving towards a paid model where open source projects are only given a limited number of credits before more need to be purchased/requested.
Note that GitHub Actions also charges above a certain amount of usage per month, but both GitHub Actions and Travis CI claim to be free for open source projects (you may have to qualify/appeal for this perk).
IMHO Actions are slightly nicer because integrating existing Actions from the GitHub Marketplace is a little easier, but ymmv.
It sounds like Travis credits have to be requested every time you run out, or negotiated for a renewable amount. I believe that GitHub Actions requires less "maintenance".
Regardless, I also prefer GitHub Actions for the additional ease and flexibility.
I was trying to think of some simple unit tests to verify the formats of the CSV files. However, it's not clear to me what the expected format should be, if there is one.
For example, I have found that files whose names do not end in precinct.csv all appear to have the following header: [...]
Files whose names end with precinct.csv have a variety of differences. Most have the following header:
county,precinct,office,district,party,candidate,votes
However, there are many that do not. Some differences include:
- [...] county,precinct,office,district,party,candidate,votes.
- The columns county,precinct,office,district,party,candidate,votes appearing in a different order.
- An extra leading column (,county,precinct,office,district,party,candidate,votes), whose row values are just the row number.
- An extra trailing column (county,precinct,office,district,party,candidate,votes,) whose row values are all empty.
- Capitalized column names: County,Precinct,Office,District,Party,Candidate,Votes,election_day,mail
- A leading byte order mark, \ufeff.
Do we have an expected file format, and is there any interest in having tests to verify the existing data?
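To make the differences listed above concrete, here is a small sketch that labels them for a single raw header line. The function name header_problems and the label strings are made up for illustration. It assumes the basic column list is the expected one, and it splits on commas directly, so it would not cope with quoted column names.

```python
EXPECTED = ["county", "precinct", "office", "district", "party", "candidate", "votes"]


def header_problems(header_line):
    """Return a list of short labels describing how a header line deviates."""
    problems = []

    # A byte order mark shows up as "\ufeff" at the very start of the line.
    if header_line.startswith("\ufeff"):
        problems.append("byte order mark")
        header_line = header_line.lstrip("\ufeff")

    names = [name.strip() for name in header_line.rstrip("\r\n").split(",")]

    # A leading or trailing comma produces a column with an empty name.
    if any(name == "" for name in names):
        problems.append("unnamed column")

    if any(name != name.lower() for name in names):
        problems.append("uppercase in column names")

    lowered = [name.lower() for name in names if name]
    missing = [name for name in EXPECTED if name not in lowered]
    if missing:
        problems.append("missing: " + ",".join(missing))

    # Extra columns may be legitimate; they are reported for information only.
    extra = [name for name in lowered if name not in EXPECTED]
    if extra:
        problems.append("extra: " + ",".join(extra))

    # Only compare ordering when every expected column is present.
    core = [name for name in lowered if name in EXPECTED]
    if not missing and core != EXPECTED:
        problems.append("different order")

    return problems
```

For the header County,Precinct,Office,District,Party,Candidate,Votes,election_day,mail this reports the uppercase names and the two extra columns, and for a header with a leading or trailing comma it reports an unnamed column.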