ron-rivest / 2017-bayes-audit

Repository for paper "Bayesian Tabulation Audits Explained and Extended"
MIT License

Synthetic data for Colorado, for testing ColoradoRLA #10

Open nealmcb opened 7 years ago

nealmcb commented 7 years ago

The ColoradoRLA project could use some realistic test data, and it looks like this code base might be able to generate it.

In particular, the 2017 off-year election in Colorado will have a similar set of contests to the 2015 election. To simulate and test the way the ColoradoRLA audit will work, and for testing in late August and early September, it would be very helpful to have synthetic CVR data corresponding to the actual vote counts from the contests in 2015.

That data is available in the 2017-rivest-voting-group repository (/data/2015/colorado-clarity/county-csvs/), and directly from 2015 CO - Election Results.
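As a rough sketch of what "synthetic CVR data corresponding to the actual vote counts" could mean, the following expands per-contest totals into per-card selections whose tallies match the totals exactly. The function name, the in-memory dict layout, and the simplifying assumption that every contest is vote-for-one are illustrative only; nothing here comes from the existing code base.

```python
import random


def synthesize_selections(contest_counts, n_cards, seed=0):
    """Expand vote totals into one selection per card per contest.

    contest_counts: {contest: {choice: votes}} (hypothetical layout).
    Returns a list of n_cards dicts mapping contest -> choice, with None
    meaning no vote recorded in that contest. Assumes vote-for-one contests.
    """
    rng = random.Random(seed)  # seeded so test data is reproducible
    cards = [dict() for _ in range(n_cards)]
    for contest, counts in contest_counts.items():
        total = sum(counts.values())
        if total > n_cards:
            raise ValueError(f"{contest}: {total} votes but only {n_cards} cards")
        # One entry per vote, padded with None for cards without a vote.
        column = [choice for choice, votes in counts.items() for _ in range(votes)]
        column += [None] * (n_cards - total)
        rng.shuffle(column)  # spread the votes randomly over the cards
        for card, choice in zip(cards, column):
            card[contest] = choice
    return cards
```

Shuffling each contest independently gives uncorrelated contests, which is unrealistic but enough for a first round of testing.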

For the ColoradoRLA tool, the CVRs should be in the Dominion CSV CVR format, exactly like the data at ColoradoRLA/arapahoe-regent-3-clear-CVR_Export.csv. Note the non-standard use of multiple header lines, for contest names, candidate names and party affiliations. A larger, more authentic example is at ColoradoRLA/dominion-2017-CVR_Export_20170310104116.csv, which is an even stranger CSV, but is the actual current format. A cleaner, valid CSV version is at ColoradoRLA/dominion-2017-CVR_Export_20170310104116-clean.csv.
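To illustrate the multiple-header-line idea, here is a minimal sketch that writes separate header rows for contest names, candidate names and party affiliations before the data rows. The exact row order, any extra leading lines, and the full set of identifying columns are assumptions and should be copied from the example files above rather than from this sketch; the function name is hypothetical.

```python
import csv
import io

# Identifying columns named in this issue; the real export has more.
ID_COLUMNS = ["TabulatorNum", "BatchID"]


def write_multiheader_cvr(out, contests, rows):
    """Write a CVR-like CSV with three header rows.

    contests: list of (contest_name, [(candidate, party), ...]).
    rows: list of (id_values, votes), where votes holds one 0/1 flag
          per candidate column, in the same order as the headers.
    """
    writer = csv.writer(out)
    pad = [""] * len(ID_COLUMNS)
    contest_row, candidate_row = list(pad), list(pad)
    party_row = list(ID_COLUMNS)  # assumed: ID column names share the party row
    for contest, candidates in contests:
        for candidate, party in candidates:
            contest_row.append(contest)  # contest name repeats per candidate
            candidate_row.append(candidate)
            party_row.append(party)
    writer.writerows([contest_row, candidate_row, party_row])
    for id_values, votes in rows:
        writer.writerow(list(id_values) + list(votes))


if __name__ == "__main__":
    buf = io.StringIO()
    write_multiheader_cvr(
        buf,
        [("Regent", [("Alice", "DEM"), ("Bob", "REP")])],
        [((1, 1), [1, 0]), ((1, 1), [0, 1])],
    )
    print(buf.getvalue())
```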

The Clarity data doesn't indicate batch sizes, etc., so to start with, it's fine to just use a single TabulatorNum value, say "1", and a single BatchID value, say "1", or perhaps mostly uniform batches of 100 or 200 in size. CountingGroup and PrecinctPortion are also not important.
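Both options for the batch fields can be sketched in a few lines. The helper name and its parameters are hypothetical; with no batch size it puts every card in batch 1, otherwise it fills uniform batches in order and leaves the last one short.

```python
def assign_batches(n_cards, batch_size=None, tabulator_num=1):
    """Return one (TabulatorNum, BatchID) pair per card.

    batch_size=None puts every card in batch 1; otherwise cards fill
    batches of batch_size in order, numbered from 1.
    """
    if batch_size is None:
        return [(tabulator_num, 1) for _ in range(n_cards)]
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [(tabulator_num, i // batch_size + 1) for i in range(n_cards)]
```

For example, `assign_batches(250, batch_size=100)` yields two full batches of 100 cards followed by a third batch of 50.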

Note that the rows are not ballots, but ballot cards. Most ballots have two cards. It's fine to simplify at first to produce synthetic data in which the simulated CVRs have all contests on a single card.

If anyone would be willing to help us out, we would love it, and it would help advance auditing in Colorado, the US and the world!

nealmcb commented 7 years ago

Ideally, eventually, the data would have realistic batch sizes, and would make some attempt to match the non-random distribution of votes by choice in each batch, and even within each batch. Access to some existing large CVRs can be arranged to derive those distributions.

That is especially important for no-CVR ballot-polling audits, and for simulating the effects of various more efficient procedures for selecting ballots.