Synthetic data for Colorado, for testing ColoradoRLA

The ColoradoRLA project could use some realistic test data, and it looks like this code base might be able to generate it.

In particular, the 2017 off-year election in Colorado will have a similar set of contests to the 2015 election. To simulate and test how the ColoradoRLA audit will work, and for the testing in late August and early September, it would be very helpful to have synthetic CVR data whose vote counts match the actual counts from the 2015 contests.

That data is available in the 2017-rivest-voting-group repository (/data/2015/colorado-clarity/county-csvs/), and directly from 2015 CO - Election Results.
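One way to turn per-contest totals into individual records is to expand each count into that many selections and shuffle them. The following is a minimal sketch, not part of the existing code base: the function name `expand_counts`, the use of an empty string for an undervote, and the example totals are all assumptions for illustration, not real 2015 results.

```python
import random


def expand_counts(vote_counts, seed=0):
    """Turn {choice: count} into a shuffled list with one entry per card.

    Hypothetical helper: the name and signature are illustrative only.
    """
    choices = []
    # Sort the keys so the result depends only on the counts and the seed.
    for choice, count in sorted(vote_counts.items()):
        choices.extend([choice] * count)
    random.Random(seed).shuffle(choices)
    return choices


# Made-up totals for illustration; "" stands for an undervote (an assumption).
example_counts = {"Candidate A": 5, "Candidate B": 3, "": 2}
example_votes = expand_counts(example_counts)
```

A fixed seed keeps the synthetic file reproducible, which makes it easier to compare audit runs against the same test data.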

For the ColoradoRLA tool, the CVRs should be in the Dominion CSV CVR format, exactly like the data at ColoradoRLA/arapahoe-regent-3-clear-CVR_Export.csv. Note the non-standard use of multiple header lines for contest names, candidate names and party affiliations. A larger, more authentic example is at ColoradoRLA/dominion-2017-CVR_Export_20170310104116.csv, which is even stranger CSV, but is the actual current format. A cleaner, valid CSV version is at ColoradoRLA/dominion-2017-CVR_Export_20170310104116-clean.csv.
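To illustrate the multiple-header idea, here is a rough sketch that writes three header rows (contest names, candidate names, then column names followed by party affiliations) ahead of the data rows. The exact layout is an assumption and should be checked against the example files before relying on it. The `CvrNumber` column, the 0/1 encoding of selections, and the helper name `write_cvr` are also assumptions.

```python
import csv
import io

# Assumed identifier columns; verify names and order against the example files.
ID_COLUMNS = ["CvrNumber", "TabulatorNum", "BatchID",
              "CountingGroup", "PrecinctPortion"]


def write_cvr(contests, rows, out):
    """Write a multi-header CSV (hypothetical layout).

    contests: list of (contest_name, [(candidate, party), ...])
    rows: list of (id_values, {contest_name: chosen_candidate_or_None})
    """
    writer = csv.writer(out, lineterminator="\n")
    pad = [""] * len(ID_COLUMNS)
    # One output column per (contest, candidate) pair.
    columns = [(c, cand, party) for c, cands in contests
               for cand, party in cands]
    writer.writerow(pad + [c for c, _, _ in columns])            # contest names
    writer.writerow(pad + [cand for _, cand, _ in columns])      # candidate names
    writer.writerow(ID_COLUMNS + [p for _, _, p in columns])     # parties
    for id_values, selections in rows:
        marks = [1 if selections.get(c) == cand else 0
                 for c, cand, _ in columns]
        writer.writerow(list(id_values) + marks)


# Made-up contest and cards for illustration.
contests = [("Contest 1", [("Alice", "DEM"), ("Bob", "REP")])]
rows = [((1, 1, 1, 1, 1), {"Contest 1": "Alice"}),
        ((2, 1, 1, 1, 1), {"Contest 1": None})]
buf = io.StringIO()
write_cvr(contests, rows, buf)
cvr_text = buf.getvalue()
```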

The Clarity data doesn't indicate batch sizes and the like, so to start with, it's fine to just use a single TabulatorNum value, say "1", and a single BatchID value, say "1", or perhaps mostly uniform batches of 100 or 200 cards. CountingGroup and PrecinctPortion are also not important.
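The uniform-batch option could be sketched as follows: cards are numbered in order and every `batch_size` consecutive cards share a BatchID, with one fixed TabulatorNum. The helper name `assign_batches` and its parameters are hypothetical.

```python
def assign_batches(num_cards, batch_size=100, tabulator_num=1):
    """Return a (TabulatorNum, BatchID) pair for each card, in order.

    Hypothetical helper: batches are filled sequentially, so only the
    last batch can be smaller than batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [(tabulator_num, i // batch_size + 1) for i in range(num_cards)]


# 250 cards in batches of 100 gives batches 1, 2 and a partial batch 3.
example_batches = assign_batches(250, batch_size=100)
```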

Note that the rows are not ballots, but ballot cards. Most ballots have two cards. It's fine to simplify at first to produce synthetic data in which the simulated CVRs have all contests on a single card.
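For the single-card simplification, per-contest lists of selections can be zipped into one record per card. This sketch assumes each contest's selections were generated independently, so it does not reproduce any real correlation between contests, and it pads shorter contests with `None` to mean the contest has no vote on that card. The name `combine_contests` is hypothetical.

```python
from itertools import zip_longest


def combine_contests(votes_by_contest):
    """Merge {contest: [selection, ...]} into one dict per card.

    Contests with fewer selections are padded with None (an assumption
    standing in for "no vote recorded in this contest").
    """
    names = list(votes_by_contest)
    cards = zip_longest(*(votes_by_contest[n] for n in names), fillvalue=None)
    return [dict(zip(names, card)) for card in cards]


# Made-up selections for illustration.
example_cards = combine_contests({
    "Contest 1": ["A", "B", "A"],
    "Contest 2": ["Yes", "No"],
})
```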

If anyone would be willing to help us out, we would love it, and it would help advance auditing in Colorado, the US and the world!

ron-rivest / 2017-bayes-audit