usnistgov / CastVoteRecords

Common data format specification for cast vote records
https://pages.nist.gov/CastVoteRecords

Add "Batch Manifest" to the CVR #30

Open raylutz opened 3 years ago

raylutz commented 3 years ago

Organization Name: Citizens Oversight

Organization Type: 2 (Nonprofit, developer of AuditEngine)

Document (e.g., CastVoteRecords): CastVoteRecords

Reference (Include section and paragraph number): General

Comment (Include rationale for comment):

Currently, the CVR format commonly produces chunks of data representing the output of a tabulator, such as a tabulator in a polling place. There may be a vast number of these chunks, perhaps over 10,000, each describing perhaps 200 ballots. The alternative is to combine all the chunks into a single file, which can grow very large and become unmanageable. Maintaining the data as chunks is therefore a viable approach that allows the standard to scale to any number of ballots without producing unwieldy files. Since this approach works at all scales, using smaller chunks, each representing a batch, should be recommended.

Also, it should be recommended that election operations organize physical ballots in the same batches.

There is nothing in the standard at this time to describe these chunks and to ensure that all of the chunks are included and unaltered. Indeed, in general, there is very little "self description" in the standard: the result is a set of files, say encoded as JSON, without any master description of those files. Most of the files produced by implementations of the standard are a simple "dict" (a set of name/value pairs) or "list of dict", which can be viewed as a table of rows, where the JSON or XML allows sparse population of those rows. In other words, the number of rows of data, i.e. the number of ballots in a given chunk, is neither limited nor described anywhere, and the number of chunks is similarly neither limited nor described.

Therefore, we propose a Batch Manifest to accomplish both of the above goals for the primary data of the CVR, which is the list of ballots in batches.

In terms of structure, we propose that the batch manifest be a simple list-of-dict structure that can be easily mapped to a table with rows and columns. Such a batch manifest can be generated by the exporting function and does not alter the meaning of any fields in the CVR definition.

| field | type | description |
| --- | --- | --- |
| batchid | str | typically a combination of tabulator and batch, creating a unique batchid across all batches |
| tabulator | int | TabulatorId |
| batch | int | batch number produced by that tabulator; starts at 1 for each tabulator, not unique across tabulators |
| locationid | str | (optional) physical box or location of the physical ballots (if different from batchid) |
| first_ballot_idx | int | the first ballot_idx in the batch, where ballot_idx is nominally an integer that starts at 1 |
| count | int | the number of ballots in the batch |
| cvr_basename | str | the filename, without path, of the file containing the batch, like `CvrExport_\d+.json` |
| batch_hash_digest | str | string describing the secure hash digest of the file, like `SHA256=[0-9a-f]{64}` |
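
For concreteness, a minimal sketch of one manifest row as it might appear, using the field names above; every value is hypothetical.

```python
import json

# One hypothetical row of the proposed batch manifest; the full manifest
# would be a list of such rows ("list of dict").
manifest_row = {
    "batchid": "102_0007",            # tabulator + batch, unique across all batches
    "tabulator": 102,                 # TabulatorId
    "batch": 7,                       # per-tabulator batch number, starts at 1
    "locationid": "box_A17",          # optional physical box or location
    "first_ballot_idx": 1201,         # first ballot_idx in the batch
    "count": 200,                     # number of ballots in the batch
    "cvr_basename": "CvrExport_7.json",
    "batch_hash_digest": "SHA256=" + "0" * 64,  # placeholder digest value
}

print(json.dumps([manifest_row], indent=2))
```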

Typically, a single chunk of JSON describing a batch will include only one batchid internally.

It may be advisable to allow the cvr_basename to contain the batchid, such as `CvrExport_{batchid}.json`.

We suggest that implementations be allowed to define additional named fields for internal use, to be ignored by readers of the format that do not use them.

It may be necessary to have yet another table to provide a cross reference between the locationid and the batchid, if it is not included in this table.
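
To illustrate the integrity goal, a sketch of how a receiver might verify a delivered package against its batch manifest; the manifest file name `batch_manifest.json` is an assumption, not part of the proposal.

```python
import hashlib
import json
from pathlib import Path

def verify_package(package_dir: str, manifest_name: str = "batch_manifest.json") -> bool:
    """Recompute each chunk's digest and compare it to the manifest."""
    package = Path(package_dir)
    rows = json.loads((package / manifest_name).read_text())
    ok = True
    for row in rows:
        chunk = package / row["cvr_basename"]
        if not chunk.exists():
            print(f"missing chunk: {row['cvr_basename']}")
            ok = False
            continue
        digest = "SHA256=" + hashlib.sha256(chunk.read_bytes()).hexdigest()
        if digest != row["batch_hash_digest"]:
            print(f"hash mismatch: {row['cvr_basename']}")
            ok = False
    return ok
```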

Suggested Change: See above

Organization Type: 1 = Federal, 2 = Industry, 3 = Academia, 4 = Self, 5 = Other

JDziurlaj commented 3 years ago

Hi @raylutz, can you describe how large the CVRs can get before processing them poses a problem?

raylutz commented 3 years ago

Hi John:

You ask how large a CVR can be before processing becomes a problem. This depends on the methods used to process it and the capacity of the machine, so there is no hard answer. Some cloud-based machines are limited to 250MB of total memory. Also, cloud-based machines can't incrementally read JSON: you either read the entire file or you don't, unless you know exactly what byte offsets to pull the contents from. JSON has no directory, so it is not possible to pull an arbitrary section out of the middle.

Most of the files in the CVR are small, simple list-of-dict structures, and those are not a problem. The only problematic one is the CvrExport file, which is either one big file or split into CvrExport_n.json chunks, where n is an integer like 0, 1, 2, etc.
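
As a sketch of why the chunked form helps, a reader can process the chunks one at a time, so peak memory is bounded by the largest chunk rather than the combined export; the directory layout here is assumed.

```python
import json
from pathlib import Path

def iter_chunks(export_dir: str):
    """Yield (name, parsed JSON) for each chunk, one at a time."""
    for path in sorted(Path(export_dir).glob("CvrExport_*.json")):
        with open(path, encoding="utf-8") as f:
            yield path.name, json.load(f)

for name, chunk in iter_chunks("export"):
    pass  # tally or index this chunk; it can then be garbage-collected
```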

Dominion now creates a CVR output similar to the NIST standard. 50K ballots result in a ZIP file which is 6.66 MB zipped. Compressing with ZIP is common practice, but the standard does not specify an archive format or compression. Unzipped, it is 255MB. A JSON file that size will cause some free JSON readers to choke, and it cannot be read by limited-memory cloud machines like AWS Lambda.

Dominion offers two output options: 1) one big file, or 2) batches, one per JSON file. We specify that we want the output from Dominion to be exported as batches.

If you zip the chunks, the result is only slightly larger, at 7.59 MB. But the nice thing about the ZIP format (as opposed to, say, tar) is that it does not compress the archive as a whole; it compresses each file separately and stores them in the archive. Thus it is possible to pull out an individual JSON chunk, maybe only about 300K, because each one is a separate file and the ZIP archive provides its location. Basically, you just read that file from the zip.
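
A sketch of that random access using Python's standard zipfile module; the archive and member names are hypothetical. Only the requested member is decompressed.

```python
import json
import zipfile

# The ZIP central directory locates each member, so one ~300K chunk can be
# read without extracting the rest of the archive.
with zipfile.ZipFile("cvr_export.zip") as archive:
    with archive.open("CvrExport_42.json") as member:
        chunk = json.load(member)
```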

The use of chunks can scale to any size of election without breaking memory limits. To give you an idea, the entire CVR for Arizona (2.1M ballot sheets) is 1.9 GB zipped. Fully unzipped, it is 33GB, in 10,344 chunks.

--Ray


JDziurlaj commented 3 years ago

So if I am understanding this correctly, this proposal seeks to allow random access into a set of NIST CVRs, so that a particular CVR or set of CVRs can be easily retrieved? A secondary goal is to provide a structured means to hold the hashes of each CVR file. Are there any other use cases for this proposal?

raylutz commented 3 years ago

This is intended mainly as a means to bundle up a large set of CvrExport.json files, maybe many thousands of them, and lock them into a package, so that when we receive the package we can identify all of its components. There is an additional issue we need to address: there may be several versions of the CVR released at different times. This is somewhat different from the snapshot concept used for adjudications.

It is common, for example, for there to be several stages of release, such as election night, semi-final unofficial, and final. I don't think the other manifest files would need to change at all, but a new stage would change the set of CvrExport.json files and the batch manifest. In such a staged release, only the set of batches will likely change, not the individual batches. For example, say the election-night report has batches 0001 to 0100. Then, when VBM ballots are fully processed, there are batches 0001 to 0200. And when provisionals are added to the set, a few more are added. Each of the earlier batches is still the same and would have the same hash, but the set includes more batches. Thus the batch manifest should have a version, or perhaps better, a stage indicator.
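
As a sketch of the staged-release check this would enable (assuming the manifest fields proposed earlier; the file layout is hypothetical): batches carried over from an earlier stage should keep identical hashes, with later stages only adding batches.

```python
import json

def check_stages(earlier_path: str, later_path: str) -> bool:
    """Verify a later-stage manifest is a superset of an earlier one."""
    earlier = {r["batchid"]: r["batch_hash_digest"]
               for r in json.load(open(earlier_path))}
    later = {r["batchid"]: r["batch_hash_digest"]
             for r in json.load(open(later_path))}
    unchanged = all(later.get(b) == h for b, h in earlier.items())
    added = sorted(set(later) - set(earlier))
    print(f"{len(added)} batches added; earlier batches unchanged: {unchanged}")
    return unchanged
```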

Due to the need to maintain a locked-down CVR set as ballots are processed, snapshots should not be added inside the CVR (unless adjudication is done in real time as ballots are scanned); they should be added as a separate file. This differs from the way the CVR has been conceptualized and implemented, where the snapshots are expected to exist in the same file, which would be modified to include the adjudication information as a separate "Version". I commented on this when the CVR was being designed, but my comments were not fully embraced.

Each CvrExport.json version should be "immutable": as you process the canvass, the data from the prior stages should remain unchanged, and additional information can be added only as new files. This can likely still be accommodated if we add the snapshot information when it is actually produced, which may be after the first stages have already been completed. A separate adjudication phase added to the process after the fact should not alter the CVR files already produced and locked down with a hash; instead, the adjudication is added later in a separate CvrExport file. If adjudication is done in real time, as I believe is the case for Dominion, then the "Modified" snapshot can be included in the original CvrExport file adjacent to the "Original" version.

If adjudication is done later, then it is best not to go back and alter the CvrExport file, because that would change the hash and therefore the batch manifest.

Thus it is essential to be able to find the change and know that the record has been changed, because the changed record may not be embedded in the original CVR file if we are to respect the requirement of immutability.
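
One way to read this overlay model, as a sketch only: the original chunk stays locked down, and adjudication files are applied on top of it in release order. The per-record "cvr_id" key is invented here for illustration and is not part of the spec.

```python
import json

def latest_records(original_path: str, adjudication_paths: list) -> dict:
    """Return the most recent version of each record, without mutating any file."""
    current = {}
    for path in [original_path, *adjudication_paths]:  # in release order
        for record in json.load(open(path)):
            current[record["cvr_id"]] = record  # later files supersede earlier ones
    return current
```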

In terms of random access: yes, it is also needed with very large data sets that will not easily fit in memory, because the ballot images are not usually organized the same way as the CVRs; they are combined into ZIP files. If we process a set of ballot images and generate an independent tabulation from those images, the results will be in the natural order in which we found the images in the ZIP archives, and then in the order the archives were processed. To compare the two tabulations, it is then necessary either to reorder one of them or to have random access.

But a manifest whose purpose is to facilitate random access is something we can generate ourselves by scanning the chunks, so random access is not the primary driver of this proposal.
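
For completeness, a sketch of generating such a manifest by scanning the chunks; how batchid is recovered and the assumption that each chunk is a list of CVR dicts depend on the exporting system.

```python
import hashlib
import json
from pathlib import Path

def build_manifest(export_dir: str) -> list:
    """Scan CvrExport_*.json chunks and build batch manifest rows."""
    rows, next_idx = [], 1
    for path in sorted(Path(export_dir).glob("CvrExport_*.json")):
        data = path.read_bytes()
        records = json.loads(data)  # assume each chunk is a list of CVR dicts
        rows.append({
            "batchid": path.stem.removeprefix("CvrExport_"),
            "first_ballot_idx": next_idx,
            "count": len(records),
            "cvr_basename": path.name,
            "batch_hash_digest": "SHA256=" + hashlib.sha256(data).hexdigest(),
        })
        next_idx += len(records)
    return rows
```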

--Ray


JDziurlaj commented 3 years ago

If different snapshots of the same CVR are to be included in separate CVR files, how do you determine which one to process?

raylutz commented 3 years ago

Hi John: It looks like I'm frequently missing these, so sorry for the delay. To respect the requirement of immutable data, it is not allowed to go back and revise a status value; thus, the records have to be self-describing. Right now there are 'Original' and 'Modified' statuses. I imagine there might be several 'Modified' entries if several versions have been submitted, say through several rounds of adjudication. It makes sense to complete the initial CVR and lock it down, then process adjudications. The adjudicator app will be able to read the CVR and all other intervening adjudication files. It can then see the 'Original'; I suggest 'Modified' for the first change, and then 'Modified-2', 'Modified-3', etc. for subsequent snapshots if there are more than two.

In general, I have seen this done in other standards by using a count: instead of "Original" they just use 0, then 1, etc. To maintain compatibility with any prior use (although there is not much of an installed base), we could stay with Original and Modified, with Modified-2, Modified-3, etc. used beyond that, so that Original is 0, Modified is 1, and so on. And if we want to be aggressive about this, we would make the numeric designation preferred and deprecate the words.
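
A sketch of that ordering scheme, mapping the labels to the 0, 1, 2, ... counts just described so the latest snapshot can be selected:

```python
def snapshot_rank(label: str) -> int:
    """Map 'Original', 'Modified', 'Modified-2', ... to 0, 1, 2, ..."""
    if label == "Original":
        return 0
    if label == "Modified":
        return 1
    if label.startswith("Modified-"):
        return int(label.split("-", 1)[1])
    raise ValueError(f"unknown snapshot label: {label}")

latest = max(["Original", "Modified", "Modified-3", "Modified-2"], key=snapshot_rank)
# latest == "Modified-3"
```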

--Ray