usnistgov / CastVoteRecords

Common data format specification for cast vote records
https://pages.nist.gov/CastVoteRecords

Deprecate XML, make JSON the recommended format. #29

Open raylutz opened 3 years ago

raylutz commented 3 years ago

Organization Name: Citizens Oversight

Organization Type: NGO Nonprofit, Developer of AuditEngine development platform

Document (e.g., CastVoteRecords): CVR

Reference (Include section and paragraph number): General

Comment (Include rationale for comment): The two formats, XML and JSON, are not very different in structure. JSON has now become the industry leader while XML is no longer recommended. It is senseless to support two formats. Writers of CVRs should have one format. If they are currently writing XML, it can easily be converted. At this stage in the standard adoption curve, there is no rationale to support two formats.

Suggested Change: Deprecate XML and make JSON the recommended format

Organization Type: 1 = Federal, 2 = Industry, 3 = Academia, 4 = Self, 5 = Other

jungshadow commented 3 years ago

Thanks for the thoughts, @raylutz! The current landscape largely dictates the formats that need to be supported. Many existing systems have built-in libraries that process XML, so, in order to allow these systems to support the CDFs, the formats we produce have to be both accommodating and forward-thinking. The goal of this work is to increase adoption while improving the overall ecosystem. The main thing I focus on is that the standard is the model, which allows us to support any number of legacy or future formats as time goes on. If the industry starts to coalesce around a single format, we may be able to sunset one format or another, but that may take some time. On the next call, we could ask the existing system providers which format(s) they currently use to get a read on what's out there.

raylutz commented 3 years ago

@jungshadow -- Deprecating means that JSON will be the recommended standard for future implementations and XML is not recommended. There should be at least a recommended format for best interoperability. There are VERY FEW "existing systems that have built in libraries" because this landscape is actually dominated by just a very few companies and very few such existing libraries. And I agree that you need a big library to process XML because it is a pesky beast with all sorts of strange issues. I believe that JSON provides a very lightweight data encoding standard that is both human readable and easy to parse. It does not require extensive libraries to parse, which is the good thing about it, and is why the adoption of JSON now outstrips support for XML: at the level of data interoperability, it is recognized as a wise choice.

benadida commented 3 years ago

+1 to @raylutz's comment:

I don't see the benefit of keeping both, and I don't see the practical benefit of XML over JSON at this point.

JDziurlaj commented 3 years ago

As someone who has worked extensively on CDF implementations across many jurisdictions, I will say that there is strong pressure to continue using XML. This reflects a desire to maintain existing infrastructure and skill sets in election offices. Many relational database vendors have only recently added support for JSON (if at all), and not all jurisdictions are on these versions. Additionally, many development frameworks have very mature tooling for XML, and those frameworks are heavily used by government and industry alike (e.g. .NET and Java).

I know that these frameworks and XML itself may not be seen as "cutting edge", but we must provide an accessible on-ramp to election officials and vendors. Counterintuitively, this is XML.

jungshadow commented 3 years ago

Regarding the existing infrastructure and skill sets, that's been my experience, too, @JDziurlaj.

raylutz commented 3 years ago

Please, @JDziurlaj, what are the "many CDF implementations" regarding the cast-vote-record standards? We have a big problem already in that the standard is hardly adopted at all. So really saying there is an installed base we must respect does not square with any reality I am aware of. But certainly you have been on the leading edge. So I am interested in the many implementations, because the only one I know of that even comes close is Dominion and they use JSON.

JDziurlaj commented 3 years ago

Hi @raylutz. I was speaking to my general experience with CDF development. As to my experience with the CVR CDF in particular, the Universal Rank Choice Voting Tabulator consumes the CVR CDF in both JSON and XML. Additionally, I did some testing with a manufacturer that was producing XML CVRs.

raylutz commented 3 years ago

Alright, well then @JDziurlaj, this points out why I dislike using GitHub issue threads. This is an issue thread about CastVoteRecords, and my comments are specific to that. You may be making a reasonable point about the larger set of records, such as results reporting, that are not affected by the verbosity of XML. CVRs are generally very large, with hundreds of thousands or millions of records. In terms of real voting equipment, there is minimal deployment of anything, and what exists is JSON. Thus I believe no case has yet been made to keep XML.

JDziurlaj commented 3 years ago

@raylutz I am removing the breaking tag given your explanation of deprecated.

benadida commented 3 years ago

@JDziurlaj @jungshadow I would strongly recommend itemizing the use cases explicitly before continuing to commit to XML.

Here's my use case: we're a voting machine vendor, and we've had conversations with other players in this space where we assumed they would be using XML even as we prefer JSON. Turns out, even with long-established codebases running on a .NET platform where, we agree, XML tooling is well established, folks we've worked with have preferred JSON unanimously. We have yet to encounter one that prefers XML. We have yet to encounter a practical reason for XML.

I would also ask for examples of election administrators wanting XML for CVRs. In our work on audits, election officials have overwhelmingly preferred CSV. CSV is not rich enough, we agree, for the full CVR data structure, so JSON is obviously more appropriate. But, we've yet to work with any election administrators who wanted XML.

So, my suggestion would be, for this new standard, to drive with specific use cases. Supporting two serializations for the same model carries a significant cost, including the security implications of parsing untrusted content. So the benefit should be explicit and specific.

JDziurlaj commented 3 years ago

@benadida, when we consider support for various output serializations we need to consider all CDFs, not just Cast Vote Records. When it comes to use-cases, there are probably three I would consider:

  1. the need for bona fide markup support,
  2. the need for strong metadata capabilities (via attributes), and
  3. attachment to the XML ecosystem.

For (1), currently no CDF uses the markup capabilities of XML. This may change with the proposed Ballot Styles CDF, depending on the direction it takes. For (2), there are some places where we do use attributes in XML, such as for Object Ids, MIME types, and file names, among others. I could see metadata needs increasing as vendors work more with the files and need to convey things such as data lineage. For (3), we have found almost all vendors to be using XML, and many use XML-centric technologies, such as XSLT, to perform data mapping (NB: XSLT3 now supports JSON as well). Finally, VIP, the most used (non-NIST) CDF in the election space, uses XML exclusively. Election Results Reporting (1500-100) and VIP are very closely related (through frequent collaboration of the two teams), and to deprecate XML would risk losing synergies between them.

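I am certainly sensitive to the complexities of supporting multiple serializations of the CDFs. I have even done some work (https://github.com/HiltonRoscoe/CDFPrototype/blob/master/conversion/format_conversion.md#converting-between-physical-formats) to explore ways vendors can use mechanical transformations to convert between JSON and XML.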
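
To make the idea of a mechanical transformation concrete, here is a minimal Python sketch of a schema-unaware XML-to-JSON conversion. It is not the method from the comment above or from any CDF tooling. The element names (`Report`, `Record`, `Contest`, `Selection`), the sample data, and the conventions of prefixing attributes with `@` and storing mixed text under `#text` are all assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample document; these element names are NOT taken from the CVR schema.
SAMPLE_XML = """
<Report ObjectId="report-1">
  <Record ObjectId="rec-1"><Contest>Mayor</Contest><Selection>Alice</Selection></Record>
  <Record ObjectId="rec-2"><Contest>Mayor</Contest><Selection>Bob</Selection></Record>
</Report>
"""


def element_to_dict(elem):
    """Convert one element to JSON-compatible data.

    Assumed conventions: attributes get an '@' prefix, repeated child
    tags are collected into a list, and a leaf element becomes its text.
    """
    node = {f"@{name}": value for name, value in elem.attrib.items()}
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # Second or later occurrence of this tag: promote to a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if not node:
        return text  # leaf element with no attributes or children
    if text:
        node["#text"] = text  # text alongside attributes or children
    return node


root = ET.fromstring(SAMPLE_XML.strip())
converted = {root.tag: element_to_dict(root)}
json_text = json.dumps(converted, indent=2)
print(json_text)
```

One limitation of any conversion that ignores the schema: a report with a single `Record` would come out as an object rather than a one-element list, so a real converter would likely need the schema to decide which elements are repeatable.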

raylutz commented 3 years ago

John:

I would like to further assert that only one of JSON or XML be used, because the formats are very similar. In other words, both are extremely verbose structured data approaches, and are very inefficient and difficult to work with for many reasons. I will have other proposals for a flat CVR format, which is also a direct conversion from JSON or XML but provides useful structural features: it is easy to sum all the columns and create subtotals over any groups, it is compatible with distributed processing, and it provides information about all marks on the ballot (not just those that are later deemed "votes"). The JSON or XML CVR is extremely inefficient in terms of size, which does become a factor in larger jurisdictions. The CVR as it stands is "lossy" in that converting to it loses information (not all marks are represented, only those deemed votes, and if an overvote occurs, we don't know what was marked), and will likely not be our favorite as a result. Nevertheless, there is not much difference between XML and JSON except that these days, JSON is preferred because it has fewer degrees of freedom, and therefore results in better compatibility. Therefore, I suggest again that we deprecate the XML format, and further, I suggest you do the same across all similar CDF standards.

--Ray
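
As a rough illustration of why a flat layout makes column sums and group subtotals easy, the sketch below flattens nested records into one row per ballot. It is not the flat format proposed above, which is not specified in this thread. The field names (`id`, `precinct`, `marks`, `contest`, `option`) and the sample data are hypothetical.

```python
import json
from collections import Counter

# Hypothetical nested records; the field names are assumptions, not the CVR schema.
SAMPLE_JSON = """
[
  {"id": "cvr-1", "precinct": "P1", "marks": [
      {"contest": "Mayor", "option": "Alice"},
      {"contest": "Measure A", "option": "Yes"}]},
  {"id": "cvr-2", "precinct": "P1", "marks": [
      {"contest": "Mayor", "option": "Alice"},
      {"contest": "Mayor", "option": "Bob"}]},
  {"id": "cvr-3", "precinct": "P2", "marks": [
      {"contest": "Mayor", "option": "Bob"}]}
]
"""

ID_COLUMNS = ("id", "precinct")


def flatten(records):
    """Build one flat row per ballot with a 'contest/option' column per mark."""
    rows = []
    for record in records:
        row = {key: record[key] for key in ID_COLUMNS}
        for mark in record["marks"]:
            # Every mark is kept, including both marks of an overvote.
            row[f'{mark["contest"]}/{mark["option"]}'] = 1
        rows.append(row)
    return rows


def column_totals(rows, group_key=None):
    """Sum the mark columns overall, or per group when group_key is given."""
    totals = {}
    for row in rows:
        group = row[group_key] if group_key else "all"
        counter = totals.setdefault(group, Counter())
        for column, value in row.items():
            if column not in ID_COLUMNS:
                counter[column] += value
    return totals


rows = flatten(json.loads(SAMPLE_JSON))
overall = column_totals(rows)
by_precinct = column_totals(rows, group_key="precinct")
print(dict(overall["all"]))
print({group: dict(counts) for group, counts in by_precinct.items()})
```

In this sample, `cvr-2` carries two marks in the same contest, and both survive flattening, so the row records what was marked rather than only what was later deemed a vote.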
