qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/

Canonical serialization #938

Open Arithmeticus opened 9 months ago

Arithmeticus commented 9 months ago

This issue picks up suggestions from #779 regarding canonical serialization, and asks the community group whether such a function is desirable and, if so, what it might look like.

In the context of #779, the idea was that two XML documents that are semantically equivalent but have different physical representations could each be serialized to a canonical form, and a hash value computed for each would confirm that they are identical. Of course, once both are in canonical form, a simple string comparison would be sufficient, without any hashing.
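To illustrate the idea outside XPath, here is a minimal Python sketch. It relies on the standard library's `xml.etree.ElementTree.canonicalize`, which implements C14N 2.0 rather than the Canonical XML 1.1 discussed in this issue, so treat it as an analogy only. The two sample documents are made up for the example.

```python
import hashlib
from xml.etree.ElementTree import canonicalize

# Two physically different but equivalent documents (sample data, assumed):
# different attribute order, quote style, whitespace inside the tag,
# empty-element syntax, and presence of an XML declaration.
doc_a = '<root b="2" a="1"><child/></root>'
doc_b = "<?xml version=\"1.0\"?><root a='1'   b='2'><child></child></root>"

# Canonicalize both (C14N 2.0 in the Python standard library).
canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)

# A plain string comparison is enough once both are canonical.
same_by_string = canon_a == canon_b

# Hashing is optional, e.g. when only a digest is stored.
hash_a = hashlib.sha256(canon_a.encode("utf-8")).hexdigest()
hash_b = hashlib.sha256(canon_b.encode("utf-8")).hexdigest()
same_by_hash = hash_a == hash_b
```

Here the raw strings differ, while both the canonical strings and their digests match.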

XML Signature was suggested as one approach, with some hesitation. I would like to suggest, instead, that we look to implement Canonical XML Version 1.1 (herein CX1.1), perhaps with an options map that calibrates how CX1.1 is applied. I have no experience using CX1.1, so user input is welcome.
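As a rough picture of what calibrating options could look like, the sketch below uses the keyword options of Python's `xml.etree.ElementTree.canonicalize`. These are C14N 2.0 options in one particular library, not CX1.1 parameters and not a proposal for the actual option names; the sample document is invented.

```python
from xml.etree.ElementTree import canonicalize

# Sample input (assumed): indentation, a comment, and padded text.
doc = '<root>\n  <!-- note -->\n  <item> a </item>\n</root>'

# Default behaviour: comments are dropped, whitespace is preserved.
default = canonicalize(doc)

# Option: keep comments in the canonical output.
with_comments = canonicalize(doc, with_comments=True)

# Option: strip leading/trailing whitespace from text content,
# which also removes whitespace-only indentation between elements.
stripped = canonicalize(doc, strip_text=True)
```

The point is only that the same input can yield different canonical forms depending on the options, so any comparison has to fix the options on both sides.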

Another point of discussion is whether this merits a new function, e.g., fn:canonical-serialize, or should be built upon fn:serialize. A problem with the latter option is that it makes no sense unless the method option is specified as xml. Another approach would be to go deeper, into the serialization spec, and expand the xml method to offer a canonical option.

I believe that this function would be extremely useful. When preparing test suites, expected output could be saved as canonical XML in secondary documents; subsequent regression tests could then convert the results being compared to canonical XML as well, allowing very precise node-wise comparisons.

I look forward to everyone's input.

michaelhkay commented 9 months ago

(a) Yes I think it would be useful.

(b) I think my preference would be to use fn:serialize with method="canonical-xml". An alternative is to use method="xml" canonical="yes", but this has the disadvantage that there are many interactions with other serialization options, e.g. indent, cdata-section-elements, and omit-xml-declaration.

Note, if you want to experiment, Saxon already offers `method="xml" saxon:canonical="yes"`: see https://www.saxonica.com/documentation12/index.html#!extensions/output-extras/serialization-parameters. I tested this against the canonicalizer offered by XOM.

(c) There are certainly users who would want XML Signature for document signing, rather than just canonicalisation.

ChristianGruen commented 9 months ago

I also think this would be useful.

> (b) I think my preference would be to use fn:serialize with method="canonical-xml". An alternative is to use method="xml" canonical="yes", but this has the disadvantage that there are many interactions with other serialization options, e.g. indent, cdata-section-elements, and omit-xml-declaration.

+1 for adding a custom method, and we should raise an error if the input is not a single node.
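A minimal Python sketch of that behaviour follows. The `serialize` function here is a hypothetical stand-in written for this example, not fn:serialize itself; the method name "canonical-xml" is taken from the suggestion above, and the canonical output comes from Python's C14N 2.0 implementation rather than CX1.1.

```python
from xml.etree.ElementTree import Element, canonicalize, fromstring, tostring

def serialize(items, method="xml"):
    """Hypothetical stand-in for fn:serialize, for illustration only."""
    if method == "canonical-xml":
        # Reject anything other than exactly one element node.
        if len(items) != 1 or not isinstance(items[0], Element):
            raise ValueError("canonical-xml requires a single node as input")
        # A dedicated method takes no indent/cdata/declaration options,
        # so there is nothing for it to interact with.
        return canonicalize(tostring(items[0], encoding="unicode"))
    if method == "xml":
        # Ordinary serialization: concatenate each node's serialization.
        return "".join(tostring(item, encoding="unicode") for item in items)
    raise ValueError(f"unknown method: {method}")

node = fromstring('<a y="2" x="1"/>')
canonical = serialize([node], method="canonical-xml")

try:
    serialize([node, node], method="canonical-xml")
    raised = False
except ValueError:
    raised = True
```

With a separate method the single-node check lives in one place, whereas a canonical="yes" flag on method="xml" would also have to decide what to do with every conflicting option.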

Arithmeticus commented 7 months ago

In thinking about discussion that might happen on this issue at today's CG meeting, I noted to myself:

Ideas for a way forward to avoid repetition in the specs:

I am uncertain what edits might be needed to the serialization specs, particularly for those options in fn:deep-equal that have no fn:serialize counterpart, e.g., schema-aware adjustments, processing-instructions. It may turn out we wish not to include some of these in a common options map. Let's discuss.

adamretter commented 4 months ago

I would be happy with either:

  1. method=xml-c14n
  2. or method=xml canonical=yes

As to whether to add a canonical option at all, I guess we should ask whether it would be relevant to other methods such as HTML, JSON, or perhaps any future method we might envisage (YAML, or CSV, anyone?).
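For JSON at least, a comparable normalization is easy to picture. The Python sketch below sorts keys and removes insignificant whitespace; it is an informal approximation for illustration, not an implementation of any JSON canonicalization standard, and the helper name `normalize_json` is made up here.

```python
import json

def normalize_json(text):
    """Re-serialize JSON with sorted keys and no insignificant whitespace."""
    return json.dumps(
        json.loads(text),
        sort_keys=True,          # fixed member order
        separators=(",", ":"),   # no spaces after separators
        ensure_ascii=False,      # keep non-ASCII characters as-is
    )

# Two equivalent JSON texts (sample data) with different layout and key order.
json_a = '{"b": 1, "a": [1, 2]}'
json_b = '{ "a":[1,2],\n  "b":1 }'

normalized_a = normalize_json(json_a)
normalized_b = normalize_json(json_b)
```

Both inputs normalize to the same string, which suggests a canonical option could mean something for methods other than xml, although number formatting and string escaping would need their own rules.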