w3c / rdf-canon

RDF Dataset Canonicalization (deliverable of the RCH working group)
https://w3c.github.io/rdf-canon/spec/

First degree hash algorithm and exact specification of hash #37

Closed iherman closed 1 year ago

iherman commented 1 year ago

§4.7.3 (5) and (6) talk about sorted N-Quads being joined and then hashed. "Sort" has been discussed; what does "join" mean exactly? Is it identical to, say, JavaScript's join operation on an array of strings, i.e., the lines are concatenated with no space included? Or something else? I think this should be unambiguously defined.

iherman commented 1 year ago

B.t.w., I hit this issue because my results are different from the ones listed in the example. I do get the following array for the quads:

[
  '<http://example.com/#p> <http://example.com/#q> _:a .',
  '_:a <http://example.com/#s> <http://example.com/#u> .'
]

which is the properly expanded quads (well, triples) for the (first) example, and I then do a

createHash("sha256").update(data).digest('hex')

using node.js' built-in crypto. And data is the result of array.join(). What I get is

11582ec62171c791cdcd6c7e47181acffa335d47d6a7c4cfdb01d6ce23b0432a

which is different from the one in the spec:

21d1dd5ba21f3dee9d76c0c00c260fa6f5d5d65315099e553026f4828d0dc77a

the difference shows that something may not be unambiguously defined (or I make a stupid mistake...)

dlongley commented 1 year ago

@iherman,

the difference shows that something may not be unambiguously defined (or I make a stupid mistake...)

Or the example hash isn't right :)

But, yes, you should be joining the N-Quads by concatenating them without any spaces. However, it looks like your N-Quads do not have EOL (\n) chars at the end of them. I quickly threw those in there and I get the proper hash when joining the array you provided in JS using array.join(''). IOW, the example hash is right and you just need to add a terminating \n char to each N-Quad.
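
To illustrate the difference between the join variants mentioned here, a minimal Node.js sketch follows. The variable names are hypothetical; the only library behaviour assumed is that `Array.prototype.join` uses a comma when called without a separator.

```javascript
// The two quads from the earlier comment, without EOL characters.
const quads = [
  '<http://example.com/#p> <http://example.com/#q> _:a .',
  '_:a <http://example.com/#s> <http://example.com/#u> .'
];

// join() with no argument inserts a comma between the elements.
const commaJoined = quads.join();

// join('') is plain concatenation, but the quads still lack an EOL.
const concatenated = quads.join('');

// Append "\n" to every quad first, then concatenate.
const terminated = quads.map((q) => q + '\n').join('');

// JSON.stringify makes the separators and newlines visible.
console.log(JSON.stringify(commaJoined));
console.log(JSON.stringify(concatenated));
console.log(JSON.stringify(terminated));
```

Only the last form has one terminating newline per quad, which is the input described above as producing the expected hash.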

dlongley commented 1 year ago

So, something else we'll want to clarify is that the EOL needs to be there for every quad.

gkellogg commented 1 year ago

Yes, the hashes were generated with a \n separating quads. In the N-Quads grammar, it's not necessarily apparent that the EOL is part of the serialized quad, but a document must delimit each quad with a newline, and the document must end with a newline.

If you interpret Canonical N-Quads based on the definition for N-Triples, the whitespace within a statement is clearly defined, but interestingly, the whitespace between statements is not defined (it is simply [EOL](https://www.w3.org/TR/n-quads/#grammar-production-EOL), which could contain multiple newlines and carriage returns). I believe the intention is that a canonical N-Quads document (or N-Triples) has a single newline terminating each statement.

The canon document should clarify what is intended here, and clarify that the trailing newline is to be considered part of the quad representation, so that joining can be defined as simple concatenation.

gkellogg commented 1 year ago

This also points towards the need to validate our examples, although the nature of the examples probably makes it difficult to directly use any existing library, as these individual algorithms are not necessarily publicly available.

gkellogg commented 1 year ago

To clarify, hashes are generated by serializing each quad to canonical form, with a single space between elements, and a terminating newline character, and the resulting serialized quads are concatenated together and the results are hashed.

So, from EXAMPLE 2, the example quads:

:p :q _:a .
_:a :s :u .

are serialized to N-Quads (applying the default prefix of <http://example.com/#>) to the following:

"<http://example.com/#p> <http://example.com/#q> _:a .\n_:a <http://example.com/#s> <http://example.com/#u> .\n"

and hashed via SHA256 resulting in 21d1dd5ba21f3dee9d76c0c00c260fa6f5d5d65315099e553026f4828d0dc77a.
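
The steps described above can be sketched in Node.js as follows. This is an illustration under assumptions, not the spec's reference code: `hashNQuads` is a hypothetical helper, the input quads are assumed to be already sorted and each terminated by a single `\n`, and the string is assumed to be hashed as UTF-8.

```javascript
const { createHash } = require('crypto');

// Hypothetical helper: concatenate serialized quads (each already
// ending in "\n") and return the SHA-256 digest as lowercase hex.
function hashNQuads(serializedQuads) {
  const data = serializedQuads.join('');
  return createHash('sha256').update(data, 'utf8').digest('hex');
}

const serialized = [
  '<http://example.com/#p> <http://example.com/#q> _:a .\n',
  '_:a <http://example.com/#s> <http://example.com/#u> .\n'
];

// Value quoted in this thread for the example.
const specHash =
  '21d1dd5ba21f3dee9d76c0c00c260fa6f5d5d65315099e553026f4828d0dc77a';

const digest = hashNQuads(serialized);
console.log(digest);
console.log('matches value quoted in thread:', digest === specHash);
```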

iherman commented 1 year ago

I am not sure EOL is unambiguous. Is it LineFeed? Is it Carriage Return + LineFeed? Does the \n idiom in JavaScript generate exactly the same set of bytes on MacOS, Windows, Linux and any other systems that will come in the future? Is that the same as the equivalent in Java, C, Lisp, or any other programming language that will come in the future?

I would feel "safer" if, at least for the purpose of C14N we would not use any extra characters in this place. To use the example above, that the hash would be calculated on

"<http://example.com/#p> <http://example.com/#q> _:a ._:a <http://example.com/#s> <http://example.com/#u> ."

afs commented 1 year ago

\n is U+000A.

U+000A is the same as used in XML C14N line ending.
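
For JavaScript specifically, this can be checked directly. The sketch below assumes Node.js; it shows that the `\n` escape in a string literal is the single code point U+000A and encodes to one byte in UTF-8, whereas the platform-dependent line ending is exposed separately as `os.EOL`.

```javascript
const os = require('os');

// The "\n" escape in a JavaScript string literal is always U+000A.
const eol = '\n';
console.log(eol.length);                       // number of UTF-16 code units
console.log(eol.codePointAt(0).toString(16));  // code point in hex

// Encoded as UTF-8 it is a single byte.
const eolBytes = Buffer.from(eol, 'utf8');
console.log(eolBytes);

// The platform line ending is a separate value and may differ by OS,
// so it should not be used when building the string to hash.
console.log(JSON.stringify(os.EOL));
```

Differences between systems typically come from using a platform line ending or from text-mode I/O translation, not from the `\n` literal itself.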

iherman commented 1 year ago

\n is U+000A.

U+000A is the same as used in XML C14N line ending.

Is this true for all programming languages and all systems under the Sun? If that is indeed the case then I am fine leaving \n in, but I am just cautious...

afs commented 1 year ago

Can speak for all PLs :-)

The spec states that "EOL is U+000A".

iherman commented 1 year ago

Can speak for all PLs :-)

The spec states that "EOL is U+000A".

Hm. Which spec?

To muddy the waters, the N-Quads spec's grammar says, in the production rules for terminals:

[8]  EOL ::=  [#xD#xA]+

I.e., it may become a source of bugs (or application complication) to require EOL. I believe it is safer not to require EOL characters for hashing.
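
To make the concern concrete: because the grammar's EOL production accepts any run of CR and LF characters, input from different sources may need normalizing before hashing. The following is a minimal sketch, assuming the input contains only statements separated by such runs (no comments or blank-only lines with spaces); `normalizeEol` is a hypothetical helper, not something defined by any spec.

```javascript
// Hypothetical helper: split on runs matching [#xD#xA]+ and
// re-terminate every statement with exactly one LF (U+000A).
function normalizeEol(doc) {
  return doc
    .split(/[\r\n]+/)
    .filter((line) => line.length > 0)
    .map((line) => line + '\n')
    .join('');
}

// Input using CRLF and an extra empty line between statements.
const messy =
  '<http://example.com/#p> <http://example.com/#q> _:a .\r\n\r\n' +
  '_:a <http://example.com/#s> <http://example.com/#u> .\r\n';

const normalized = normalizeEol(messy);
console.log(JSON.stringify(normalized));
```

Splitting on raw CR/LF is safe here because literals in N-Quads must escape those characters rather than contain them directly.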

dlongley commented 1 year ago

Without addressing the argument that "it may be safer not to require EOL characters for hashing", I'll note that making that change would certainly be a breaking one from existing implementations. If it ends up that we also don't make any other breaking changes at the end of the day, then I definitely do not think making such a change would be worth it.

I also think any bugs would be sorted by the test suite -- and that the EOL wars have, I think, mostly subsided in recent years, especially with MS's purchase of GitHub, MS tools like "notepad" working with just LF, the more popular and free use of VS Code, and so on.

pchampin commented 1 year ago

Mentioned during today's call: https://www.w3.org/2022/11/23-rch-minutes.html#t04

iherman commented 1 year ago

Without addressing the argument that "it may be safer not to require EOL characters for hashing", I'll note that making that change would certainly be a breaking one from existing implementations. If it ends up that we also don't make any other breaking changes at the end of the day, then I definitely do not think making such a change would be worth it.

I take your point @dlongley. However, if we agree that EOL or, to be more precise, a 0x0a character should be in the data, we have to find a way to put this into the spec unambiguously and I think we should add some notes (or other non-normative comment) to warn implementers about possible caveats in their own environment.

I also think any bugs would be sorted by the test suite -- and that the EOL wars have, I think, mostly subsided in recent years, especially with MS's purchase of GitHub, MS tools like "notepad" working with just LF, the more popular and free use of VS Code, and so on.

I hope you are right. I am just paranoid 😀


B.t.w., I have run a test with my code adding the \n, and the generated hash value is now the same as in the spec. I.e., the Node.js call

createHash("sha256").update(data).digest('hex')

is indeed the right one...

yamdan commented 1 year ago

Then we might as well define a canonical N-Quads, based on the definition for N-Triples, with an additional restriction as:

ref: https://github.com/w3c/rdf-canon/issues/37#issuecomment-1324019950

gkellogg commented 1 year ago

As we discussed, we need to file an erratum on N-Triples, as it does not describe the need for a 0x000A record terminator for each statement in the document. And, 0x000A (\n) is not part of the canonical form for an individual quad.

There's no question that any generated N-Quads serialization of the canonicalized dataset needs such line terminators. I believe it's required at that step in Hash N-Degree Quads, as well. My implementation does this and passes the test suite, so I'm confused by @dlongley's assertion that they should not be used.

cc/@afs

dlongley commented 1 year ago

@gkellogg,

My implementation does this and passes the test suite, so I'm confused by @dlongley's assertion that they should not be used.

My position is that they should be used -- so our wires got crossed somewhere. So I think we're in agreement.

iherman commented 1 year ago

As we discussed, we need to file an erratum on N-Triples, as it does not describe the need for a 0x000A record terminator for each statement in the document. And, 0x000A (\n) is not part of the canonical form for an individual quad.

👍 but... the reality is that it may take a long time to process the errata and get an updated N-Quads spec out (something right up the alley of @pchampin). I believe that, in the meantime, we should also annotate our own spec making these requirements clear.

(Let alone the fact that libraries out there may take even longer to be updated. This whole thing came up because the RDF library I use produces a single nquad without a \n... so a warning is necessary.)
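
As a sketch of the kind of guard such a warning might suggest, the hypothetical helper below makes sure a serialized quad ends in exactly one `\n`, whatever the RDF library returned. It is an illustration only and does not refer to any particular library's API.

```javascript
// Hypothetical helper: strip any trailing CR/LF run, then append
// a single LF so every quad is terminated the same way.
function ensureTerminated(quad) {
  return quad.replace(/[\r\n]+$/, '') + '\n';
}

// A quad as a library might emit it, without a terminator.
const bare = '_:a <http://example.com/#s> <http://example.com/#u> .';

console.log(JSON.stringify(ensureTerminated(bare)));
console.log(JSON.stringify(ensureTerminated(bare + '\r\n')));
```

Applying this before concatenation makes the hashing input independent of whether the library emits a terminator.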

afs commented 1 year ago

@iherman -- this WG could write its own - but it will be "The RDF Canonicalization N-Quads canonical form". Confusing for library providers, contributors and users to have two canonical form specs. It is framing a separate ecosystem.

it may take a long time

And it may not. That's speculation unless there is evidence.

There is nothing stopping the work being done today, then send it to the RDF errata WG. The spec source is open source. There is a people overlap.

The other WG can produce an updated version and publish it as soon as it's ready. It does not have to be at the end of the WG.

afs commented 1 year ago

that libraries out there may take even longer to be updated

Put in a PR today! For Apache Jena - https://github.com/apache/jena

(open source) libraries are communities. Saying communities will/won't do X isn't the way.

iherman commented 1 year ago

@iherman -- this WG could write its own - but it will be "The RDF Canonicalization N-Quads canonical form". Confusing for library providers, contributors and users to have two canonical form specs. It is framing a separate ecosystem.

This is not what I said. Of course, the official spec must be the one in the nquads document. What I said is:

As for how long it would take for N-Quads to be updated: I doubt there is appetite today to write an RDF maintenance charter, get it through the AC vote, etc., to take care of these changes. Maybe the RDF-star WG can take care of this, too; I simply do not know. I leave this in the able hands of @pchampin...

afs commented 1 year ago

I doubt there is appetite today to write an RDF maintenance charter

RDF-star WG is already chartered for errata on all documents it touches, which will have to include N-Quads. It is also set up to be an ongoing WG, via W3C process changes that make the stop-start nature of working groups less of a barrier.

https://www.w3.org/2022/08/rdf-star-wg-charter/

"For every recommendation updated by this Working Group (RDF-star WG), the pending editorial errata will also be addressed." and "The Working Group will also consider allowing new features in these recommendations, according to Section 6.3.11.4 of the W3C process, in order to render future evolutions easier."

Between these two, adding NQ canonicalization is possible. There is no issue about how strictly something is an erratum or not.

gkellogg commented 1 year ago

I posted to the RDF Comments mailing list: https://lists.w3.org/Archives/Public/public-rdf-comments/2022Nov/0000.html. Presumably, this can be used as the details for RDF 1.1 Errata. I'm not sure of the process in that group for accepting and noting an erratum, but in other groups I'm involved with there hasn't been a formal process for accepting such errata, leaving it to editors and staff.

Certainly, it's within the scope of the RDF-star WG to consider this, as well as other noted errata.

afs commented 1 year ago

There isn't a formal process to accept RDF 1.1 errata - there's no "RDF 1.1 WG" to accept it. The comments list is the place and there is a wiki area. I don't know of any official responsibility for the errata wiki - it's been "volunteer-maintained" recently.

Added: https://www.w3.org/2001/sw/wiki/RDF1.1_Errata#erratum_32