Closed: nightpool closed this issue 1 year ago
It's true that quads are unordered, and that the proper way to compare datasets is through isomorphism, which comes from RDF Concepts. But RDF Dataset Canonicalization effectively does the same thing: blank nodes are named such that a straight sort of the output quads can be used to compare two datasets, by simply running them both through C14N.
When serializing a canonicalized dataset, the resulting N-Quads should be lexicographically sorted, and generated in canonical form as described in N-Triples, on which N-Quads is based.
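To make that concrete, here's a rough Python sketch of the comparison; canonicalize_to_nquads is just a placeholder for whatever conforming URDNA2015 implementation you use, not a real API:

def datasets_equal(dataset_a, dataset_b, canonicalize_to_nquads):
    # canonicalize_to_nquads is assumed to return one canonical N-Quads
    # string per quad, with blank nodes relabeled by URDNA2015 (C14N).
    quads_a = sorted(canonicalize_to_nquads(dataset_a))
    quads_b = sorted(canonicalize_to_nquads(dataset_b))
    # After canonicalization, a straight sort-and-compare stands in for
    # the more general isomorphism check from RDF Concepts.
    return quads_a == quads_b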
Is this spec'd anywhere? If not, how am I supposed to make sure my SHA256 hash of the dataset (as defined and used by ld-signatures) is interoperable with someone else's SHA256 hash of the dataset?
Yeah, that was the exact question I had.
To add to that: I can't seem to figure out what I'm doing wrong in URDNA2015. It assigns wrong identifiers on non-trivial datasets, but I seem to have implemented everything exactly as the spec says.
The description of the "Hash N-Degree Quads" algorithm has the following step:
Replace issuer, by reference, with chosen issuer.
The issuer variable is an input to this algorithm. Does that mean I have to replace it outside of the function too, like the way C++ & references work?
edit: upon looking closer at both places where the algorithm is called, they both use the issuer returned in the result and don't do anything with the one passed to it.
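In Python terms my reading is roughly this (placeholder names, algorithm body elided):

import copy

def hash_n_degree_quads(identifier, issuer):
    # Sketch only. Rebinding the 'issuer' parameter inside this function
    # never affects the caller's variable (Python has no C++-style
    # reference parameters), so the chosen issuer has to travel back in
    # the return value, which is exactly what the call sites rely on.
    chosen_issuer = copy.deepcopy(issuer)
    # ... permutation loop that populates chosen_issuer elided ...
    issuer = chosen_issuer  # visible only inside this function
    return {"hash": "<n-degree hash>", "issuer": chosen_issuer}

# Call sites then adopt the returned issuer:
#   result = hash_n_degree_quads(bnode, issuer)
#   issuer = result["issuer"]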
edit2: nvm, I found a self-contained Python library that implements all of that correctly; I'm going to test mine against it, comparing the results of all the intermediate steps.
The algorithm defines a specialization of an RDF Dataset, which is inherently an unordered structure, as graphs do not have any inherent order (which is why a List/Collection is a necessary structure for providing ordering). As noted, there is a non-normative note saying that, practically, this will involve serializing the dataset to N-Quads, but it does not specify precisely how to do this, as that is outside the scope of the algorithm.
There's a gap between the normalization and LD Signatures specs where the specifics of serialization need to be specified. Either RDF Dataset Normalization needs to normatively define a serialization and LD Signatures needs to refer to it, or LD Signatures needs to normatively specify how to serialize a dataset to be signed.
Note that this is a draft specification, and is expected to be taken up by a future Working Group.
@nightpool,
Does the .sort here correspond to any spec?
Is this spec'd anywhere? If not, how am I supposed to make sure my SHA256 hash of the dataset (as defined and used by ld-signatures) is interoperable with someone else's SHA256 hash of the dataset?
This is under-specified -- thank you for finding it. The issue here is that a concrete serialization of the canonical dataset using canonical N-Quads needs to be specified, including the fact that the quads must be lexicographically ordered. All of the current implementations sort the N-Quads so the same hash will be produced.
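For concreteness, the convention looks roughly like this in Python (illustrative only; canonical_nquads is assumed to be the list of canonical N-Quads lines produced by C14N, each ending in a newline):

import hashlib

def dataset_sha256(canonical_nquads):
    # Sort the canonical N-Quads lexicographically, concatenate them into
    # a single document (one quad per line), and hash that document.
    serialized = "".join(sorted(canonical_nquads))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()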
I'll file a bug on the need to specify this in the RDF Dataset Normalization spec.
There's a gap between the normalization and LD Signatures specs where the specifics of serialization need to be specified
yes, this is what I said in my original issue :D glad we could reach consensus.
I'll file a bug on the need to specify this in the RDF Dataset Normalization spec.
sounds good!
That Python library did help; let me note what else is missing from the RDF Dataset Normalization spec.
The "Hash First Degree Quads" algorithm does not specify how exactly you're supposed to join the serialized strings before hashing. I interpreted the spec as having to just concatenate them together. The actual expected way of doing that is to join them with newline characters, and then append a newline character at the end of the resulting string, like this:
return hash(String.join("\n", nquads)+'\n');
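Or, in Python, the tail end of "Hash First Degree Quads" comes out roughly like this (assuming nquads already holds the per-quad canonical serializations with the reference blank node rewritten to _:a and every other blank node to _:z, per the earlier steps of the algorithm):

import hashlib

def hash_first_degree_quads(nquads):
    # Sort the serialized quads, join them with newlines, and append a
    # final newline before hashing; this joining detail is what the spec
    # leaves out.
    joined = "\n".join(sorted(nquads)) + "\n"
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()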
I then had one more bug to fix (I didn't notice one check in "Hash N-Degree Quads") and now all the tests pass.
I'm trying to figure out if there is more that we need to do here, @gkellogg or @dlongley?
We do have better algorithms defined in the spec now:
https://w3c.github.io/vc-data-integrity/#algorithms
and specific ones in the cryptosuites that speak to use of RDF canonicalization:
https://w3c.github.io/vc-di-eddsa/#algorithms
... but we also know that the algorithms are still a bit hard to read and piece together.
I'm going to keep this issue open until we feel like we're at feature freeze w/ the algorithms (next couple of months). I expect that we'll also be providing an option to do canonicalization via the JSON Canonicalization Scheme (JCS), which might help the Mastodon/fediverse folks out with an easier-to-implement option for signing ActivityPub messages.
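For anyone curious, a JCS-based flow is roughly this in Python; note that json.dumps with these options only approximates JCS (RFC 8785), whose number serialization follows ECMAScript rules, so treat it as an illustration rather than a conforming implementation:

import hashlib
import json

def jcs_like_sha256(document):
    # Approximate JCS: sorted keys, no insignificant whitespace, UTF-8 output.
    # Close enough for documents whose values are strings, booleans and ints;
    # floats need RFC 8785 number formatting to be conforming.
    canonical = json.dumps(document, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Example: jcs_like_sha256({"type": "Note", "content": "hi"})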
Specifying a canonical serialization of a normalized dataset, while trivial, is still an open item in the RCH WG, as is a mechanism for selective disclosure. It isn't yet decided whether it goes in the rdf-canon spec or a new spec.
It seems to me that the pace of the RCH WG is a bit slow, and other issues have been prioritized in the bi-weekly discussions. For a straight serialization of a normalized dataset, the additional text would be easy enough; we're just waiting on direction from the WG.
I expect that we'll also be providing an option to do canonicalization via JSON Canonicalization Scheme (JCS), which might help the Mastodon/fediverse folks out with an easier to implement option for signing ActivityPub messages.
This option exists now for the Mastodon/fediverse folks... both for EdDSA and ECDSA:
We have multiple fediverse implementers that are looking at implementing jcs-eddsa-2022 (which doesn't use RDF Dataset Canonicalization, and uses JSON Canonicalization Scheme instead).
The RDF Dataset Canonicalization spec now includes the language described in this issue as well:
https://w3c.github.io/rdf-canon/spec/
So, I believe we've done what we can here. I'm going to mark this as pending close, and the issue will be closed in 7 days unless someone disagrees that we've addressed the issue in the current set of specifications.
Hey all, someone in the ActivityPub community was attempting to implement JSON-LD signatures recently, and they came across this issue:
https://socialhub.activitypub.rocks/t/making-sense-of-rsasignature2017/347
As I understand the specs in question, URDNA2015's Normalization Algorithm returns a normalized dataset, which elides the problem of serialization. However, the ld-signatures algorithm seems to be expecting URDNA2015 to return a canonicalized value, which is a serialization of a normalized dataset, presumably specifying ordering and things like that.
Here's a quote from the rdf-normalization spec discussing this very issue:

When serializing a canonicalized dataset, the resulting N-Quads should be lexicographically sorted, and generated in canonical form as described in N-Triples, on which N-Quads is based.
What rules should implementations use to turn URDNA2015's normalized dataset into a canonical serialized value? Ruby's rdf-normalize gem uses the following code:
https://github.com/ruby-rdf/rdf-normalize/blob/develop/lib/rdf/normalize/writer.rb#L56-L63
Does the .sort here correspond to any spec?