w3c / vc-data-integrity

W3C Data Integrity Specification
https://w3c.github.io/vc-data-integrity/

serialization, canonicalized values and normalized datasets #11

Closed: nightpool closed this issue 1 year ago

nightpool commented 4 years ago

Hey all, someone in the ActivityPub community was attempting to implement JSON-LD signatures recently, and they came across this issue:

https://socialhub.activitypub.rocks/t/making-sense-of-rsasignature2017/347

> None of the specs say anything about the ordering of the RDF quads. The JSON-LD spec doesn't mention it, but the test suite page tells you that you need to compare them using "RDF isomorphism", which, if I understood all that type theory mumbo-jumbo correctly (gosh, it feels like making a Telegram client from scratch again), means you have to disregard the order, which is what I do for my tests. The URDNA2015 spec doesn't say anything about the ordering either, but having "For each quad, quad, in input dataset:" as the last step, even though you sort things internally, kinda implies that you're supposed to keep the order of the original dataset. The test suite page doesn't say anything about ordering either.

As I understand the specs in question, URDNA2015's Normalization Algorithm returns a normalized dataset, which leaves the problem of serialization unaddressed. However, the ld-signatures algorithm seems to expect URDNA2015 to return a canonicalized value, i.e. a serialization of a normalized dataset, presumably with a specified ordering.

Here's a quote from the rdf-normalization spec discussing this very issue:

> This specification defines a normalized dataset to include stable identifiers for blank nodes, but practical uses of this will always generate a canonical serialization of such a dataset
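To make that distinction concrete, here is a minimal sketch with made-up quads (nothing here is spec text): the normalized dataset is an unordered set with stable blank node labels, while the canonicalized value fixes a concrete byte sequence.

```python
# Made-up example quads; labels follow the _:c14nN convention URDNA2015 uses.
normalized_dataset = {
    '_:c14n0 <http://example.org/p> "b" .',
    '_:c14n0 <http://example.org/p> "a" .',
}  # a set: stable identifiers, but no serialization order yet

# One possible canonical serialization: sort, one statement per line.
canonicalized_value = "\n".join(sorted(normalized_dataset)) + "\n"
```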

What rules should implementations use to turn URDNA2015's normalized dataset into a canonical serialized value? Ruby's rdf-normalize gem uses the following code:

https://github.com/ruby-rdf/rdf-normalize/blob/develop/lib/rdf/normalize/writer.rb#L56-L63

```ruby
statements = RDF::Normalize.new(@repo, @options).
  statements.
  reject(&:variable?).
  map {|s| format_statement(s)}.
  sort.
  each do |line|
    puts line
  end
```

Does the `.sort` here correspond to any spec?

gkellogg commented 4 years ago

It's true that quads are unordered, and that the proper way to compare datasets is through isomorphism, which comes from RDF Concepts. But RDF Dataset Canonicalization effectively does the same thing: blank nodes are named such that a straight sort of the output quads can be used to compare two datasets, by simply running them both through C14N.

When serializing a canonicalized dataset, the resulting N-Quads should be lexicographically sorted, and generated in canonical form as described in N-Triples, on which N-Quads is based.
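A minimal sketch of that comparison, assuming the inputs are already the N-Quads statements produced by C14N (the helper names here are hypothetical):

```python
def canonical_serialization(canonical_quads: list[str]) -> str:
    # Each element is one canonical N-Quads statement without a trailing
    # newline; sort lexicographically and emit one statement per line.
    return "\n".join(sorted(canonical_quads)) + "\n"

def datasets_equal(quads_a: list[str], quads_b: list[str]) -> bool:
    # After C14N, blank node labels are deterministic, so comparing the
    # sorted serializations stands in for a general isomorphism check.
    return canonical_serialization(quads_a) == canonical_serialization(quads_b)
```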

nightpool commented 4 years ago

> When serializing a canonicalized dataset, the resulting N-Quads should be lexicographically sorted, and generated in canonical form as described in N-Triples, on which N-Quads is based.

Is this spec'd anywhere? If not, how am I supposed to make sure my SHA256 hash of the dataset (as defined and used by ld-signatures) is interoperable with someone else's SHA256 hash of the dataset?

grishka commented 4 years ago

Yeah, that was the exact question I had.

To add to that: I can't seem to figure out what I'm doing wrong in URDNA2015. It assigns wrong identifiers on non-trivial datasets, but I seem to have implemented everything exactly as the spec says.

[Screenshot, 2019-11-26 16:19:22]

The description of the "Hash N-Degree Quads" algorithm has the following step:

> Replace issuer, by reference, with chosen issuer.

The issuer variable is an input to this algorithm. Does that mean I have to replace it outside of the function too, like the way C++ references (&) work?

edit: upon looking closer at both places where the algorithm is called, they both use the issuer returned in the result and don't do anything with the one passed to it.
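That reading can be sketched without by-reference semantics at all: the function simply returns the chosen issuer, and callers adopt the returned one. Everything below is hypothetical illustration, with the permutation logic elided:

```python
import copy

class IdentifierIssuer:
    # Hypothetical canonical-identifier issuer, just enough for the sketch.
    def __init__(self, prefix: str = "_:b") -> None:
        self.prefix = prefix
        self.counter = 0
        self.issued = {}  # old blank node id -> issued canonical id

    def issue(self, old_id: str) -> str:
        # Issue a new identifier the first time old_id is seen.
        if old_id not in self.issued:
            self.issued[old_id] = f"{self.prefix}{self.counter}"
            self.counter += 1
        return self.issued[old_id]

    def clone(self) -> "IdentifierIssuer":
        return copy.deepcopy(self)

def hash_n_degree_quads(identifier: str, issuer: IdentifierIssuer):
    # Permutation loop elided: each candidate path works on issuer.clone(),
    # and the copy belonging to the winning path becomes chosen_issuer.
    chosen_issuer = issuer.clone()
    # "Replace issuer, by reference, with chosen issuer" then amounts to
    # returning chosen_issuer; both call sites use the returned issuer.
    return "<hash elided>", chosen_issuer
```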

edit2: nvm, I found a self-contained Python library that implements all that correctly; I'm going to test mine against it, comparing the results of all the intermediate steps.

gkellogg commented 4 years ago

The algorithm defines a specialization of an RDF Dataset, which is inherently an unordered structure, as graphs do not have any inherent order (which is why a List/Collection is a necessary structure to provide ordering). As noted, there is a non-normative note that, practically, this will involve serializing the dataset to N-Quads, but it does not specify precisely how to do this, which is outside the scope of the algorithm.

There's a gap between the normalization and LD Signatures specs where the specifics of serialization need to be specified. Either RDF Dataset Normalization needs to normatively define a serialization and LD Signatures needs to refer to it, or LD Signatures needs to normatively specify how to serialize a dataset to be signed.

Note that this is a draft specification, and is expected to be taken up by a future Working Group.

dlongley commented 4 years ago

@nightpool,

> Does the `.sort` here correspond to any spec?

> Is this spec'd anywhere? If not, how am I supposed to make sure my SHA256 hash of the dataset (as defined and used by ld-signatures) is interoperable with someone else's SHA256 hash of the dataset?

This is under-specified -- thank you for finding it. The issue here is that a concrete serialization of the canonical dataset using canonical N-Quads needs to be specified, including the fact that the quads must be lexicographically ordered. All of the current implementations sort the N-Quads so the same hash will be produced.
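In other words, something like the following sketch, where `canonical_quads` is assumed to be the list of N-Quads statements produced by canonicalization; pinning down these exact bytes is what the spec text needs to add:

```python
import hashlib

def canonical_dataset_hash(canonical_quads: list[str]) -> str:
    # Sort the canonical N-Quads lexicographically, serialize one statement
    # per line with a trailing newline, then hash the resulting document.
    document = "\n".join(sorted(canonical_quads)) + "\n"
    return hashlib.sha256(document.encode("utf-8")).hexdigest()
```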

I'll file a bug on the need to specify this in the RDF Dataset Normalization spec.

nightpool commented 4 years ago

> There's a gap between the normalization and LD Signatures specs where the specifics of serialization need to be specified

yes, this is what I said in my original issue :D glad we could reach consensus.

> I'll file a bug on the need to specify this in the RDF Dataset Normalization spec.

sounds good!

grishka commented 4 years ago

That Python library did help. Let me note what else is missing from the RDF Dataset Normalization spec.

The "Hash First Degree Quads" algorithm does not specify how exactly you're supposed to join the serialized strings before hashing. I interpreted the spec as having to just concatenate them together. The actual expected way of doing that is to join them with newline characters, and then append a newline character at the end of the resulting string, like this:

```java
return hash(String.join("\n", nquads) + '\n');
```

I then had one more bug to fix (I didn't notice one check in "Hash N-Degree Quads") and now all the tests pass.

msporny commented 1 year ago

I'm trying to figure out if there is more that we need to do here, @gkellogg or @dlongley?

We do have better algorithms defined in the spec now:

https://w3c.github.io/vc-data-integrity/#algorithms

and specific ones in the cryptosuites that speak to use of RDF canonicalization:

https://w3c.github.io/vc-di-eddsa/#algorithms

... but we also know that the algorithms are still a bit hard to read and piece together.

I'm going to keep this issue open until we feel like we're at feature freeze w/ the algorithms (next couple of months). I expect that we'll also be providing an option to do canonicalization via JSON Canonicalization Scheme (JCS), which might help the Mastodon/fediverse folks out with an easier to implement option for signing ActivityPub messages.
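For a feel of why JCS is the easier option, here is a rough, non-conformant Python approximation of RFC 8785 (it ignores JCS's rules for number formatting and assumes plain string keys):

```python
import json

def jcs_approx(value) -> bytes:
    # Approximate JCS: lexicographically sorted object keys, no
    # insignificant whitespace, UTF-8 output. A conformant RFC 8785
    # implementation also pins down number and string serialization.
    return json.dumps(value, separators=(",", ":"), sort_keys=True,
                      ensure_ascii=False).encode("utf-8")
```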

gkellogg commented 1 year ago

Specifying a canonical serialization of a normalized dataset, while trivial, is still an open item in RDF RCH, as is a mechanism for selective disclosure. It still isn't decided whether it goes in the rdf-canon spec or a new spec.

It seems to me that the pace of the RCH WG is a bit slow, and other issues have been prioritized in the bi-weekly discussions. For a straight serialization of a normalized dataset, additional text would be easy enough; we're just waiting on direction from the WG.

msporny commented 1 year ago

> I expect that we'll also be providing an option to do canonicalization via JSON Canonicalization Scheme (JCS), which might help the Mastodon/fediverse folks out with an easier to implement option for signing ActivityPub messages.

This option exists now for the Mastodon/fediverse folks, for both EdDSA and ECDSA.

We have multiple fediverse implementers that are looking at implementing jcs-eddsa-2022 (which doesn't use RDF Dataset Canonicalization, and uses JSON Canonicalization Scheme instead).

The RDF Dataset Canonicalization spec now also includes the language discussed in this issue:

https://w3c.github.io/rdf-canon/spec/

So, I believe we've done what we can here. I'm going to mark this as pending close, and the issue will be closed in 7 days unless someone disagrees that we've addressed the issue in the current set of specifications.