w3c / rdf-canon

RDF Dataset Canonicalization (deliverable of the RCH working group)
https://w3c.github.io/rdf-canon/spec/

Can the final canonicalized/hashed dataset/graph contain its own hash? #54

Closed tkuhn closed 1 year ago

tkuhn commented 1 year ago

I would like to raise this question / feature request: Do we plan to define the canonicalization/hashing in such a way that would allow for self-describing datasets/graphs that contain their own hash?

In certain circumstances, hashes make great identifiers. And in certain circumstances, making datasets/graphs self-describing, in the sense that they contain their own metadata/provenance, is a good design choice. Taking these together, one sometimes wants to refer to the dataset/graph identifier within the dataset/graph itself, and if that identifier includes the hash value of the dataset/graph as a whole, then the dataset/graph needs to include its own hash value.

This can be achieved by using a special character like a blank space as a placeholder for the hash within URIs. First, a preliminary dataset/graph is generated with the blank space appearing in the (proto-)URIs at the positions where the hash is supposed to appear. Then the canonicalization/hashing is performed, and finally the hash (in a given encoding) is substituted for the blanks in those URIs. The resulting dataset/graph is then used instead of the preliminary one, e.g. published on the Web by the data producer, and the preliminary one is thrown away. To validate whether a dataset/graph matches a given hash, one first replaces all occurrences of the hash with a blank space and then runs the canonicalization/hashing steps. (One could also support hashes in literal values, but that is a bit more complicated and I am not sure it would be worthwhile.)
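For illustration, here is a minimal sketch of that create/verify round trip. It assumes the dataset is already available as canonical N-Quads text (so it sidesteps blank node canonicalization entirely), picks SHA-256 with base64url encoding as an arbitrary example, and is not the actual Trusty URI algorithm; the URIs and function names are invented for the sketch.

```python
import base64
import hashlib
import re

# A blank space inside an IRI marks where the hash will go.

def hash_nquads(nquads: str) -> str:
    """Hash the serialized quads and encode the digest as base64url (no padding)."""
    digest = hashlib.sha256(nquads.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

def splice_hash(nquads: str, h: str) -> str:
    """Substitute the hash for the blank-space placeholder, but only inside <...> IRIs."""
    return re.sub(r"<([^<>]*) ([^<>]*)>", rf"<\g<1>{h}\g<2>>", nquads)

def verify(final_nquads: str, claimed_hash: str) -> bool:
    """Put the placeholder back inside IRIs containing the claimed hash, then re-hash."""
    pattern = rf"<([^<>]*){re.escape(claimed_hash)}([^<>]*)>"
    preliminary = re.sub(pattern, r"<\g<1> \g<2>>", final_nquads)
    return hash_nquads(preliminary) == claimed_hash

# Hypothetical example: one quad whose subject and graph IRIs embed the placeholder.
preliminary = (
    '<http://example.org/np/ > <http://purl.org/dc/terms/creator> '
    '<https://orcid.org/0000-0000-0000-0000> <http://example.org/np/ #head> .\n'
)
h = hash_nquads(preliminary)         # hash computed over the placeholder version
final = splice_hash(preliminary, h)  # published version with the hash embedded
assert verify(final, h)
```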

We are using this technique with nanopublications. Here you can see an example of a nanopublication that is identified with a hash (RAPcOLUc0B7X-6blgBJyFYTHo3Dz7ti54SXvNW6SpPXm8) calculated on its entire content, including its own identifier: https://np.petapico.org/RAPcOLUc0B7X-6blgBJyFYTHo3Dz7ti54SXvNW6SpPXm8.trig.txt

(This example actually also includes its own digital signature, which is another related issue.)

We introduced these URIs, which we called "Trusty URIs", a while ago in this paper: https://link.springer.com/chapter/10.1007/978-3-319-07443-6_27

I should also note that in our specific case we use the transformation step to also get rid of all blank nodes by skolemizing them. So we are not dealing with blank nodes in our own canonicalization/hashing steps, which of course makes things much simpler. What I propose above might interfere in some ways with the handling of blank nodes we need to do in the scope of this working group. I don't have a full understanding of the ramifications of this, and it would surely need further scrutiny.
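For illustration only, a rough sketch of such a Skolemization step, assuming N-Quads text and a simple blank node label pattern; the genid base IRI is made up, and this is not the nanopublication implementation.

```python
import re

def skolemize_nquads(nquads: str,
                     base: str = "http://example.org/.well-known/genid/") -> str:
    """Rewrite N-Quads blank node labels (_:label) as IRIs under a Skolem base,
    so the subsequent canonicalization/hashing never sees blank nodes.
    Assumes simple labels and that "_:" does not occur inside literal values."""
    return re.sub(r"_:([A-Za-z0-9]+)", rf"<{base}\g<1>>", nquads)

print(skolemize_nquads('_:b0 <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/g> .\n'))
# <http://example.org/.well-known/genid/b0> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/g> .
```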

I think this ability of datasets/graphs to include their own hash-based identifiers is, in general, really powerful and useful. But I also see that it would probably introduce quite a bit of additional complexity into this standardization effort, on top of the possible interference with blank node handling (e.g. the distinction of a "preliminary" from a "final" dataset/graph, the definition of the meaning of "special characters" like the blank space, the precise encoding scheme to be used for the hash values included in the URIs, and more).

I hope this makes sense and would love to hear your opinions.

tkuhn commented 1 year ago

As an addendum: This would of course work very well together with DIDs (which didn't exist back when we proposed Trusty URIs).

iherman commented 1 year ago

I am not sure what the question is.

You describe in the introduction a perfectly valid way of creating a graph with an embedded hash. This WG will provide you with the two basic building blocks: the canonicalization step and the hashing step. What you describe is a way of combining these two building blocks to achieve your goal. In other words, it is a layer on top of what the WG is chartered to do.

If what you ask is whether the WG would provide a standard for what you describe, then my answer is: I do not think so, at least it is not in the charter of the group. The WG may decide, however, to produce a WG note with this, and that would be perfectly all right.


Interestingly, the VC specification does something vaguely similar with its proof mechanism, but following a different route using datasets instead of a single graph. In VC, the proof metadata and the "content" (whatever that means) to be hashed are separated. Your application could be based on a dataset instead of a single graph: one graph in the dataset is subject to hashing, whereas the other graph is for metadata that will include the hash value. It requires a specific property to separate these two from one another (which is verifiableCredential in the VC case). Just food for thought...
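To make the dataset-based idea concrete, here is a small sketch under invented assumptions: the graph names, the contentGraph/contentHash properties, and the sorted-N-Quads stand-in for canonicalization are all illustrative, not the VC data integrity mechanism or the WG's canonicalization algorithm.

```python
import hashlib

CONTENT_GRAPH = "<http://example.org/doc#content>"
METADATA_GRAPH = "<http://example.org/doc#metadata>"

# The "content" graph: the only part that feeds into the hash.
content_quads = [
    '<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" '
    + CONTENT_GRAPH + " .",
]

# Stand-in for canonicalization: sort the content quads, then hash the result.
canonical = "\n".join(sorted(content_quads)) + "\n"
content_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# The "metadata" graph records the hash and points at the content graph,
# without ever changing the bytes that were hashed.
metadata_quads = [
    f"<http://example.org/doc> <http://example.org/vocab#contentGraph> {CONTENT_GRAPH} {METADATA_GRAPH} .",
    f'<http://example.org/doc> <http://example.org/vocab#contentHash> "{content_hash}" {METADATA_GRAPH} .',
]

dataset = content_quads + metadata_quads  # published together as one RDF Dataset
```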

tkuhn commented 1 year ago

OK, thanks, I think you understood my question correctly (sorry that I wasn't clear enough), and I am glad to hear that the answer seems to be yes. But I am a bit confused now.

My understanding was (perhaps wrongly) that the two building blocks of canonicalization and hashing are meant to be the only two steps to apply (in sequence) to "uniquely and deterministically calculate a hash of RDF Datasets" as written in the Charter. So, the overall workflow so far is RDF Dataset in / hash out, and in between there are these two modules of canonicalizing and hashing that can be chosen/parameterized depending on what we come up with in this Working Group.

Now, in the setting that I am describing above, we would need an additional pre-processing step, so pre-process>canonicalize>hash, where the pre-processing would take the RDF Dataset and the supposed hash and replace the occurrences of the hash in the dataset with a placeholder (like a blank). To create a new hashed dataset, one would then only use the canonicalize>hash part (with an input of an RDF Dataset with placeholders, which is not fully well-formed RDF in the solution described above), whereas for checking an existing hash, one would need the full pre-process>canonicalize>hash. Alternatively, we could move the pre-processing into the canonicalization step, so that step would then need two inputs: the RDF Dataset and the hash. So in any case, there seem to be some small but fundamental adjustments needed to the overall abstract workflow, which in my view falls into the part we should aim to standardize.
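One way to picture the two workflows just described, treating canonicalize and hash as opaque building blocks (whatever this WG standardizes) and pre_process as the extra step; all names here are illustrative, not proposed API.

```python
from typing import Callable

Canonicalize = Callable[[str], str]     # RDF Dataset (as text) -> canonical N-Quads
Hash = Callable[[str], str]             # canonical N-Quads -> encoded hash value
PreProcess = Callable[[str, str], str]  # (dataset, claimed hash) -> dataset with placeholders

def create_hash(preliminary_dataset: str,
                canonicalize: Canonicalize, hash_: Hash) -> str:
    """Creation: the input already contains placeholders, so only canonicalize>hash runs."""
    return hash_(canonicalize(preliminary_dataset))

def check_hash(published_dataset: str, claimed_hash: str,
               pre_process: PreProcess,
               canonicalize: Canonicalize, hash_: Hash) -> bool:
    """Verification: the full pre-process>canonicalize>hash pipeline, then comparison."""
    preliminary = pre_process(published_dataset, claimed_hash)
    return hash_(canonicalize(preliminary)) == claimed_hash
```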

Moreover, we would need to make sure the canonicalization and hashing algorithms and implementations are not confused by placeholders in URIs, like blank spaces, which make this kind of preliminary RDF not well-formed. RDF4J doesn't complain about such blanks in URIs, but other libraries might.

I hope that makes sense. Or, quite possibly, I am fundamentally misunderstanding something here.

iherman commented 1 year ago

@tkuhn yes, what you describe makes sense, and your assumption about the two building blocks is correct (with the minor remark that those two steps can presumably be used independently of one another, depending on the use case).

Where you jump ahead, however, is when you say:

> So in any case, there seem to be some small but fundamental adjustments needed to the overall abstract workflow, which in my view falls into the part we should aim to standardize.

This makes two assumptions:

  1. the WG decides to take up the use case, whereby an RDF Dataset would "carry" its own hash with it or, to be more precise, would carry the hash of an important subset of the dataset.
  2. provided (1) is a working group work item, to use the official term, the approach you describe is the right one

I am not in the position to answer (1), this is something for our fearless chairs and staff contact to decide upon (@philarcher @peacekeeper @pchampin). This needs discussions.

From a technical point of view, I am also not convinced that your approach is the right one. In general, I think separating the "data" from its "metadata" is a better approach (knowing that the distinction between these two may be fuzzy). A hash of the "data" is "metadata" in this respect, in my view. And the VC approach that I outlined above, i.e., using Datasets (or named graphs, if you prefer that terminology) to separate the core data from the metadata, provides a cleaner approach. But that is of course subject to further discussions.

tkuhn commented 1 year ago

> Where you jump ahead, however, is when you say:
>
> > So in any case, there seem to be some small but fundamental adjustments needed to the overall abstract workflow, which in my view falls into the part we should aim to standardize.
>
> This makes two assumptions:
>
> 1. the WG decides to take up the use case, whereby an RDF Dataset would "carry" its own hash with it or, to be more precise, would carry the hash of an important subset of the dataset.

Absolutely, that's why I phrased it as a question in the issue title. The above sentence was meant to be read in the scope of "if the answer to that question is yes, then ...".

> 2. provided (1) is a working group work item, to use the official term, the approach you describe is the right one

I don't think there is one right way here. I agree that in many situations the separation of data and metadata as you describe it is the best solution, but in other situations it isn't. There are advantages and downsides. For example, if we want to use something like DIDs (or Trusty URIs) to have hashes as identifiers, with all their benefits (enforced immutability, verifiability, signatures on top, etc.), then separating data and metadata means we need two identifiers: one for the dataset and one for the metadata record. But these only really make sense together, so when we look at one of them we are likely to also want to look at the other. With these hash-based identifiers, one can refer to the other, but not in both directions, at least not directly via their main identifiers (otherwise there would be a cycle, which is impossible to produce unless one applies the placeholder tricks I described above).

I have practical (and I believe successful) experience with this technique with nanopublications. The main content ("assertion") of such a nanopublication is minimal and can consist of just one triple. But nanopublications also come with provenance and other metadata, referring to the assertion and to the nanopublication as a whole using its Trusty URI. Resolving the nanopublication URI gives you all of this and comes with the possibility to verify the hash on it. If the data and metadata were fully separated, each nanopublication would need to be split in two, would have two identifiers that need to be resolved and two hashes that need to be checked, and the metadata could refer to the data part but not the reverse (or the other way round).

But of course I fully understand that this brings significant additional complexity to this WG's work if picked up. I'd be more than happy to put in the effort to make this happen if there is a consensus that this would be worthwhile. But I'd also completely understand if this is deemed to be out of scope. In the latter case, it would be nice if we can come up with some kind of note that would give some guidance on how the situation above can be achieved with minimal deviation from the standard.

iherman commented 1 year ago

> […] it would be nice if we can come up with some kind of note that would give some guidance on how the situation above can be achieved with minimal deviation from the standard.

I am a bit out of place here, because I am neither chair nor staff contact (despite being part of the W3C staff), only speaking as an ordinary WG member: it is perfectly possible for a WG to publish a Working Group Note on any related subject. It is only a matter of the WG agreeing to it and having people willing to pick up the work...

dlongley commented 1 year ago

The canonicalization algorithm can be used, in part, to produce an identifier for a dataset; however, you could not then insert such an identifier into the dataset and expect the canonicalization output (and thus the hash) not to change as a result.
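A tiny illustration of that point, using plain SHA-256 over N-Quads text (any hash function behaves the same way); the URIs are made up.

```python
import hashlib

quads = '<http://example.org/s> <http://example.org/p> "o" .\n'
h1 = hashlib.sha256(quads.encode("utf-8")).hexdigest()

# Embed an identifier derived from the hash into the data itself...
quads_with_id = quads + (
    f'<urn:hash:{h1}> <http://example.org/describes> <http://example.org/s> .\n'
)
h2 = hashlib.sha256(quads_with_id.encode("utf-8")).hexdigest()

# ...and the content, and therefore its hash, is no longer the same.
assert h1 != h2
```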

philarcher commented 1 year ago

Discussed 2023-05-24. The consensus was as Dave wrote above. The issue can be closed.