w3c / rdf-ucr

https://w3c.github.io/rdf-ucr/

Cataloguing Use Cases From The National Library Of Sweden #23

Open niklasl opened 1 year ago

niklasl commented 1 year ago

See the following wiki pages for clean versions of the scenarios for this collection of use cases:

https://github.com/w3c/rdf-ucr/wiki/RDF%E2%80%90star-for-Annotations-as-Miscellaneous-Marginalia
https://github.com/w3c/rdf-ucr/wiki/RDF%E2%80%90star-for-Detailed-Provenance-in-Cooperative-Union-Cataloguing
https://github.com/w3c/rdf-ucr/wiki/Describing-a-Union-of-Changes-to-a-Named-Graph

Contact information

Brief Description of our Use Cases

The National Library of Sweden serves the Swedish cooperative union catalog (Libris), which has different audiences both nationally and internationally. To overcome the silo effect of old technology, and to interoperate with different metadata standards, we have developed a cataloging system based on RDF, using linked vocabularies and datasets.

We have encountered a set of overlapping use cases in this catalog, based on needs for descriptive metadata and, by extension, on the projects and data pipelines that depend upon it. We believe that RDF-star may provide an effective means for dealing with these cases.

What we want to be able to do

Why it is hard or impossible to do what we want to do without quoted triples

RDF Statement reification could be used, but is unwieldy, especially in order to keep annotations coordinated with assertions. There is no syntax support for it apart from rdf:ID on predicate elements in RDF/XML.
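To make "unwieldy" concrete, here is a minimal Python sketch of what standard reification adds for one annotated statement. The tuple representation of triples, the `reify` helper, the blank node label and the `ex:` names are assumptions made for illustration; only the four `rdf:` terms come from the RDF vocabulary.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def reify(triple, statement_node):
    """Return the four standard reification triples describing `triple`."""
    s, p, o = triple
    return [
        (statement_node, RDF + "type", RDF + "Statement"),
        (statement_node, RDF + "subject", s),
        (statement_node, RDF + "predicate", p),
        (statement_node, RDF + "object", o),
    ]

# Hypothetical data: one asserted triple and one annotation about it.
asserted = ("ex:introduction-to-physics", "bf:classification",
            "ex:literature-education-physics")
graph = [asserted]                        # the assertion itself
graph += reify(asserted, "_:stmt1")       # four extra triples to describe it
graph.append(("_:stmt1", "bf:assigner", "ex:annif"))  # the actual annotation

print(len(graph))  # six triples for one assertion plus one annotation
```

Nothing ties the reification triples to the asserted triple, so an edit to the assertion has to be repeated by hand on the `rdf:subject`, `rdf:predicate` and `rdf:object` triples to keep the two coordinated.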

We use named graphs for working effectively with "record"-sized sets of facts from a single source (our system), commonly about one main entity. But for detailed provenance they are too coarse-grained. Multiple such "records" about the same thing are hard to succinctly display and edit as a combination of description sources. Also, since named graphs have no defined formal semantics (neither what the name denotes, nor what is considered in the union graph of a published dataset), formal interoperability isn't possible today.

Thus, RDF-star annotations appear to fill the gap here, but their semantics remain to be tested in practice.

Various patterns for qualification conflate metadata sources (triples as "occurrences of facts"), logical facts (the statement which has a truth value) and the events or entities that these facts conceptually describe. This is the kind of "creative modelling" that tends to lead to divergent practices and weak interoperability across applications.

If RDF-star semantics can work to clarify and unify design patterns here this would be a major argument in its favor.

What is the role of RDF-star quoted triples in our use cases

  1. Detailed provenance.
  2. Proposed facts.
  3. Additional "marginalia" of detailed or obscure facts that don't fit the fixed shape of a specific application profile.
  4. Aggregate views of historical facts.

How we want quoted triples to behave in our use cases

As far as I can see, referentially transparent (at least for annotations).

We don't need referential opacity for quoted triples since we treat owl:sameAs (and owl:differentFrom if ever used) to be about the reference, not the sense (as in Sense and reference). We are very careful about ingesting data using owl:sameAs because of that, due to the obvious risk of conflation of identity it entails.

In the same way, no opacity is needed to prevent datatype entailment on quoted triples. Any encoded, lexical representation difference is an implementation detail, and not a semantically relevant difference. ("Provenance" here is about "who said what, where", not "how (was it encoded)". The moment a quote occurs in a graph, it is expressed within that context.)

We mainly need "opacity", as in "separate worlds", between graphs, until we deem them truthful and put them in the union graph. Quotation of suggested assertions is enough; we only "let in" owl:sameAs assertions that we are certain are aliases of the exact same identity. (I'm not sure even these have to be referentially opaque in the linguistic sense; which seems to be supported by Carroll, Bizer, Hayes, Stickler - Named Graphs (2005), notably p. 6 and 7, along with this email.)
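The "separate worlds" idea can be sketched as building the union graph only from graphs that have been deemed truthful. This is a minimal Python sketch under stated assumptions: the dictionary-of-sets dataset, the `union_graph` helper and all `ex:` names are hypothetical, and it says nothing about how trust is actually decided.

```python
def union_graph(dataset, trusted):
    """Merge only the graphs named in `trusted`; other graphs stay isolated."""
    merged = set()
    for name in trusted:
        merged |= dataset[name]
    return merged

# Hypothetical dataset: one trusted record and one graph of suggestions.
dataset = {
    "ex:record1": {
        ("ex:book", "ex:title", "Introduction to Physics"),
    },
    "ex:suggestions": {
        ("ex:book", "owl:sameAs", "ex:other-book"),
    },
}

current_beliefs = union_graph(dataset, trusted={"ex:record1"})
print(current_beliefs)
```

The suggested owl:sameAs assertion stays available in its own graph, but it has no effect on the union until its graph is added to `trusted`.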

Of course, this differs from the view in the CG report, and we need to work out if our use cases would work the same in either interpretation.

Example RDF graphs that show parts of our use cases

I have added draft scenarios with example data to the wiki:

[EDIT: fixed links broken when pages were moved]

pfps commented 1 year ago

Thanks for the three use cases. I'm doing some analysis and extracting semantic implications from them. I'll add sections to the Wiki pages.

pfps commented 1 year ago

The provenance examples all appear to suffer from using two separate links to create a single relationship - created by X at time Y. Do you want to leave them as is or have them fixed up?

pfps commented 1 year ago

For the provenance use case, does it matter what the form of a literal is as long as the value is the same? That is, would you ascribe the same provenance information to :x :y 1 . and :x :y 01 .? Similarly, suppose that two IRIs denote the same thing in the universe (perhaps using owl:sameAs). Does it matter which one is in the asserted (or quoted) triple?

pfps commented 1 year ago

Similar comments and similar questions apply to the change-log use case.

niklasl commented 1 year ago

@pfps Thank you for the analysis of all scenarios; very helpful comments and questions! I'll reply per separate comment.

niklasl commented 1 year ago

The provenance examples all appear to suffer from using two separate links to create a single relationship - created by X at time Y. Do you want to leave them as is or have them fixed up?

Yes, I see that at least the examples in "Manage Classification Metadata" conflate two properties of one occurrence by putting them on the triple itself. Would you agree that this can be solved by stating one relation (like ex:occurrence or dc:source) from the annotated triple to a node (commonly blank), and put all annotation triples on that node? Like:

<introduction-to-physics> a :Text ;
    bf:classification <literature-education-physics> {|
            dc:source [ bf:assigner <annif> ;
                dc:date "2023-05-20T08:44:06Z" ]
        |} .

To what extent do the rest of the examples suffer from this (e.g. the wikidata examples)? I would like to highlight my poor modelling here (in this issue tracker), so that we can ensure that the RDF-star syntax, behaviour and documentation all come together to avoid this becoming common in the wild. I do think triples need to be types, but how this is effectively used is most important. We don't want to "hit people over the head" with semantics if the syntax leads people too easily astray.

Note that with only one relation (bf:assigner), the error isn't obvious to an untrained observer (putting the quality of the property itself aside). It is also conceivable that such a property could be defined as "an assigner of a known occurrence of a statement", sidestepping the conflation formally. Of course, that'd make such properties "too flat" for grouping detailed facts together from different sources of observation, so it would arguably be quite poor modelling; but not necessarily wrong. ("Someone's punning is another one's effective use of language.")

So again, yes, we must fix these examples (I'll gladly edit the wiki once we're in agreement); and also capture and learn from the errors so we can prevent them. I tried to make Annotations as Miscellaneous Marginalia more about this tension between "types vs. tokens/occurrences", which seems to be related to the problem of "triples vs. statements vs. events". But these cases overlap so much it was hard to disentangle them (the erroneous ones here are "illustrative" examples there).

RDF-star annotations, which are the most useful form of RDF-star for our use cases, are "dangerous" in that the affordance of the syntax perhaps makes it hard to "see" that you're annotating the statement itself, and not something like the observation that led to the assertion, i.e. the occurrence, or event (act or effect). In other words, I mean "useful" as "concise", in that the annotation syntax pairs assertions with annotations (and multiple pairs of those under one subject, thanks to the already concise Turtle syntax). It is a very powerful form of expression, so we should check its effects thoroughly before "releasing it into the wild".

niklasl commented 1 year ago

For the provenance use case, does it matter what the form of a literal is as long as the value is the same? That is, would you ascribe the same provenance information to :x :y 1 . and :x :y 01 .? Similarly, suppose that two IRIs denote the same thing in the universe (perhaps using owl:sameAs). Does it matter which one is in the asserted (or quoted) triple?

It does not matter. Yes, we would ascribe the same provenance to those two lexical forms, since they represent the same value. Same goes for two IRIs denoting the same thing. We only use owl:sameAs to claim that each assertion using one IRI would hold using the other, including quoted assertions. We want to "quote" the meaning, not the technical way it was said. Only then can we make claims about that meaning. (Cf. "This is not a pipe", which was not what Magritte "actually" said (wrote? painted?). We still (claim to) know what was meant.)
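This position (the same provenance for `1` and `01`, and for IRIs linked by owl:sameAs) can be sketched as keying annotations on value-space triples. It is a minimal Python sketch, not a description of any existing system: the tuple representation, the `canonical_term` and `annotate` helpers, the `aliases` map and the `ex:` names are assumptions made for illustration, and only xsd:integer literals are normalized.

```python
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

def canonical_term(term, aliases):
    """Return a comparable key for an IRI string or a (lexical, datatype) literal."""
    if isinstance(term, tuple):
        lexical, datatype = term
        if datatype == XSD_INTEGER:
            # "1" and "01" map to the same integer value.
            return ("value", datatype, int(lexical))
        return ("value", datatype, lexical)
    # IRIs known to be owl:sameAs aliases share one canonical IRI.
    return ("iri", aliases.get(term, term))

def canonical_triple(triple, aliases):
    return tuple(canonical_term(t, aliases) for t in triple)

aliases = {"ex:x-alias": "ex:x"}  # hypothetical, vetted owl:sameAs pair
provenance = {}

def annotate(triple, source):
    """Record `source` as provenance for the meaning of `triple`."""
    key = canonical_triple(triple, aliases)
    provenance.setdefault(key, set()).add(source)

annotate(("ex:x", "ex:y", ("1", XSD_INTEGER)), "ex:source-a")
annotate(("ex:x-alias", "ex:y", ("01", XSD_INTEGER)), "ex:source-b")

print(len(provenance))  # both annotations land on the same key
```

Both calls attach their source to a single entry, which is the transparent reading: the annotation is about what was said, not about how it was encoded.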

We want full transparency here, as far as I can see, even for this kind of detailed provenance. See the issue description section "How we want quoted triples to behave in our use cases" above for more details on this thinking.

niklasl commented 1 year ago

Similar comments and similar questions apply to the change-log use case.

The same position (full transparency) goes for the union of changes use case (perhaps surprisingly). I believe this holds since the "environment" where this (the algorithm combining the graphs) operates, is "outside of the (temporally constrained) worlds" of the historical assertions (the graphs in the old RDF document versions, no longer in the union of asserted graphs of our published dataset). The algorithm works, in a "closed world", over a dataset of neutral graphs, along with a minimal vocabulary (statedIn, retractedIn) which the algorithm "understands". Perhaps that is a "closed world"; perhaps it is some "private" semantics of named graphs? I find the option to have that quite useful, but I expect it to raise some questions. (For one, I wonder if it is thanks to the lack of semantics for named graphs that this is an "acceptable" way of thinking?)
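A rough idea of such an algorithm, as a minimal Python sketch: it walks the document versions in order and records, per triple, the version it was stated in and the version it was retracted in. Only the names statedIn and retractedIn come from the use case; the `blame` function, the record layout and the example data are assumptions for illustration and may differ from the scenario on the wiki.

```python
def blame(versions):
    """versions: ordered list of (graph_name, set_of_triples).

    Returns one record per period during which a triple was asserted.
    """
    records = []
    open_records = {}   # triples currently asserted -> their open record
    previous = set()
    for name, triples in versions:
        for t in sorted(triples - previous):      # newly stated
            rec = {"triple": t, "statedIn": name, "retractedIn": None}
            open_records[t] = rec
            records.append(rec)
        for t in sorted(previous - triples):      # no longer present
            open_records.pop(t)["retractedIn"] = name
        previous = triples
    return records

# Hypothetical history of one record-sized graph.
title = ("ex:book", "ex:title", "Introduction to Physics")
physics = ("ex:book", "ex:classification", "ex:physics")
education = ("ex:book", "ex:classification", "ex:education")

versions = [
    ("ex:v1", {title, physics}),
    ("ex:v2", {title, education}),
]

records = blame(versions)
for rec in records:
    print(rec)
```

The function treats every version as a neutral set of triples and compares them purely syntactically; nothing in it asserts the historical triples, which matches the idea of operating "outside" the worlds of the old versions.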

It is an open question whether or not the result is intended to be used in a union, default graph of our current assertions ("current beliefs"), which we share with the wider world (as published RDF documents). Rather than opting for quoted triple opacity to make these "blame graphs" publishable as is, I'd rather publish them as explicitly named graphs, and claim that they are neutral (likewise for the old versions themselves). But there is no formal way in RDF to do that. JSON-LD comes closest, I think, by stating:

Even though JSON-LD serializes RDF Datasets, it can also be used as a graph source. In that case, a consumer MUST only use the default graph and ignore all named graphs. This allows servers to expose data in languages such as Turtle and JSON-LD using HTTP content negotiation.

(Aside: related behaviour for publishing/consuming named graphs through other syntaxes was recently raised for RDFLib and Jena, respectively.)

While this may be more about named graphs, it does relate to the question of opacity, or rather contrasts it with named graphs as "enough isolation" (and as I say above in the description, even that doesn't appear to require opacity). And the resolution to this might yield the answer to https://github.com/w3c/rdf-concepts/issues/46. This could clarify that RDF triples, including quoted triples, are only about relating claims in graphs, and that RDF datasets, possibly more formalized in the future, continue to be about operational management of what is actually used/trusted/believed; and what it formally means, depending on the chosen entailment regime (and, with neutral graphs, allowing for untrusted data to be part of those operations, if so desired). Practices may relate (such as quoted triples being linked to graph names as sources), but would be explicitly orthogonal. The act of believing, while possible to describe, cannot happen within a graph.