w3c / rdf-ucr

https://w3c.github.io/rdf-ucr/

RDF-star for labelled property graphs #16

Open pfps opened 1 year ago

pfps commented 1 year ago

See https://github.com/w3c/rdf-ucr/wiki/RDF-star-for-labelled-property-graphs for the current status of this use case.

Taken from https://github.com/w3c/rdf-star/issues/33

** Brief Description of your use case:

As a KG vendor, we want Stardog customers to have easy-to-use means to attach properties to edges in their RDF graph or to load property graph data with edge properties. Here "easy" specifically means that neither the customer nor the database should have to wreck the data model (and queries) to use any of the workarounds available in plain RDF for that purpose (such as RDF reification).

*** What you want to be able to do:

We want to be able to easily assert properties on edges and query them using SPARQL.

We also want to enable customers to store that annotated statement in any named graph they want, so we don't want to use named graphs for representing statement-level metadata.

*** What is the role of RDF-star quoted triples in your use case:

RDF quoted triples will be the subject of properties on edges.

*** Why it is hard or impossible to do what you want to do without quoted triples:

Using RDF reification or other approaches requires changes to the data model and, particularly, complex SPARQL queries to retrieve the data.

Regarding named graphs, there's a very simple argument why we want to keep both annotated triples and named graphs. We regularly see people wondering if they should manage different parts of their data in i) separate datasets (i.e. separate physical databases inside a server instance) ii) separate named graphs inside one dataset. There are pros and cons to both. Sometimes the choice isn't clear.

So far they've been able to just take data stored in the default graph of database X and move it into a named graph inside Y. Importantly, they won't need to change queries (or apps); they only need a connection string to a different database and a different query dataset (i.e. FROM in SPARQL). The latter can be specified outside of queries, as defined in the SPARQL Protocol. Now, we don't want RDF-star to limit that flexibility: if you want to take a bunch of triples with annotations and move them into a named graph, that should be similarly easy.

*** How you want quoted triples to behave in your use case:

*** An example RDF graph that shows part of your use case:

For example, if the customer has :pavel :worksAt :Stardog edge in the data and wants to add ... :since 2011 to it, neither they nor the database should have to transform it into a bunch of different triples like [] rdf:subject :pavel ; rdf:predicate :worksAt ... (and then also rewrite queries so that ?s :worksAt :Stardog still returns :pavel).
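The requirement can be illustrated with a minimal Python sketch. It is a toy model, not Stardog's implementation: a graph is a set of tuples, a quoted triple is a tuple used in subject position, and the helper name `match` is an assumption made for illustration.

```python
# Toy model: a graph is a set of (subject, predicate, object) tuples.
# A quoted triple is simply a tuple used in subject position.
# None in a pattern acts as a wildcard (like a SPARQL variable).

def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern, sorted for stable output."""
    return sorted(
        (t for t in graph
         if (s is None or t[0] == s)
         and (p is None or t[1] == p)
         and (o is None or t[2] == o)),
        key=repr,
    )

edge = (":pavel", ":worksAt", ":Stardog")
graph = {edge}

# The original query: ?s :worksAt :Stardog
before = [t[0] for t in match(graph, p=":worksAt", o=":Stardog")]

# Annotate the edge; the edge itself stays in the graph untouched.
graph.add((edge, ":since", 2011))

after = [t[0] for t in match(graph, p=":worksAt", o=":Stardog")]
since = [t[2] for t in match(graph, s=edge, p=":since")]
```

In this model the plain pattern gives the same answer before and after the annotation is added, and the annotation is reachable through the quoted triple, which is the behaviour the use case asks for.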

As a further example we want to be able to have data like

<< :a skos:closeMatch :b >> :score 0.9

and queries like


?x a :Type1 .
?y a :Type2 .
<< ?x skos:closeMatch ?y >> :score ?score

with subsequent filtering and aggregation on ?score.

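As a rough sketch of what "filtering and aggregation on ?score" amounts to, the following Python models the query over a set of tuples. The extra data (`:c`, `:d`, score 0.5), the 0.8 threshold, and the function name `scored_matches` are assumptions added for illustration; they do not come from the use case.

```python
from statistics import mean

# Toy graph: a quoted triple is a tuple in subject position.
graph = {
    (":a", "rdf:type", ":Type1"),
    (":c", "rdf:type", ":Type1"),
    (":b", "rdf:type", ":Type2"),
    (":d", "rdf:type", ":Type2"),
    (":a", "skos:closeMatch", ":b"),
    (":c", "skos:closeMatch", ":d"),
    ((":a", "skos:closeMatch", ":b"), ":score", 0.9),
    ((":c", "skos:closeMatch", ":d"), ":score", 0.5),   # assumed extra data
}

def scored_matches(graph):
    """Yield (x, y, score) for << ?x skos:closeMatch ?y >> :score ?score
    where ?x a :Type1 and ?y a :Type2."""
    for s, p, o in graph:
        if p != ":score" or not isinstance(s, tuple):
            continue
        x, inner_p, y = s
        if (inner_p == "skos:closeMatch"
                and (x, "rdf:type", ":Type1") in graph
                and (y, "rdf:type", ":Type2") in graph):
            yield x, y, o

rows = sorted(scored_matches(graph))
strong = [(x, y) for x, y, score in rows if score >= 0.8]   # like FILTER
average = mean(score for _, _, score in rows)               # like AVG
```

The point of the sketch is that the annotated edge is addressed directly by its subject, predicate, and object, with no intermediate reification node to join through.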
ericprud commented 1 year ago

For example, if the customer has :pavel :worksAt :Stardog edge in the data and wants to add ... :since 2011 to it, neither they nor the database should have to transform it into a bunch of different triples like [] rdf:subject :pavel ; rdf:predicate :worksAt ... (and then also rewrite queries so that ?s :worksAt :Stardog still returns :pavel).

I think this is asking RDF-star to solve an apparent-non-monotonicity problem that property graphs routinely gloss over. @pchampin pointed out that the appropriate answer to the ?s :worksAt :Stardog query appears to change when you add another qualifier: ... :until 2019 (EHAVEEATCAKE error?).

Property graph architectures typically manage this by a combination of retracting assertions, putting up with awkward temporal models, and having no negative qualifiers. For example, there's a stated in (P248) but no debunked/disputed by in wikidata citation properties. This pushes negative assertions up into the trivially queryable ("truthy" in wikidata parlance) graph.

You could improve/complicate modeling by adding a negates bit to annotations and have trivial SPARQL queries filter against it. There would still be marginal cases, but at least the person adding ... :until 2019 could (remember to) click the negates box.

Alternatively, everyone could maintain their own list of negative annotation predicates, which would be doable for wikidata-like use cases where the annotation predicates (or at least the negating ones) are centrally controlled. Either of these changes the current semantics of queries, be they SPARQL, DL, or some triplesMatching API (rdf-star-plus-plus).

Note that the wontfix solution has a similar impact by telling users that they have to (remember to) write carefully-qualified SPARQL/DL/triplesMatching queries if there might be some annotation which expires or negates the assertion.

You could try to deal with expires and negates differently since in the former case the assertion was true at some point and in the latter case it probably never was but we just didn't know it.

ericprud commented 1 year ago

oops, didn't notice this was a UC&R list. should I move these comments back to https://github.com/w3c/rdf-star/issues/33 ?

rat10 commented 1 year ago

@ericprud

I think this is asking RDF-star to solve an apparent-non-monotonicity problem that property graphs routinely gloss over. @pchampin pointed out that the appropriate answer to the ?s :worksAt :Stardog query appears to change when you add another qualifier: ... :until 2019 (EHAVEEATCAKE error?).

In my understanding @pchampin didn't imply the problems with monotonicity that you refer to, rather to the contrary. But let me phrase it my way: the triple stating that Pavel works at Stardog is atemporal and insofar is true if Pavel works or worked at Stardog at any time now or in the past (or maybe even in the future, if we are very sure about that future). Further detail, added through annotation and retrieved through extra querying/filtering, doesn't change that. In that sense annotating triples is monotonic: it just adds more detail. Adding ever more detail is the most basic thing we do on the semantic web. Every detail added reduces the number of possible worlds that the global RDF graph describes. That is not a non-monotonic activity.

What would be non-monotonic is if we outright rejected a statement in the annotation. If that were possible we would indeed have to check for every result whether its "negates" box is checked. But we are in a not too different situation already with RDF-star: for every annotation we have to check if the annotated triple is indeed asserted. Granted, normally we work the other way round and check if a triple we already know about also has some annotations attached, and that's much less troublesome.

("truthy" in wikidata parlance)

I have to read up on that and maybe I don't properly understand what you refer to, but isn't almost anything we can say only truthy? To practically every fact we assert, more detail could be added. Often we can't even be totally sure. All that hasn't brought the semantic web down.

EDIT: I read up on truthiness and I guess I do now get your point. There is still the unasserted quoted triple to express statements that we want to document but not assert. I guess that's practically close enough to negated statements in most cases and doesn't jeopardize monotonicity.

rat10 commented 1 year ago

The use case description says:

  1. In LPGs there can be multiple edges with the same start node, end node, and relationship type.

and

Capturing the second difference requires a fundamental change in RDF from RDF graphs being sets of triples to RDF graphs being bags of triples.

RDF standard reification gets around this problem by not reifying the statement type but describing an occurrence/instance of that type. There is no direct connection between the two: the subject of the reification quad stands for some speech act, so to say. The RDF-star CG report takes a similar route, but makes it optional to annotate an occurrence/instance or the type itself. The latter, annotating the type, is impossible in RDF standard reification. However, both approaches provide no direct connection between a stated triple occurrence/instance and annotations on it. The annotations are either made on all statements of that type or on some instance of it.

In sharp contrast to that approach, singleton properties create a new type of property which is a subproperty of the intended relation, and thereby get around the limitations imposed by the set semantics. So singleton properties provide a hint at how this problem could be solved without requiring a fundamental change from set to bag semantics, right? It is a stretch, but if done right it could work well both on the surface and underground. The basic idea is that each singleton property is just a link between the statement and its annotations (the most important being which property the singleton property is a subproperty of). If we replace that link by some syntactic sugar, nothing of particular value is lost. Indeed, if blank nodes were allowed in predicate position, the whole singleton property approach would be a very natural way to model n-ary relations.

Long prolegomenon, I know, but see how through these eyes the following RDF-star shortcut syntax

:a :b :c {| :d 1 |} .
:a :b :c {| :e 1 |} .

is just syntactic sugar for

:a :singletonPropertyOf_b_1 :c ;
   :singletonPropertyOf_b_2 :c .
:singletonPropertyOf_b_1 :d 1 .
:singletonPropertyOf_b_2 :e 1 .

This certainly doesn't break set semantics and on the surface it captures the intuitions of users. What it doesn't allow is to annotate the type itself, but that purpose can still be served by the unabbreviated syntax.
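The desugaring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fresh-name scheme (`:b#1`, `:b#2`), the linking predicate `:singletonPropertyOf`, and the function name `desugar` are all hypothetical and not taken from any specification.

```python
from itertools import count

def desugar(annotated_edges):
    """Map (s, p, o, annotations) entries to singleton-property triples.

    Each entry gets its own fresh property, so two edges with the same
    s, p, o stay distinct even though the graph is a set.
    """
    graph = set()
    counter = count(1)
    for s, p, o, annotations in annotated_edges:
        sp = f"{p}#{next(counter)}"              # hypothetical naming scheme
        graph.add((s, sp, o))
        graph.add((sp, ":singletonPropertyOf", p))   # hypothetical predicate
        for key, value in annotations.items():
            graph.add((sp, key, value))
    return graph

# :a :b :c {| :d 1 |} .   and   :a :b :c {| :e 1 |} .
graph = desugar([
    (":a", ":b", ":c", {":d": 1}),
    (":a", ":b", ":c", {":e": 1}),
])
edges = sorted(t for t in graph if t[0] == ":a")
```

Both edges survive as separate triples and each keeps its own annotation, while the graph remains an ordinary set. A query for the plain `:a :b :c` pattern would need the extra hop through the linking predicate (or engine support that hides it).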

To make this more similar to the unabbreviated syntax let's replace the singleton property names by occurrences. Remember the CG report example (which for some reason omits to assert :s :p :o):

_:a :occurrenceOf << :s :p :o >> ;
    :in <file1.ttl> ;
    dct:creator :alice.
_:b :occurrenceOf << :s :p :o >> ;
    :in <file2.ttl> ;
    dct:creator :bob.

The same in shortcut syntax (but :s :p :o is actually asserted):

:s :p :o {| :in <file1.ttl> ; dct:creator :alice |}.
:s :p :o {| :in <file2.ttl> ; dct:creator :bob |}.

The same in singleton property-turned-occurrence syntax:

:s :occurrenceOf_p_1 :o.
:occurrenceOf_p_1  :propertyOccurrenceOf :p;
      :in <file1.ttl> ; 
      dct:creator :alice.
:s :occurrenceOf_p_2 :o.
:occurrenceOf_p_2  :propertyOccurrenceOf :p;
      :in <file2.ttl> ; 
      dct:creator :bob.

ericprud commented 1 year ago

tl;dr: don't use a temporal assertion in the UC&R; it will mislead readers about good modeling and rdf-star utility in general.

@rat10

the triple stating that Pavel works at Stardog is atemporal and insofar is true if Pavel works or worked at Stardog at any time now or in the past (or maybe even in the future, if we are very sure about that future). Further detail, added through annotation and retrieved through extra querying/filtering, doesn't change that. In that sense annotating triples is monotonic: it just adds more detail.

Agreed; I intended to convey that by calling it "apparent-non-monotonicity" in that your average user would not think of your caveat and add:

… MINUS { << ?employee :worksAt :Stardog >> :until ?expiry }

before sending out Xmas bonus checks.
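A small Python sketch of the pitfall, using a toy set-of-tuples model in which a quoted triple is a tuple in subject position. The function names `employees_naive` and `employees_current` are made up for this illustration.

```python
edge = (":pavel", ":worksAt", ":Stardog")
graph = {edge, (edge, ":since", 2011)}

def employees_naive(graph, employer):
    """?employee :worksAt employer"""
    return sorted(s for s, p, o in graph
                  if p == ":worksAt" and o == employer)

def employees_current(graph, employer):
    """Same pattern, MINUS edges that carry an :until annotation."""
    expired = {s for s, p, o in graph if p == ":until"}
    return sorted(s for s, p, o in graph
                  if p == ":worksAt" and o == employer
                  and (s, p, o) not in expired)

naive_before = employees_naive(graph, ":Stardog")
graph.add((edge, ":until", 2019))   # someone adds the qualifier later
naive_after = employees_naive(graph, ":Stardog")
current_after = employees_current(graph, ":Stardog")
```

The naive query returns Pavel both before and after the `:until` annotation is added, so the graph itself stays monotonic; only the carefully qualified query drops him, and nothing forces a query author to write it.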

I believe standard reification provides a half-way solution to this human-engineering problem in that the query

?employee :worksAt :Stardog

gets no results, forcing them to work harder structuring it as a reified statement, which may lead them to think "someone had a reason to reify this" and prowl around in the docs or the graph before writing:

SELECT ?recipient {
  ?statement
    rdf:subject ?recipient ;
    rdf:predicate :worksAt ;
    rdf:object :Stardog .
  MINUS { ?statement :until ?expiry }
}

I think that a subproperty of :worksAt is similarly a half-way solution in that the ?employee :worksAt :Stardog query still returns Pavel, but at least anyone using the subproperty might stop to think about caveats and look for clauses to include in their query.

IMO, the :worksAt assertion is attractive but misleading. A better model would be to "reify" the assertion into a domain-specific model that leads queriers in the right direction:

:Employment123
  :employed :Pavel ;
  :starting "2011"^^xsd:gYear ;
  :ending "2019"^^xsd:gYear .

For the UC&R, I'd steer towards invariants like:

<< :wrinklySkin :heritableIn :Pisum_sativum >> :positedBy :Gregor_Mendel .

(Though honestly even that was the result of some cherry-picking and has led centuries of scientists to see Mendelian behavior where the real causality is more nuanced. Still, good enough to give readers the idea.)

pfps commented 1 year ago

oops, didn't notice this was a UC&R list. should I move these comments back to w3c/rdf-star#33 ?

No, this is the place for discussion of the issue. There is a separate place (currently the Wiki page) for a clean description of the issue.

pfps commented 1 year ago

@rat10 The shorthand syntax doesn't do this occurrence stand-off. Maybe it should.

Edit: On re-reading your message I see that you aren't claiming that the shorthand syntax does a standoff, but that it could.

What the shorthand syntax

:a :b :c {| :e 1; :f 3 |} .
:a :b :c {| :e 2; :f 4 |}.

is shorthand for is

:a :b :c .
<< :a :b :c >> :e 1; :f 3  .
:a :b :c .
<< :a :b :c >> :e 2; :f 4 .

See https://www.w3.org/2021/12/rdf-star.html#example-9
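The consequence of this expansion under set semantics can be seen in a short Python sketch. It models a graph as a set of tuples with a quoted triple as a tuple in subject position; the function name `expand` is an assumption for illustration.

```python
def expand(annotated_edges):
    """Expand `s p o {| k v ; ... |}` as shown above: assert the triple
    and attach every annotation to the quoted triple."""
    graph = set()
    for s, p, o, annotations in annotated_edges:
        triple = (s, p, o)
        graph.add(triple)                    # duplicates collapse in a set
        for key, value in annotations:
            graph.add((triple, key, value))
    return graph

graph = expand([
    (":a", ":b", ":c", [(":e", 1), (":f", 3)]),
    (":a", ":b", ":c", [(":e", 2), (":f", 4)]),
])
quoted = (":a", ":b", ":c")
asserted = [t for t in graph if t == quoted]
annotations = sorted((p, o) for s, p, o in graph if s == quoted)
```

The triple is asserted once and all four annotations hang off the same quoted triple, so the pairing of `:e 1` with `:f 3` (and `:e 2` with `:f 4`) is no longer recoverable from the graph.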

pfps commented 1 year ago

This issue, as is stated at the beginning, is taken from an issue to the RDF-star community group. As such, the issue keeps the discussion from there.

If there is a place to modify the examples, it is in the Wiki page. But the question then is what to use instead of temporal or certainty factors. I don't think that changing to provenance is a good idea precisely because provenance is so far from temporal or certainty factors.

I had reached out to the original submitter but didn't hear back so I have started the process of expanding the use case using only the existing discussion. In any case, the use case is to capture LPGs and my understanding of the use of LPGs is that this kind of annotation is quite common.

ericprud commented 1 year ago

In any case, the use case is to capture LPGs and my understanding of the use of LPGs is that this kind of annotation is quite common.

I agree that it's (more) common (than it should be). I propose that it introduces unexpected usage constraints that nullify the value proposition of making the surface graph simple. If you must always remember to look for an :ending qualifier, the simplicity of the :worksAt statement is only a moral hazard. People in a small project can learn to work around these problems, but I don't want to suggest that readers of RDF-Star repeat sloppy PG models in the LOD cloud, where they will result in misleading query results and, ideally, modelling churn.

If you like the :worksAt use case more than my wrinkly old peas, at least use a predicate that steers people in the right direction, e.g. :workedAtForSomeTimeInterval.

rat10 commented 1 year ago

@ericprud The ways that statement annotation can be used are countless and so would be the number of predicates your proposal would require, like :workedAtForSomeTimeIntervalWithCertaintyDegreeAccordingToDerivedByEmbeddingetc.

@pfps In the absence of Pavel there is still that little compilation I collected from examples used in LPG literature. But you can also look at the large corpus of works trying to extend the expressivity of binary RDF in one way or another. Look alone at all the different usages that named graphs have found besides just administrative house keeping tasks and you have a good idea of what RDF-Star will be used for - not mis-used I'd like to emphasize.

The bottom line is: LPG-style modelling (with or without RDF-star) is a way to distinguish a primary topic from secondary attributes. For every n-ary relation it largely depends on the application which aspect is considered primary and which is secondary. Sure there are the usual aspects like temporality (valid now or only some time?) but by and large we can't know in advance which aspect will be considered dominant in some application. We can put up a sign saying "This is RDF-star and you will use it wrongly!" or we can adjust.

rat10 commented 1 year ago

Edit: On re-reading your message I see that you aren't claiming that the shorthand syntax does a standoff, but that it could.

Indeed (although I don't really get what you mean by "standoff"). I could go even further and say: Let's go back to the 2017 version of RDF-Star (then RDF*) in which embedded triples were asserted. That would make the shortcut syntax unnecessary. Let "unasserted assertions" either be handled by RDF standard reification or by a graph literal datatype. Let also all the use cases that require syntactic fidelity like versioning, explainable AI, etc. be handled by graph literals. Let the intricacies of syntactic blank nodes be dealt with by Concise Bounded Descriptions in graph literals. Let the semantics of embedded triples be defined as outlined above, analogous to singleton properties. Even Souri's RDFn proposal can be represented that way. This might really be a succinct and sound way to cover most demands (proper statement identifiers would be better suited to realize deeply nested structures, but let's postpone that as a syntactic issue for the moment).

ericprud commented 1 year ago

@ericprud The ways that statement annotation can be used are countless and such would be the number of predicates your proposal would require, like :workedAtForSomeTimeIntervalWithCertaintyDegreeAccordingToDerivedByEmbeddingetc.

I agree that if you have a database with a ubiquitous meta-model where all assertions have starts, ends, certainties and provenance, you don't have to roll them into your predicate name. The way the original use case was presented illustrated a common modeling pathology in PGs; someone introduces a (non-mon) annotation, changing the interpretation of the graph, without calling up every other user and saying "you now always have to look for X". Since most PGs are either private or narrowly-scoped, readers of, e.g. https://neo4j.com/developer/graph-database/ can probably delay confronting these issues without too much cost. However, when setting readers' expectations for what RDF-Star can do for them, I believe the UC&R should pick one or more of:

  1. avoid assertions that are inherently temporary (worksFor, drives, many present tense verbs)
  2. include examples that explicitly show that all surface assertions have a fixed set of revoking qualifiers (end, low-confidence)
  3. provide a non-mon example and point out that e.g. adding until after the fact was non-mon and breaks existing queries.

Tx for examples used in LPG literature; helps scope and ground the conversations! I tried to characterize each usage into persistent or temporary or whether it had a determinant like "probability":

| predicate | characterization |
| --- | --- |
| acted_in | persistent |
| CONNECTED_TO | persistent |
| drives | temporary, no revocation |
| CONNECTION | persistent |
| knows | persistent |
| Married | persistent |
| MANAGES | temporary, no revocation |
| FOUNDED | persistent |
| tickit:like | temporary, no revocation |
| tickit:friend | temporary, revoked with "endDate" |
| hasAuthor | persistent |
| booktitle | persistent |
| publishedIn | persistent |
| rdf:type | temporary, revoked with "until" |
| worksAt | temporary, no revocation |
| :worksFor | determinant "probability" |
| :birthDate | determinant "probability" |
| :hasSpouse | temporary, no revocation |
| :nationality | persistent |
| :height | persistent |
| foaf:name | persistent in most models |
| example:worksFor | temporary, no revocation |
| foaf:age | exceedingly temporary |

The AnzoGraphDB tickit:* example avoids the present tense and implies a meta-model with e.g. "endDate" annotations.

pfps commented 1 year ago

@pfps In the absence of Pavel there is still that little compilation I collected from examples used in LPG literature. But you can also look at the large corpus of works trying to extend the expressivity of binary RDF in one way or another. Look alone at all the different usages that named graphs have found besides just administrative house keeping tasks and you have a good idea of what RDF-Star will be used for - not mis-used I'd like to emphasize.

Thanks for the pointer. But it is still hard to get a good example. What is needed is something that can be correctly done in Labelled Property Graphs (so no using strings for things), that is atemporal (no dates or, maybe, only dates that don't give rise to the appearance of non-monotonicity), binary, and isn't about provenance. Perhaps the air route example would do.

rat10 commented 1 year ago

What is needed is something that can be correctly done in Labelled Property Graphs (so no using strings for things), that is atemporal (no dates or, maybe, only dates that don't give rise to the appearance of non-monotonicity)

I have argued above why I think the monotonicity argument is misguided. How many facts are eternally and absolutely true? If a statement says that Alice buys a car and a second statement adds that that car is red, does that make the first statement false? How many properties in established ontologies are specific about their temporal aspect? It is the application that controls whether the user gets only currently valid data or data that has been true at some time in space. It is the user's responsibility to check if the application works as expected or if more querying is needed.

pfps commented 1 year ago

The solution then is to submit a use case that is explicitly about temporal or other non-monotonic information.

rat10 commented 1 year ago

As outlined above IMO temporal annotations are no more non-monotonic than any other annotations. I suggest you give a principled account of what you think are non-monotonic annotations. I see no way to do this except by barring annotations that explicitly declare another statement as being universally false. That restriction is well known and easy to communicate.

Going this route would also require an idea of how to educate users of RDF-star in a succinct and unambiguous way which annotation domains are to be avoided, because that would be some very important information to give. I suspect it would be met with scepticism.

[EDIT] You can formalize what so far was rather a "prudent approach" by the CG report: RDF-star can only be used for administrative house keeping and strictly close-to-the-metal, out-of-band application specific tasks. That will however lead to three questions:

IMO it would be much more sensible to adjust in the way I outlined above: don't force users to manoeuvre around STOP-signs, but teach them how to adjust their expectations.

pchampin commented 1 year ago

I wanted to react on this particular part of the wiki page of this UC

There are two differences between LPGs and RDF that affect this mapping. (...)

  • In LPGs there can be multiple edges with the same start node, end node, and relationship type.

I can see two straightforward ways to work around this problem

I believe this satisfies the constraints of the original UC:

For example, if the customer has :pavel :worksAt :Stardog edge in the data and wants to add ... :since 2011 to it, neither they nor the database should have to transform it into a bunch of different triples like [] rdf:subject :pavel ; rdf:predicate :worksAt ... (and then also rewrite queries so that ?s :worksAt :Stardog still returns :pavel).

rat10 commented 1 year ago

I agree that changing the abstract syntax is scary and for sure can't be done by a WG without extensive prior work and without being tasked to do just that. However, I have a different opinion on what bullets we have to bite.

In :a :b :c {| rdf:LPGedge [ :d 1 ] |} . an anonymous blank node identifies the statement. In :a :b :c {& :d 1 &} . the blank node is hidden in the special {& ... &} syntax. In the unabbreviated syntax this would be

:a :b :c .
<< :a :b :c >> rdf:LPGedge _:x, _:y .
_:x :d 1 .
_:y :e 1 .

and if I'm not mistaken, that is what one would have to query for to not miss annotations on occurrences. IMO that is prohibitively complicated, especially as one can't be sure in advance which modelling decisions have been made by the creator of this data. E.g. as long as a multi-part annotation on a statement is the only one of that type, an author would be excused to not use the more involved rdf:LPGedge construct. So formulating a query would require either some guessing about the number of annotations or adding queries for both cases. That's not a very sustainable prospect.
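To illustrate the "queries for both cases" concern, here is a minimal Python sketch over a toy set-of-tuples graph in which a quoted triple is a tuple in subject position. `rdf:LPGedge` is the predicate proposed in this thread; the function name `annotation_groups` and the direct `:source` annotation are hypothetical additions for the example.

```python
def annotation_groups(graph, triple):
    """Collect annotations on `triple` in both modelling styles.

    Direct annotations (<< t >> :k v) end up in one shared group keyed
    by None; rdf:LPGedge-mediated ones are grouped per edge node.
    """
    groups = {}
    for s, p, o in graph:
        if s != triple:
            continue
        if p == "rdf:LPGedge":
            # Second hop: gather everything said about the edge node.
            groups[o] = {(p2, o2) for s2, p2, o2 in graph if s2 == o}
        else:
            groups.setdefault(None, set()).add((p, o))
    return groups

t = (":a", ":b", ":c")
graph = {
    t,
    (t, "rdf:LPGedge", "_:x"), ("_:x", ":d", 1),
    (t, "rdf:LPGedge", "_:y"), ("_:y", ":e", 1),
    (t, ":source", ":wiki"),   # hypothetical direct annotation
}
groups = annotation_groups(graph, t)
```

A consumer that does not know which style the data author chose has to handle both branches, which is the extra query burden described above.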

This proposal is just syntactic sugar for what the CG report already proposes. One could easily replace these blank nodes with proper IRIs. So the proposal uses statement identifiers, just hidden in syntax. Semantically this then maps very closely to RDF standard reification which talks about some occurrence/instance/token of an abstract statement. However, there is no direct link between a statement of the same form in some snippet of RDF and that reification (no matter if the syntax is RDF standard reification or your example above or say the RDF/XML id attribute), as there can't be. W.r.t. re-modelling this gives an edge to RDF standard reification which simply doesn't allow one to annotate a statement without first defining an identifier. Of course standard reification can't annotate types, but then again: I have yet to see a sound and concise description of which annotations are allowed in RDF-star, but many things seem to be considered not kosher - even temporal annotations!

I think the WG should look at Singleton Properties again. They provide a semantically sound way to annotate statement instances. They lack a bit w.r.t. support on the surface as the proposal wasn't bold enough to suggest an extension to the syntax, but RDF-star provides that extension...

The underlying conflict is that RDF compacts all statements of some type into one type, and in the process necessarily loses detail. This seems to be a conscious decision in the design of RDF, favoring integration over differentiation, but many users/applications do indeed find those details important. Understanding each statement instance as a subtype and supporting that in implementations, but out of view of users, would keep those details. It wouldn't break the abstract model, but it would stop breaking applications. In other words: it might be useful to bite the bullet and drop the early optimization on statement types. Instead keep them separate in the application and only merge them in query results, not too different from the way blank nodes are treated in practice: counting semantics in SPARQL, leaning and existential semantics for reasoning.

[EDIT] In RDF anything can be modeled as an n-ary relation. Imagine the above example as one:

:a nary:b_1 :c ;
   nary:b_2 :c  .
nary:b_1 rdf:type :b;
         :d 1 .
nary:b_2 rdf:type :b ;
         :e 1 .

Now define that this is what

:a :b :c  {| :d 1 |} .
:a :b :c  {| :e 1 |}  .

maps to. Add support in SPARQL to omit querying for the type.

domel commented 12 months ago

My minor comment concerns the name. Specifically "labeled". I think it's better to just use a "property graph". There are several reasons:

  1. I think the more common name is "property graph"
  2. some definitions generalize labels as unary properties (and do not use the term label).
  3. "labelled" suggests that labels are necessary (occur once or more), and some definitions (e.g. from the ISO GQL standard) allow defining PGs without labels (allow zero or more labels).

To sum up, I suggest "labelled property graphs" -> "property graphs".

pfps commented 12 months ago

The rationale for "labelled" is that nodes and edges can have labels, not that they all have to have labels. It may be that the non-labelled version is more common.