w3c / rdf-star

RDF-star specification
https://w3c.github.io/rdf-star/

Should RDF* be just syntactic sugar on top of RDF? #37

Closed pchampin closed 3 years ago

pchampin commented 3 years ago

In other words, does RDF* need its own abstract syntax and semantics, or can it be "encoded" in standard RDF?

It largely depends on the answer to issue #22.

  1. If we want embedded triples to be referentially transparent, then they can be internally represented using, e.g., standard reification or singleton properties.
  2. If we want them to be referentially opaque, and forbid them to contain blank nodes, then they can be internally represented using specific IRIs (see #23).
  3. Otherwise, IMO, we need to somehow extend RDF semantics. However, I still see two paths here: a) either we promote RDF triples as a new kind of terms, as done in the original papers and the current version of the report, or b) we extend RDF's semantics with a built-in datatype for representing IRIs and literals, and we represent RDF triples using an adapted form of reification.

To illustrate the last bullet:

<< :alice :age 26 >> :accordingTo :bob.

could be seen as syntactic sugar for

_:stmt :accordingTo :bob.
_:stmt
    rdfx:subjectTerm "<http://example.org/alice>"^^rdfx:term;
    rdfx:predicateTerm "<http://example.org/age>"^^rdfx:term;
    rdfx:objectTerm "\"26\"^^<http://www.w3.org/2001/XMLSchema#integer>"^^rdfx:term.

There are several reasons why I believe this modelling needs a small extension to RDF semantics, but I'll develop them if we come to a point where we consider this option seriously...
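
A minimal sketch of option 3b in Python, assuming terms are plain strings in N-Triples syntax. The `rdfx:` names are the hypothetical vocabulary from the example (with `rdfx:objectTerm` assumed for the object position), and `term_literal` and `encode_embedded` are illustrative helpers, not part of any library.

```python
from itertools import count

_ids = count()  # source of fresh blank node labels


def term_literal(term):
    """Wrap the N-Triples form of a term in a (hypothetical) rdfx:term literal."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^rdfx:term'


def encode_embedded(s, p, o):
    """Return a fresh blank node and the three triples that describe << s p o >>."""
    node = f"_:stmt{next(_ids)}"
    triples = [
        (node, "rdfx:subjectTerm", term_literal(s)),
        (node, "rdfx:predicateTerm", term_literal(p)),
        (node, "rdfx:objectTerm", term_literal(o)),
    ]
    return node, triples


node, triples = encode_embedded(
    "<http://example.org/alice>",
    "<http://example.org/age>",
    '"26"^^<http://www.w3.org/2001/XMLSchema#integer>',
)
# << :alice :age 26 >> :accordingTo :bob.
for triple in triples + [(node, ":accordingTo", ":bob")]:
    print(" ".join(triple), ".")
```

Because each term is carried as an opaque literal, this sketch sidesteps the question of what the embedded terms denote; it does not show the semantic extension that the datatype itself would need.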

pchampin commented 3 years ago

DISCLAIMER: if we decided that RDF* can be encoded in standard RDF (e.g. using standard reification), that would not impose that implementers store it that way in their systems. They could still use internal optimizations, e.g. to avoid the overhead of storing additional 3 or 4 triples for every reified triple...

afs commented 3 years ago

Good issue. If we can decide this, we can move on.

On the idea of encoding in triples -- using multiple triples, and a well-formed-ness condition -- as a basic level feature:

Long ago, Jena had special code to optimize reification in storage. Jena does not have such code any longer - it was virtually unused and created a significant cost on building storage subsystems.

The trouble is not reification itself, it is the well-formed-ness condition and handling of partial reification, unfinished reification and wrong reification.

Suppose you have [] rdf:subject S ; rdf:predicate P . and then [] rdf:object O turns up a few billion triples later (same parser run): handling that is problematic. The same goes for [] rdf:subject S1 ; rdf:subject S2; ... or for a <uri> rdf:subject S1 followed later by <uri> rdf:subject S2.

The implementation has to either undo plain storage or buffer in memory in case the partial triples are completed - neither is very nice. If we want to enable lightweight RDF toolkits that work with any data, and toolkits in as many languages and technology ecosystems as possible, the added implementation cost is a factor.
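
A minimal sketch of the well-formedness problem in Python, assuming triples are simple `(subject, predicate, object)` tuples of strings. `check_reifications` is a hypothetical helper, not Jena code, and it only covers the cases named above (partial and conflicting reifications).

```python
from collections import defaultdict

REIFICATION_PROPS = ("rdf:subject", "rdf:predicate", "rdf:object")


def check_reifications(triples):
    """Classify every node that carries a reification property.

    'complete'    - exactly one value for each of the three properties
    'partial'     - at least one property missing
    'conflicting' - more than one value for some property
    """
    seen = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        if p in REIFICATION_PROPS:
            seen[s][p].add(o)

    report = {}
    for node, props in seen.items():
        if any(len(values) > 1 for values in props.values()):
            report[node] = "conflicting"
        elif all(prop in props for prop in REIFICATION_PROPS):
            report[node] = "complete"
        else:
            report[node] = "partial"
    return report


data = [
    ("_:r1", "rdf:subject", ":S"),
    ("_:r1", "rdf:predicate", ":P"),
    ("_:r1", "rdf:object", ":O"),
    ("_:r2", "rdf:subject", ":S"),
    ("_:r2", "rdf:predicate", ":P"),
    ("<uri>", "rdf:subject", ":S1"),
    ("<uri>", "rdf:subject", ":S2"),
]
print(check_reifications(data))
```

Note that the verdict for a node is only final once every triple has been seen, which is exactly the buffering cost described above.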

The same occurs with RDF lists (collections).

For lists, the encoding of a structure in RDF triples makes it a lot of work to optimize as well as handling all the "incorrect" cases; two rdf:rest, shared tails, cycles, URIs for a cons cell.

Code may still access these as RDF triples - what's SELECT (count(*) AS ?C) { ?s ?p ?o } now?

The lesson I draw is that encoding in triples while relying on assumptions, yet still being general RDF, makes implementation harder, by which I mean a significant amount of work for what is, by and large, a corner case.

If a system is specialised for a usage pattern of RDF and not the general case - different story (are there any apart from parsing RDF into OWL?).

Encoding structures and general case is not a free lunch.

At the syntax level, let alone semantic interpretation, the advantage of adding a new kind of RDF term in <<>> is that the cases that don't follow the "well formed" assumptions then simply do not occur. Robust toolkits don't have to deal with the not well-formed cases.

Then there is the utility of syntax like :a :p :o {| :source <somePlace> |}.

PS
Once upon a time, early on in DAWG (the SPARQL 1.0 working group), before it was formally called SPARQL, SPARQL had syntax for reification. It was <<?s ?p ?o>>, and <<?id ?s ?p ?o>> to bind the reification subject term. It did not find favour.

ericprud commented 3 years ago

Encoding structures and general case is not a free lunch.

At the syntax level, let alone semantic interpretation, the advantage of adding a new kind of RDF term in <<>> is that the cases that don't follow the "well formed" assumptions then simply do not occur. Robust toolkits don't have to deal with the not well-formed cases.

I was recently aggravating myself by imagining if and how much the SemWeb would be more useful if ordered, bounded collections were first class objects. IMO, their lack has cost us years and countless opportunities.

Then there is the utility of syntax like :a :p :o {| :source <somePlace> |}.

Can you say what that should mean? Does it just turn the :source arc around from <somePlace> :isSourceOf <<:a :p :o>> ?

PS Once upon a time, early on in DAWG (the SPARQL 1.0 working group), before it was formally called SPARQL, SPARQL had syntax for reification. It was <<?s ?p ?o>>, and <<?id ?s ?p ?o>> to bind the reification subject term. It did not find favour.

Not immediately; it just took 15 years.

hartig commented 3 years ago

Then there is the utility of syntax like :a :p :o {| :source <somePlace> |}.

Can you say what that should mean? Does it just turn the :source arc around from <somePlace> :isSourceOf <<:a :p :o>> ?

No, instead, see #9

pchampin commented 3 years ago

Today we had a fruitful discussion on the link between RDF* and standard reification. A straw poll followed, showing that a majority in the call would be happy with RDF* being syntactic sugar for standard reification (possibly with more constraints).

hartig commented 3 years ago

I believe that the question of whether RDF* is (or should be) just syntactic sugar for RDF reification may be confusing or misleading to some. From an implementer's perspective, "syntactic sugar" may be interpreted as something that would be replaced during parsing and not considered internally in a system. However, I guess that this is not the intention of the question. Instead, I assume that the question is more about the semantics of embedded triples rather than about requiring systems to actually convert RDF* graphs into RDF graphs.

TallTed commented 3 years ago

"However, I guess" and "Instead, I assume" don't generally lead to good results. Here, for instance, they appear to lead you in a direction with which I do not agree, and with which I believe many others in the RDF-using community will take exception.

If you're not clear on what someone means, you should ask them to confirm your interpretation and/or otherwise clarify whatever they've said.

From my perspective, syntactic sugar provides writers of a syntax with simplified ways to write complex things, where such simplified notation can always be losslessly translated to and from the complex notation. In other words, it is intimately and inseparably bound to that syntax which it is sweetening (hence, sugar).

In Turtle, one such piece of syntactic sugar allows authors to write a instead of rdf:type. The a always means rdf:type; no more, no less.

Another piece of sugar allows the use of ; to avoid repetitive rewriting of Subject terms, and the use of , to avoid repetitive rewriting of Predicate terms. Neither of these eliminates triples -- an ingesting triple store winds up with the same triples that would have been there had the Turtle author used "long-hand" to fully inscribe all the triples.

Note that neither of these pieces of sugar does anything to the model they are expressing.
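
A minimal sketch of that claim in Python: the `;` and `,` shorthand is modelled as a nested structure rather than parsed from text, and `expand` is a hypothetical helper. The example data (`:alice`, `:knows`, and so on) is made up for illustration.

```python
def expand(subject, predicate_object_lists):
    """Expand 'subject p1 o1 , o2 ; p2 o3 .' into one triple per object."""
    return [
        (subject, predicate, obj)
        for predicate, objects in predicate_object_lists
        for obj in objects
    ]


# :alice a :Person ; :knows :bob , :carol .
sugared = expand(":alice", [("a", [":Person"]), (":knows", [":bob", ":carol"])])
# 'a' is itself sugar for rdf:type
sugared = [(s, "rdf:type" if p == "a" else p, o) for s, p, o in sugared]

long_hand = [
    (":alice", "rdf:type", ":Person"),
    (":alice", ":knows", ":bob"),
    (":alice", ":knows", ":carol"),
]
print(set(sugared) == set(long_hand))
```

Both forms produce the same set of triples, so a store that ingests either one ends up in the same state.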

Assuming the { << :a :b :c >> :d :e } notation is syntactic sugar within Turtle*, which is an evolution of Turtle and not of RDF, per se, I expect that sugar most likely to be shorthand for a blank node, in one way or another.

That is, it might connote --

{ [] a             rdf:triple ;
     :d            :e ;
     rdf:subject   :a ;
     rdf:predicate :b ;
     rdf:object    :c .
}

On the other hand, it might require literals in the subject position, which, while part of non-normative Generalized RDF, is not otherwise part of the RDF TR. (See 3.1 Triples.) In this case, << :a :b :c >> would be the literal in question.

On the gripping hand, given my reading thus far, you do not really mean Turtle* to "just" or "simply" be syntactic sugar for Turtle. I think you cannot mean this, because then there would be no need for RDF*, which implies changes to the underlying model -- which require that things be much more rigorously considered than they seem to have been to date -- and which cannot simply be syntactic sugar because the model is much more than any of the syntaxes used to express it.

gkellogg commented 3 years ago

Personally, I'd like to consider using the Reification vocabulary as a reification of RDF* embedded triples, which may be useful for some implementations, but it's useful to be able to consider triples as their own entities. This differs from our recent straw-poll, and that ship may have sailed.

But, IIRC, RDF/XML reification would create a separate blank node for each reified statement for each separate use of rdf:ID on otherwise identical triples; we should clarify that the following would hold:

<< :a :b :c >> :d :e .
<< :a :b :c >> :f :g .

Results in the following:

[ a rdf:Statement;
  rdf:subject :a;
  rdf:predicate :b;
  rdf:object :c;
  :d :e;
  :f :g
] .

And not the less-lean variant:

[ a rdf:Statement;
  rdf:subject :a;
  rdf:predicate :b;
  rdf:object :c;
  :d :e
] .

[ a rdf:Statement;
  rdf:subject :a;
  rdf:predicate :b;
  rdf:object :c;
  :f :g
] .

akuckartz commented 3 years ago

Is

<< :a :b :c >> :d :e . << :a :b :c >> :f :g .

the same as

<< :a :b :c >> :d :e ; :f :g . ?

gkellogg commented 3 years ago

Is

<< :a :b :c >> :d :e . << :a :b :c >> :f :g .

the same as

<< :a :b :c >> :d :e ; :f :g . ?

It is not an entirely settled issue, but IMO << :a :b :c >> identifies a unique triple, and so its reified representation using rdf:Statement would use a single blank node subject.
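
A minimal sketch of that reading in Python, assuming a mapping to standard reification with one blank node per unique embedded triple. The `Reifier` class and its blank node labels are hypothetical, not taken from any implementation.

```python
class Reifier:
    """Expand annotations on embedded triples, reusing one node per unique triple."""

    def __init__(self):
        self._nodes = {}   # embedded triple -> blank node label
        self.triples = []  # resulting plain RDF triples

    def node_for(self, s, p, o):
        key = (s, p, o)
        if key not in self._nodes:
            node = f"_:t{len(self._nodes)}"
            self._nodes[key] = node
            self.triples += [
                (node, "rdf:type", "rdf:Statement"),
                (node, "rdf:subject", s),
                (node, "rdf:predicate", p),
                (node, "rdf:object", o),
            ]
        return self._nodes[key]

    def annotate(self, embedded, p, o):
        """Handle '<< s p o >> p o .' for one embedded triple."""
        self.triples.append((self.node_for(*embedded), p, o))


r = Reifier()
r.annotate((":a", ":b", ":c"), ":d", ":e")  # << :a :b :c >> :d :e .
r.annotate((":a", ":b", ":c"), ":f", ":g")  # << :a :b :c >> :f :g .
for triple in r.triples:
    print(*triple, ".")
```

Under this mapping the two-statement form and the `;` form yield the same six triples on a single subject; minting a fresh node on every occurrence would instead give the less-lean variant.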

TallTed commented 3 years ago

@gkellogg

This differs from our recent straw-poll, and that ship may have sailed.

Straw polls are not binding in any case. In this case, any resolutions based on that straw poll are less than 7 days old, so should be tentative pending ratification (including by being ignored) by those of the group who were not present at the last meeting (which includes me, and, I guess, you).

I'm not sure whether I agree with your description here, but I'm pretty sure I disagree with the straw poll, and I aim to join the next call. (Please bear with me, folks; I'm gradually returning to full duties after a rough year of cancer treatments and side effects [prognosis is that I should now be fully clear, with recurrence a low likelihood].)

(more coming, addressed to all)

TallTed commented 3 years ago

Sentences occur in multiple places, without being the same sentence, without carrying the same meaning or import, because of their context, including author, time, and probably other attributes.

So, too, with RDF triples (a/k/a RDF sentences).

{ :a :b :c } carries different weight and meaning when uttered by a child, than when uttered by a subject matter expert. It is thus important to be able to say that this utterance was by :Billy :age 5 in the context of a party game, and that utterance was by :MrSpock :age 55 and this other utterance was by :DrMcCoy :age 40, both the latter in the context of an official analysis (which might have been in a fictional work, or those might be aliases of undercover agents somewhere/somewhen). This is the argument of those who want to specifically identify each triple occurrence, including graph/surface, emitter, timestamp, etc.

Or is it important thus?

Perhaps it is only important to be able to say that some utterance(s) was by a child during a game, and other utterance(s) were by subject matter experts in official analyses.

This level of analysis does not require a specific identity for any of the utterances, only that each cluster of provenance triples be maintained as a cluster -- e.g., as a named graph -- each of which is a description of the same combination of { :a :b :c }.

I submit that you can cut each and every occurrence of a given simple sentence from each and every work in which it appears, shuffle them randomly, and replace them -- each landing in a different "original source" -- and nothing would change about any of their meanings or importance. The fact that this one was printed in Garamond, and this in Helvetica, and that in Times New Roman, is not important. Nor is the fact that snippet-14 originally came from book-14 and is now in book-8, and snippet-8 is now in book-9, and snippet-9 is now in book-14.

I submit that the same is true for RDF triples, and even for quads. The fourth element of those quads is not inherently important; it is only important when analyzing (or emitting) provenance or quality or similar qualities of those triples. (This is part of why the fourth element was given such short shrift in the original RDF development efforts; getting the other three right was far more important.)


@akuckartz

Is

<< :a :b :c >> :d :e . << :a :b :c >> :f :g .

the same as

<< :a :b :c >> :d :e ; :f :g . ?

I would say "yes," because the ; is just syntactic sugar that lets us avoid inscribing the identical subject entity, which here is << :a :b :c >>, itself being syntactic sugar for [] a rdf:triple ; rdf:subject :a ; rdf:predicate :b ; rdf:object :c .

pchampin commented 3 years ago

@hartig

Instead, I assume that the question is more about the semantics of embedded triples

No, I really meant "syntactic sugar" -- and not quite advocating it, by the way :wink: . I think that embedded triples as syntactic sugar work well as long as they contain only ground terms. Blank nodes (as usual), make things trickier...

pchampin commented 3 years ago

@TallTed

I would say "yes," because the ; is just syntactic sugar

Well, consider

[ a :Thing ] :p1 :o1; :p2 :o2.

It is not the same as

[ a :Thing ] :p1 :o1.
[ a :Thing ] :p2 :o2.

despite the fact that ; is syntactic sugar.

So it is legitimate to ask that question for << ... >>. My understanding of the original papers is that << ... >>, unlike [ ... ], denotes the same thing everywhere it occurs. But clearly, not everyone reads it like this -- or thinks it is a good idea.

<< :a :b :c >>, itself being syntactic sugar for [] a rdf:triple ; rdf:subject :a ; rdf:predicate :b ; rdf:object :c .

If that was the case, then your answer should be "no" rather than "yes" (because of the [] in the "unfolding") :smiling_imp:
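
A minimal sketch of the difference in Python, assuming each `[ ... ]` occurrence mints a fresh blank node, as it does in Turtle. `anon` is a hypothetical helper and the blank node labels are arbitrary.

```python
from itertools import count

_fresh = count()  # source of fresh blank node labels


def anon(*property_values):
    """Model Turtle's [ p o ; ... ]: mint a fresh blank node on every call."""
    node = f"_:b{next(_fresh)}"
    return node, [(node, p, o) for p, o in property_values]


# [ a :Thing ] :p1 :o1; :p2 :o2.
n, g1 = anon(("rdf:type", ":Thing"))
g1 += [(n, ":p1", ":o1"), (n, ":p2", ":o2")]

# [ a :Thing ] :p1 :o1.
# [ a :Thing ] :p2 :o2.
n1, part1 = anon(("rdf:type", ":Thing"))
n2, part2 = anon(("rdf:type", ":Thing"))
g2 = part1 + [(n1, ":p1", ":o1")] + part2 + [(n2, ":p2", ":o2")]

print(len({s for s, _, _ in g1}), "subject,", len(g1), "triples")
print(len({s for s, _, _ in g2}), "subjects,", len(g2), "triples")
```

The first graph describes one thing with two properties; the second describes two things, so an unfolding of << ... >> that starts with [] inherits the same split.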

TallTed commented 3 years ago

@pchampin -

Quite so. Syntactic sugar in general, and the [] expression of blank nodes in specific, is yet another example of deceptive simplicity.

Keeping track of the entities referred to by such pronouns (blank nodes) is too often a challenge. Hence why they (pronouns and blank nodes, both) should be avoided except when absolutely necessary, or when clarity is easy to achieve -- which situations do exist, just far less frequently than many seem to think.

Your example does bring me to note that { [ a :Thing ] :p1 :o1 ; :p2 :o2 . } is not one of those easy-clarity situations, though { _:a a :Thing ; :p1 :o1 ; :p2 :o2 . } is, and I believe you'll concur says the same as --

_:a a   :Thing . 
_:a :p1 :o1 .
_:a :p2 :o2 .

-- and will as long as those three triples are found in the same graph. The glory of pronouns that give no hints as to their referents (gender, plurality, am I forgetting another?, a la "he" vs "she" vs "it" vs "they")!

rat10 commented 3 years ago

@pchampin

So it is legitimate to ask that question for << ... >>. My understanding of the original papers is that << ... >>, unlike [ ... ], denotes the same thing everywhere it occurs. But clearly, not everyone reads it like this -- or thinks it is a good idea.

We should have both options. Let an embedded triple denote the same thing everywhere, as you suggest. This is a useful default as it is a semantically correct way to add facts to facts, and even if it isn't semantically correct it often still works as applications provide context and data is targeted at applications. However, if we do need to denote a specific triple - because we need to be precise or the use case depends on it - the embedded statement should be able to also identify the graph and, again only if necessary (like in the Wikidata case), even a distinct reification of the triple.

So: << :a :b :c >> denotes the same triple everywhere, like any IRI or literal, whereas << :a :b :c :g >> denotes a specific occurrence of the triple in some graph. In the Wikidata use case, where even this is not specific enough, we need to add an identifier (equaling the subject of a reification quad in RDF reification): << :a :b :c :g#id >>. This needs some effort during parsing, and query answering has to return and render more specific results appropriately. The advantage is that the least involved case is the easiest to author and query. More specific cases naturally expand on it. This shifts the burden away from the user to the application - which is how it should be. The solution you propose - adding one more indirection - moves the burden onto the user: since one can't know beforehand if an embedded triple has one or more annotations of some kind, one always needs to query both cases. Or authoring becomes more verbose by default. Neither is a very enticing prospect.

pchampin commented 3 years ago

@rat10 just to be clear (and because you made this comment in a thread about "syntactic sugar"): do you consider that << :a :b :c :g >> or << :a :b :c :g#id >> would be syntactic sugar for a more complex expression (possibly involving << :a :b :c >>)? In which case we would be in violent agreement :wink:...

afs commented 3 years ago

We know that reification is considered too verbose; we ought to address this concern.

My interpretation of "syntactic sugar" is "behaves like". That is not "exactly the same".

A system that expanded the new RDF* syntax to reification (controlled, documented mapping - e.g. one reification per unique triple term) will capture the same information and can pass the translation around.

Details may differ. Translated it responds to matching "rdf:subject" while a triple-term system does not; counting triples differs.

Mixing RDF* and existing reification is undefined if they overlap.
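
A minimal sketch of "behaves like, but not exactly the same" in Python. It assumes a triple-term system that stores an embedded triple as a nested tuple, and a translating system that expands it with one reification node per unique triple term; `to_reification` and `count_matches` are hypothetical helpers.

```python
def to_reification(graph):
    """Expand nested (embedded) triples into standard reification triples."""
    nodes = {}  # embedded triple -> blank node label
    out = []

    def term(t):
        if not isinstance(t, tuple):
            return t  # ordinary IRI, literal or blank node
        if t not in nodes:
            nodes[t] = f"_:r{len(nodes)}"
            s, p, o = (term(x) for x in t)
            out.extend([
                (nodes[t], "rdf:type", "rdf:Statement"),
                (nodes[t], "rdf:subject", s),
                (nodes[t], "rdf:predicate", p),
                (nodes[t], "rdf:object", o),
            ])
        return nodes[t]

    for s, p, o in graph:
        out.append((term(s), p, term(o)))
    return out


def count_matches(graph, predicate):
    """Count triples with the given predicate, like a '?s <predicate> ?o' pattern."""
    return sum(1 for _, p, _ in graph if p == predicate)


# << :a :b :c >> :d :e .
star_graph = [((":a", ":b", ":c"), ":d", ":e")]
reified = to_reification(star_graph)

print(len(star_graph), len(reified))
print(count_matches(star_graph, "rdf:subject"), count_matches(reified, "rdf:subject"))
```

The two graphs carry the same information, but a count over ?s ?p ?o and a match on rdf:subject give different answers depending on which representation the system holds.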

rat10 commented 3 years ago

@pchampin

@rat10 just to be clear (and because you made this comment in a thread about "syntactic sugar"): do you consider that << :a :b :c :g >> or << :a :b :c :g#id >> would be syntactic sugar for a more complex expression (possibly involving << :a :b :c >>)?

In RDF* defined as syntactic sugar for RDF standard reification the << :a :b :c :g#id >> would be syntactic sugar for

<#id> a rdf:Statement ;
    rdf:subject :a ;
    rdf:predicate :b ; 
    rdf:object :c ;
    rdfx:inGraph :g .

Some technicalities:

In which case we would be in violent agreement 😉...

No violence, please! ;-)

pchampin commented 3 years ago

In RDF* defined as syntactic sugar for RDF standard reification...

And if it was not (that is, if << :a :b :c >> was a new kind of term), would you be happy with << :a :b :c :g#id >> being syntactic sugar for:

<#id> a rdf:Statement ;
    rdfx:triple << :a :b :c >>;
    rdfx:inGraph :g .

? (which is almost twice as concise as the standard-reification based approach)

TallTed commented 3 years ago

I'm a little worried about the # fragment separator being used in this way.

I'm also wondering how @rat10 came to the conclusion that Virtuoso is a quint store. I'm pretty sure that we store S,P,O,G, i.e., quads, not quints (presumably adding a rowid of some sort?).

rat10 commented 3 years ago

@pchampin

In RDF* defined as syntactic sugar for RDF standard reification...

And if it was not (that is, if << :a :b :c >> was a new kind of term), would you be happy with << :a :b :c :g#id >> being syntactic sugar for:

<#id> a rdf:Statement ;
    rdfx:triple << :a :b :c >>;
    rdfx:inGraph :g .

? (which is almost twice as concise as the standard-reification based approach)

Yes, of course. I'm happy without any RDF standard reification syntax altogether.

@TallTed There was a blog post by Orri Erling some years ago where he asked if people wanted that internal identifier exposed. Probably some primary key, rowid of some sort. I agree that the fragment identifier may be problematic. But that should be a technicality. The question I'd like to discuss (and just illustrate with my examples) is foremost what it takes to properly address triple occurrences, something that RDF standard reification states as its purpose but fails to accomplish.

TallTed commented 3 years ago

@rat10 I think this is the blog post you meant? Events have indeed overtaken some of its content (not surprising after 11 years). The default RDF indexing, for instance, is now a 2+3 combination over the SPOG table --

  • PSOG - primary key
  • POGS - bitmap index for lookups on object value.
  • SP - partial index for cases where only S is specified.
  • OP - partial index for cases where only O is specified.
  • GS - partial index for cases where only G is specified.

-- which is sufficient for most needs. Additional indexes are sometimes created for specific deployment needs, but adding columns to the table is vanishingly rare.

There was no substantive interest in adding a rowid or the like. The PSOG primary key has apparently been sufficient for most.

pchampin commented 3 years ago

This was discussed during the call on 2021-01-15 https://w3c.github.io/rdf-star/Minutes/2021-01-15.html#item03

pchampin commented 3 years ago

There was general consensus on this issue during the call on 2021-01-15 (see previous comment), and this has been implemented in PR #88, which has been merged.

Although some aspects are still under discussion (e.g. in #101), I think this issue can be closed.