w3c / rdf-star

RDF-star specification
https://w3c.github.io/rdf-star/

Adds a clarification section for the new built-in functions #129

Closed hartig closed 3 years ago

hartig commented 3 years ago

This PR addresses #126

There is a new Section 4.4.6 with examples of the new built-in functions and a discussion of the interplay with the BNODE function and GRAPH clauses.

Notice that my plan is to move the first three examples into the overview/primer part of the report (after creating a proper structure for that part). Hence, after that, what will remain in this new Section 4.4.6 is the discussion of the special cases. However, for the moment I have put it all in this section in order to get feedback on the examples and on the discussion.

/cc @pchampin @afs @gkellogg @TallTed



hartig commented 3 years ago

@lisp can you please take a look at this new section and let me know whether it delivers what you were expecting.

lisp commented 3 years ago

i have, but i need to think about it. i was expecting to have time until friday. is that too late?

hartig commented 3 years ago

Friday is fine.

lisp commented 3 years ago

ex 12: as expressed, the presence is in the default dataset, not necessarily the default graph.

ex 13: is the bind not gratuitous? this symmetry may simplify the subsequent text, but will it not also confuse?

ex 14: also has the default dataset as target. “uses just the subject of each asserted…” perhaps call out that count and count distinct may yield different values

ex15: “The SELECT clause aggregates…” the discussion should proceed further. for instance, it could introduce the case where each graph contained the single statement

 :s :p _:blank

in an atypical case, in which the graphs contained statements which include the same blank node and the bind clause is

 BIND ( TRIPLE(?s, ?p, ?o) as ?t3)

i suspect that the aggregate count would also be 1, as ?o is bound to the same blank node. i hope i am not surprised.

given which, it could be informative to discuss the consequences of a process which operated on two initially empty datasets with an initial update operation on the order of

insert {
 graph ?g {
  ?s ?p ?node .
  ?s ?p _:node
 }
}
where {
 values ?g { <http://ex.org/1> <http://ex.org/2> }
 bind (bnode('node') as ?node)
 bind (<http://ex.org/p> as ?p)
 bind (TRIPLE(?node, ?p, 'o') as ?s)
}

exported that dataset, imported the intermediate document into the second dataset and then executed queries on each dataset on the order of

select (count(?node) as ?count) (count(distinct ?o) as ?distinctCount)
where {
 graph ?g {
  <<?node <http://ex.org/p> 'o'>> <http://ex.org/p> ?node .
 }
}

or

select (count(?node) as ?count) (count(distinct ?o) as ?distinctCount)
where {
 graph ?g {
  ?s <http://ex.org/p> ?node .
 }
}

at issue are the scope and extent of the bindings between blank node labels and the respective nodes, as well as the consequences, for the results of those queries, of a node which is not in any graph. neither is yet self-evident. for example, would the count in the second query be 1 or 0? there could be some value to explaining this.

if, as has been suggested, the answers follow from the semantics rather than from the function which this section describes, the discussion could well be elsewhere in the document, but in that case, a reference from this section would serve the readers.

afs commented 3 years ago

@lisp What is a "default dataset"? (I think the example 12 text is correct)

TallTed commented 3 years ago

@lisp

the graphs contained statements which include the same blank node

How is this scenario possible? My understanding of blank (i.e., unnamed, unidentified) nodes is that their pseudo-identification is etheric beyond the bounds of an enclosing graph or result set. They may get assigned temporary, etheric, stick-on labels bearing identifiers which persist for the duration of a query and its result set, but those identifiers cannot be used outside of that query and result set.

Yes, you might describe a node without an identifier (URI) of its own with the same predicates/attributes and objects/values in multiple graphs — but how can you or any SPARQL querent determine that the same node is being described in those graphs?

Open World says that anything unstated is unknown. This means that an attribute/predicate that is left out could have any value/object — and, if it had been included, it could be different in every one of the graphs in question, and thus the unidentified node be different.

For instance, you could have 150 graphs, each containing a description of a yellow 2015 Chevrolet Camaro, where the objects/values of all predicates/attributes therein are identical — but none of those graphs include the VIN which, by definition, is different for each Camaro that rolls off the assembly line. Unfortunately for your scenario, every description would have had a different VIN, because each of the 150 graphs actually describes a different Camaro — even though they appear to be the same based on the attributes/predicates in those graphs.

So, in your scenario, "the aggregate count" cannot "be 1" — it must be the number of graphs which contain that blank node, because that is the number of etheric, temporary blank-node "identifiers" (which as stated above aren't identifiers in the larger sense; they are not URIs, and they are only useful within the context of this result set) the SPARQL engine must assign for purposes of that SPARQL result set.

lisp commented 3 years ago

What is a "default dataset"? (I think the example 12 text is correct)

13.2, the section in the sparql 1.1 recommendation about "specifying rdf datasets", describes this as "any dataset that the query service would use if no dataset description is provided in a query."

lisp commented 3 years ago

How is this scenario possible? My understanding of blank (i.e., unnamed, unidentified) nodes is that their pseudo-identification is etheric beyond the bounds of an enclosing graph or result set. They may get assigned temporary, etheric, stick-on labels bearing identifiers which persist for the duration of a query and its result set, but those identifiers cannot be used outside of that query and result set.

the rules which govern the scope and extent of the binding of a label to a blank node are not the same as the ones which govern the extent of the node.

afs commented 3 years ago

Thank you for the clarification.

The whole query is executed against a dataset and indeed here, there is no FROM or FROM NAMED.

The text "contained in the default graph" is correct for the example query because the "?s ?p ?o" is only against the default graph (of the dataset being queried).

lisp commented 3 years ago

... unless, as we have been made aware by some users, the default graph of the default dataset happens to merge the contents of the named graphs.

which may be relevant to this issue, as it concerns the behaviour of this new class of terms with respect to the presence of their containing statements in graphs.

afs commented 3 years ago

How the default graph comes about is not of concern to SPARQL query execution.

The query engine is responsible for the abstraction of the default graph as a set of triples. Sets have a "contains" relationship and "?s ?p ?o" matches all the triples in that set.

lisp commented 3 years ago

as it can result from merging the named graphs, its origin can be of concern.

afs commented 3 years ago

It may be of concern, but then this isn't the right query to ask! Use GRAPH ?g { ... } to find the containing named graph(s).

For example 12, the text as explanation about the query matching the default graph is correct.

lisp commented 3 years ago

the concern raised in this thread is as to the behaviour of blank nodes constituent to embedded triples which are terms in statements in graphs. the proposed correction seeks

while the current phrasing is not incorrect, it may also be incomplete, for some cases which resolves the uncertainties raised.

afs commented 3 years ago

See https://www.w3.org/TR/sparql11-query/#sparqlDataset - there is one set of blank nodes for the dataset (the data) with no regard to their use in any graph.

either to exclude any possible merged graphs from consideration at this point,

SPARQL (1.0) was intentionally spec'ed to include this possibility.

It does not matter. There is a default graph. It is an RDF graph. Its origin is outside the SPARQL query specification.

to account for their effect

I don't know what this means that is not covered by the definition of an RDF dataset as referenced.

Please provide an example that can not be determined by the current text with alternative possibilities of interpretation.

lisp commented 3 years ago

the cited passage from the sparql recommendation references a definition for "merge" : https://www.w3.org/TR/rdf-mt/#defmerge . that definition describes a result which is constructed such that it comprises "equivalent graphs that share no blank nodes". it is not clear, how that resolves the concern about any "embedded blank nodes" which may be in named graphs which are merged by an implementation which merges all named graphs in the default graph in a default dataset.

afs commented 3 years ago

I am still not clear what the concern with regard to RDF-star is.

Please provide a concrete example (data+query) and say what step in the evaluation of the example in the preview is unclear.

"merge" is well defined for both individual graphs and datasets. (The discussion is in section 13.1).

You will find that several implementations provide the "merge" with "union" because they use the implementation knowledge that blank nodes across the dataset are distinct IFF they have distinct internal identifiers (not blank node labels in syntax). Jena uses UUID strings for blank nodes uniquely generated by the RDF syntax parsers.

Dydra does not support the default graph being the composite of named graphs, does it?

TriG:

PREFIX : <http://example/>
GRAPH :g1 { _:a :p :z1 }
GRAPH :g2 { _:a :p :z2 }

and we know there is exactly one blank node because TriG scopes labels to the document.

But we can not deduce that from these two files:

PREFIX : <http://example/>
GRAPH :g1 { _:a :p :z1 }

PREFIX : <http://example/>
GRAPH :g2 { _:a :p :z2 }

where a merge will have two blank nodes.

A different implementation where two graphs have been separately read in, and blank nodes are, for example, internally distinguished with system identifiers per graph, will rename the blank nodes apart on merge (e.g. prepend the graph name to the internal identifier). This is an implementation matter and not related to SPARQL because SPARQL query pattern matching does not start until there is a valid dataset, however that comes about.

The RDF1.1 reference is: https://www.w3.org/TR/rdf11-mt/#shared-blank-nodes-unions-and-merges
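
The label-scoping point in the TriG example above can be sketched roughly as follows. This is a minimal illustration only, assuming a hypothetical `parse_document` helper and a made-up tuple format for quads; it is not how Jena or any other implementation actually works. Each call to `parse_document` gets its own label scope, so `_:a` in one document maps to one node, while `_:a` in two separately parsed files maps to two.

```python
import itertools

# Global source of fresh internal identifiers (stand-in for e.g. UUIDs).
_fresh = itertools.count()

def parse_document(quads):
    """Parse one document; blank node labels are scoped to this call only.

    `quads` is a list of (graph, subject, predicate, object) tuples in which
    blank node labels are strings starting with "_:" (hypothetical format).
    """
    scope = {}  # label -> internal identifier, valid for this document only

    def term(t):
        if isinstance(t, str) and t.startswith("_:"):
            if t not in scope:
                scope[t] = ("bnode", next(_fresh))
            return scope[t]
        return t

    return {(g, term(s), p, term(o)) for g, s, p, o in quads}

def blank_nodes(dataset):
    """Collect the distinct internal blank nodes used anywhere in the dataset."""
    return {t for q in dataset for t in q
            if isinstance(t, tuple) and t[0] == "bnode"}

# One TriG document with two graphs: the label _:a denotes a single node.
one_doc = parse_document([
    (":g1", "_:a", ":p", ":z1"),
    (":g2", "_:a", ":p", ":z2"),
])

# Two separate files: the same label yields two distinct nodes.
file_1 = parse_document([(":g1", "_:a", ":p", ":z1")])
file_2 = parse_document([(":g2", "_:a", ":p", ":z2")])

print(len(blank_nodes(one_doc)))          # 1
print(len(blank_nodes(file_1 | file_2)))  # 2
```

Because the internal identifiers are already distinct, a plain set union of the two parsed files behaves like a merge in this sketch.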

lisp commented 3 years ago

I am still not clear what the concern with regard to RDF-star is. ... "merge" is well defined for both individual graphs and datasets.

The RDF1.1 reference is: https://www.w3.org/TR/rdf11-mt/#shared-blank-nodes-unions-and-merges

which passage reiterates the one cited earlier. they concern statements in graphs.

how do the various ways to treat blank nodes when performing the merge or union of graphs when constructing a target dataset apply to the blank nodes which are comprised by embedded triples which are present in statements in named graphs?

lisp commented 3 years ago

Dydra does not support the default graph being the composite of named graphs does it?

in our case, for a query which does not specify a dataset, the default dataset does not combine the named graphs into the default graph. a query which intends that combination must designate it as the source selector in the default graph clause of the dataset specification.

afs commented 3 years ago

RDF 1.1 MT section 4.1 has text not in RDF 1.0 about blank nodes that are known to be shared between graphs in a dataset.

lisp commented 3 years ago

RDF 1.1 MT section 4.1 has text not in RDF 1.0 about blank nodes that are known to be shared between graphs in a dataset.

those descriptions from the rdf semantics document have been cited directly and indirectly at several places in this thread. they describe the extent of blank nodes which are in graphs and the scope and extent of their designators.
the uncertainty which gave issue to this thread is not related to the behaviour of that category of blank node.

take, however, the case which is described in this rdf-star document as example 1. replace the term :employee38 with _:employee38. the example suggests that the new triple

<< _:employee38 :jobTitle "Assistant Designer" >>

would also not be asserted. one reading of that characterization is that this new triple is not in any graph. a similar conclusion could apply to the triples which are constructed by the built-in TRIPLE operator. one consequence of that could be that the blank node in the result triple, which was designated in that expression by _:employee38, is likewise not in any graph. if that implication is correct, then it is not self-evident how to read the descriptions from RDF 1.1 MT section 4.1 and co. so that they apply to it.

that is, the concern here is, given the reading, that the blank nodes which this operator incorporates into triples are of (or, may be transformed into) a category different from those described by "merge" and co, the text should describe and provide examples which clarify their extent and the scope and extent of their label bindings. it follows also from that reading, that the examples should include cases which construct, target and merge graphs. from that follows, in order that the examples exhibit a uniform structure, they could well uniformly describe the constitution of the target dataset on all cases, including those where it is the default dataset.

pchampin commented 3 years ago

@lisp the confusion comes from the ambiguity of the expression "to be in a graph". In your latest comment, you wrote "one reading is that (...) this new triple is not in any graph. (...) _:employee38 is likewise not in any graph. If that implication is correct ...". I consider that this implication is incorrect, precisely because "in any graph" means very different things in the two parts of this implication. Let me explain.

In standard RDF, graphs are sets of triples. A triple t is "in a graph" g iff t ∈ g. An RDF term u can not be an element of g, so when we say that u is "in the graph" g, we obviously mean something else, namely that u is used in a triple that is an element of g.

In RDF-star, triples are sometimes elements of a graph (asserted triples), and sometimes terms of that graph (embedded triples). So saying that a triple is (not) in a graph is ambiguous. We should always specify that "t is asserted in g" (formally: t ∈ g) or "t is embedded in g" (formally: t ∈ constituents(g)).

In your example above, the triple _:employee38 :jobTitle "Assistant Designer" is indeed not asserted in the graph, but it is obviously embedded in the graph. In other words, it is in the constituents of the graph, and likewise, _:employee38 is also in the constituents of that graph (because constituents is defined recursively).

Granted, whenever the RDF spec says something about the "terms in a graph", one may wonder how this translates to RDF-star: does this concern only the terms appearing as subject, predicate and object of some asserted triple, or does it concern all the constituent terms? I argue that the latter interpretation should be preferred, because the former creates an unspecified gray area of terms that are neither "in the graph", nor totally "out of it" (at least, not as much "out of it" as terms that are not even in the constituent terms).
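
The distinction between "asserted in g" and "embedded in g" drawn above can be sketched as follows. This is a rough model under stated assumptions, not the spec's formal definition: the `Triple` class and both helper functions are hypothetical, terms are plain strings, and the `:accordingTo :employee22` statement is only an assumed shape for the asserted triple that embeds the one from the discussion.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """A triple whose subject or object may itself be a Triple (embedded)."""
    s: object
    p: object
    o: object

def constituents(term):
    """Return the term itself plus, if it is a triple, the constituents of its parts."""
    found = {term}
    if isinstance(term, Triple):
        for part in term:
            found |= constituents(part)
    return found

def graph_constituents(graph):
    """All terms (including embedded triples) reachable from the asserted triples."""
    found = set()
    for asserted in graph:
        for part in asserted:
            found |= constituents(part)
    return found

embedded = Triple("_:employee38", ":jobTitle", "Assistant Designer")
# Assumed asserted statement that uses the embedded triple as its subject.
g = {Triple(embedded, ":accordingTo", ":employee22")}

print(embedded in g)                            # False: not asserted in g
print(embedded in graph_constituents(g))        # True: embedded in g
print("_:employee38" in graph_constituents(g))  # True: found by recursion
```

Under this reading, the blank node is among the constituents of the graph even though the triple that contains it is not an element of the graph.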

lisp commented 3 years ago

whenever the RDF spec says something about the "terms in a graph", one may wonder how this translates to RDF-star: does this concern only the terms appearing as subject, predicate and object of some asserted triple, or does it concern all the constituent terms?

this is the question which i have been posing for more than a week. and, while my uncertainty as to how to manage blank nodes which are in a graph while their containing triple is not, may lead me to pose the question, that is a separate question. whatever its answer may be, the concern remains, that the description of an operator defined to yield such a result should either include the answer or identify its location.

afs commented 3 years ago

There could be clarification that RDF merge, and the blank node naming apart, applies to embedded triples.

Personally, I think the language in MT 1.1, while not designed for RDF-star, says that. Renaming keeps the shape of the graph (isomorphism), so renaming also applies to terms (blank nodes) in embedded triples.
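
A minimal sketch of what "naming apart applies to embedded triples" could look like, assuming triples are modelled as nested tuples and blank nodes as `_:`-prefixed label strings. The function names and the suffix scheme are purely illustrative and not taken from any implementation or from the spec text.

```python
def rename_apart(term, suffix):
    """Rename blank node labels, descending into embedded triples (nested tuples)."""
    if isinstance(term, tuple):
        return tuple(rename_apart(part, suffix) for part in term)
    if isinstance(term, str) and term.startswith("_:"):
        return term + suffix
    return term

def merge(*graphs):
    """Merge graphs from separate label scopes, renaming each one's blank nodes apart."""
    merged = set()
    for i, graph in enumerate(graphs):
        merged |= {rename_apart(t, f"_g{i}") for t in graph}
    return merged

# Two graphs read separately; each embeds a triple that uses the label _:a.
g1 = {(("_:a", ":p", ":z1"), ":source", ":doc1")}
g2 = {(("_:a", ":p", ":z1"), ":source", ":doc1")}

print(len(g1 | g2))        # 1: plain union treats the label _:a as shared
print(len(merge(g1, g2)))  # 2: merge renames the embedded blank nodes apart
```

The renaming changes labels only, so each renamed graph keeps the same shape as the original, which is the isomorphism point made above.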

pchampin commented 3 years ago

@lisp now that we have an explicit text about merge in the spec, do you consider the content of this PR to be sufficient?

lisp commented 3 years ago

that sentence is sufficiently definitive and that it be there was the goal of the objection.

pchampin commented 3 years ago

this was discussed during today's call https://w3c.github.io/rdf-star/Minutes/2021-04-09.html#x069

@hartig I propose that we merge this PR as is for the next public draft. Moving the first examples into the Overview can be done afterwards (see also #155).

hartig commented 3 years ago

@pchampin sounds good. Thanks!