Representing triple origin information during Federated SPARQL querying

rubensworks commented 1 year ago

See https://github.com/w3c/rdf-ucr/wiki/Capturing-triple-origin-in-SPARQL-star for a version of this use case.

Provide sufficient information so that a member of the working group's Use Case Task Force can contact you and enhance your description so that it can be used by the working group to guide their activities. You do not have to fill out all the information requested.

** Contact information

Your name: Ruben Taelman
How to contact you: ruben.taelman@ugent.be

** Brief Description of your use case:

When executing a Federated SPARQL Query (i.e., a query across multiple SPARQL endpoints), users may want to know which sources contributed to which query results.

*** What you want to be able to do:

When executing a Federated SPARQL Query, I want to annotate triples with the source they originate from.

*** What is the role of RDF-star quoted triples in your use case:

For example, the following query could return every triple together with the URL of the source it originates from, bound to ?source.

SELECT * WHERE {
  ?s ?p ?o.
  << ?s ?p ?o >> :federatedSource ?source.
}

*** Why it is hard or impossible to do what you want to do without quoted triples:

This could be achieved using named graphs, but semantics may clash with other usages of named graphs.

*** How you want quoted triples to behave in your use case:

(For example, do you want the precise syntax of subjects, predicates, and objects in quoted triples to be important?)

N/A

*** An example RDF graph that shows part of your use case:

N/A

Similar to the "Combination of RDF-star and graph-level metadata (named graphs)" use case, this use case has the limitation that triples inside named graphs cannot be annotated. For instance, users may want to write the following, but this is not possible because RDF-star can only annotate triples:

SELECT * WHERE {
  GRAPH ?g { ?s ?p ?o }.
  << GRAPH ?g { ?s ?p ?o } >> :federatedSource ?source.
}

If extending RDF-star to named graphs is not desired, then this limitation could be worked around as follows (alternatives may be possible):

SELECT * WHERE {
  ?s ?p ?o.
  << ?s ?p ?o >> :federatedSource [ :federatedSourceUrl ?source ; :federatedSourceGraph ?g ] .
}

pfps commented 1 year ago

Can you provide an example of just what you want, including a description of the behaviour of the remote SPARQL system, the graphs that it uses, and the resulting quoted triples? From your description it seems to me that significant changes to SPARQL are required so that all the remote triples are passed back to the calling SPARQL system which then constructs a set of local triples.

Also, wouldn't this be a useful service locally so that you could see what triples a non-federated query used to generate its results?

rubensworks commented 1 year ago

> From your description it seems to me that significant changes to SPARQL are required so that all the remote triples are passed back to the calling SPARQL system which then constructs a set of local triples.

Indeed, significant changes would be required to SPARQL engines. I'm not suggesting that such functionality be included in the scope of this WG, but merely that quoted triples could open the door to adding such functionality in custom implementations in the future.

> Can you provide an example of just what you want, including a description of the behaviour of the remote SPARQL system, the graphs that it uses, and the resulting quoted triples?

Assume we have the following endpoints with datasets:

http://example.org/endpoint1/sparql:

:Alice :name "Alice".
:Alice :knows :Bob.

http://example.org/endpoint2/sparql:

:Bob :name "Bob".
:Bob :knows :Alice.

Federated query across the two endpoints:

SELECT * WHERE {
  ?personA :knows ?personB.
  ?personB :name ?name.
  << ?personB :name ?name >> :federatedSource ?sourceOfName.
}

Results:

personA	personB	name	sourceOfName
:Alice	:Bob	"Bob"	http://example.org/endpoint2/sparql
:Bob	:Alice	"Alice"	http://example.org/endpoint1/sparql
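The behaviour in this example can be mimicked with a small in-memory sketch. It is only an illustration under assumptions: the `ENDPOINTS` dictionary, the `match` and `federated_knows_names` functions, and the tuple encoding of triples are hypothetical stand-ins for real SPARQL endpoints and for a federation engine that, as noted in this thread, does not exist in this form.

```python
# Hypothetical in-memory stand-ins for the two endpoints above.
ENDPOINTS = {
    "http://example.org/endpoint1/sparql": [
        (":Alice", ":name", '"Alice"'),
        (":Alice", ":knows", ":Bob"),
    ],
    "http://example.org/endpoint2/sparql": [
        (":Bob", ":name", '"Bob"'),
        (":Bob", ":knows", ":Alice"),
    ],
}


def match(predicate):
    """Yield (source, subject, object) for every triple with the given predicate."""
    for source, triples in ENDPOINTS.items():
        for s, p, o in triples:
            if p == predicate:
                yield source, s, o


def federated_knows_names():
    """Join `?personA :knows ?personB` with `?personB :name ?name`,
    keeping the endpoint that supplied the :name triple."""
    rows = []
    for _, person_a, person_b in match(":knows"):
        for source, subject, name in match(":name"):
            if subject == person_b:
                rows.append((person_a, person_b, name, source))
    return rows


for row in federated_knows_names():
    print(*row, sep="\t")
```

The point of the sketch is that the source is attached while the triple pattern is matched, which is why the component doing the matching has to be the one that knows about :federatedSource.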

pfps commented 1 year ago

Thanks for the quick clarification.

As far as I can tell all this can be done without having quoted triples in any RDF graph. The connection to quoted triples is that the SPARQL query has quoted triples, perhaps in this form:

SELECT * WHERE {
  ?personA :knows ?personB.
  ?personB :name ?name {| :federatedSource ?sourceOfName |} .
}

My thought is that this could also be done by using some special SPARQL syntax, perhaps like:

SELECT * WHERE {
  ?personA :knows ?personB.
  ?personB :name ?name SOURCE ?sourceOfName .
}

where SOURCE is a SPARQL keyword.

Would this be fine by you?

rubensworks commented 1 year ago

Adding a custom keyword to SPARQL could indeed be an option, but such a keyword is not standardized and would cause existing parsers and engines to fail with a syntax error, which would not happen with the quoted triples approach.

pfps commented 1 year ago

I think that the failure mode with the explicit quoted triples is just as bad. No SPARQL 1.1 engine would be able to understand either the << >> or the {| |} syntax. And if there are SPARQL-star engines that understand this syntax they would not retrieve any triples unless they understood this extra built-in predicate.

That is unless you are suggesting that RDF stores include or provide source annotations for all their triples.

rubensworks commented 1 year ago

> That is unless you are suggesting that RDF stores include or provide source annotations for all their triples.

That could be an option, if explicit entailment would be preferred, but that's not the goal of this use case. Instead, it would only be the federation engine (not the separate SPARQL endpoints over which federation is happening) that would be aware of this :federatedSource predicate, and would interpret and process it.

I want to emphasize again that the above does not exist yet, it's simply a possible use case for quoted triples in future federated SPARQL engines. A keyword such as SOURCE would therefore be a viable alternative, but with the mentioned disadvantages.

pfps commented 1 year ago

OK, so only a SPARQL query engine that accepts requests for federated queries needs to be changed. But this can't be just something that passes the query off to a regular SPARQL query engine as it needs to have access to the underlying matches against RDF graphs.

pfps commented 1 year ago

How about an example with a federated construct query that constructs a triple annotated with a source? That seems to draw a closer connection to RDF-star and would probably be closer to the interests of working group members.
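As a rough idea of what the output of such a construct query might look like, the sketch below serialises matches as Turtle-star-like lines. Everything in it is an assumption for illustration: the `construct_annotated` function and the `matches` list are hypothetical, and no federated CONSTRUCT syntax is proposed or implied.

```python
def construct_annotated(matches):
    """Serialise (subject, predicate, object, source) matches as Turtle-star-like
    lines: the asserted triple, then a quoted-triple annotation naming its source."""
    lines = []
    for s, p, o, source in matches:
        lines.append(f"{s} {p} {o} .")
        lines.append(f"<< {s} {p} {o} >> :federatedSource <{source}> .")
    return "\n".join(lines)


# Hypothetical matches, as a federation engine might collect them.
matches = [
    (":Bob", ":name", '"Bob"', "http://example.org/endpoint2/sparql"),
]
print(construct_annotated(matches))
```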

TallTed commented 1 year ago

It seems that this wish must start by radically redesigning Federated SPARQL, which today works only through the SERVICE clause, which doesn't fit any of the sample queries shown above. All SPARQL engines involved in the described queries must support this new Federated SPARQL design, or the queries will produce results that are undesirable at best.

Alternatively, something similar to what @pfps has proposed, with a federated CONSTRUCT (maybe CONSTRUCT FROM SERVICE-ish?) could allow use of the new mechanisms in a "local" SPARQL processor that understands the new CONSTRUCT and passes appropriate subqueries (possibly re-written versions of the queries found in the new SERVICE clause) to remote engines which need not support the new Federated SPARQL....

afs commented 1 year ago

This is similar to recording an observation of a triple in another graph. It is different from the initial example here in that the triple may not be locally asserted (stored).

SELECT * WHERE {
  << ?s ?p ?o >> :source [ :sourceUrl ?source ; :observedAt ?dt ] .
}

rubensworks commented 1 year ago

I realize now that I wasn't explicit about the fact that I was referring to federated SPARQL query execution that includes source selection. Concretely, this allows users to write queries without SERVICE clauses, and the federation engine autonomously determines relevant sources for each part of the query. And since the user doesn't define these SERVICE clauses manually, it is therefore relevant to enable users to obtain the source information of triples, as determined by the source selection component.

TallTed commented 1 year ago

@rubensworks — It seems to me that before anyone can do much meaningful work on "representing triple origin information" in that scenario, someone(s) must adequately specify the "federated SPARQL query execution that includes source selection [which] allows users to write queries without SERVICE clauses, and the federation engine autonomously determines relevant sources for each part of the query" from which that "triple origin information" is to be gleaned.

Of particular interest to me is how the "federation engine" is to "autonomously [determine] relevant sources for each part of the query". What do you envision as the clues in the user's query, that would allow the federation engine to determine that some part(s) of the SPARQL query should be run against serverA rather than serverB?

The best clues I know of, VoID graphs, are absent on an embarrassingly high plurality of public datasets, and generally outdated where they do exist — and even if they were present, past efforts have shown them as far from equivalent to the schema mappings available (or constructible from some number of relatively cheap queries) on most table-relational (SQL-style) DBMS, which allow for dynamic query cost optimization when joining across multiple local and/or remote tables (which we put to substantial use in Virtuoso, in its VDBMS feature, only available in Enterprise Edition).

rubensworks commented 1 year ago

> in that scenario, someone(s) must adequately specify the federated SPARQL query execution that includes source selection

@TallTed This domain has been extensively studied and is well-defined within academic research. VoID descriptions are indeed one possible approach that depends on SPARQL endpoint extensions, but several zero-knowledge approaches exist that do not rely on such metadata. One such approach is FedX (which relies on ASK queries), and it is even supported in commercial systems such as GraphDB: https://graphdb.ontotext.com/documentation/10.0/fedx-federation.html
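The general idea of ASK-based source selection can be sketched as follows. This is a simplification under assumptions and not FedX's actual implementation: the in-memory `ENDPOINTS`, the `ask` and `select_sources` functions, and the use of `None` as a variable placeholder are all hypothetical, and a real engine would send SPARQL ASK queries to each endpoint over HTTP.

```python
# Hypothetical in-memory endpoints standing in for remote SPARQL services.
ENDPOINTS = {
    "http://example.org/endpoint1/sparql": {
        (":Alice", ":name", '"Alice"'),
        (":Alice", ":knows", ":Bob"),
    },
    "http://example.org/endpoint2/sparql": {
        (":Bob", ":name", '"Bob"'),
        (":Bob", ":knows", ":Alice"),
    },
}


def ask(endpoint, pattern):
    """Emulate `ASK { pattern }`: True if any triple matches; None acts as a variable."""
    return any(
        all(term is None or term == value for term, value in zip(pattern, triple))
        for triple in ENDPOINTS[endpoint]
    )


def select_sources(patterns):
    """Map each triple pattern to the endpoints whose ASK probe answered True."""
    return {
        pattern: [endpoint for endpoint in ENDPOINTS if ask(endpoint, pattern)]
        for pattern in patterns
    }


selection = select_sources([
    (None, ":knows", None),
    (":Bob", ":name", None),
])
for pattern, sources in selection.items():
    print(pattern, "->", sources)
```

The resulting mapping from pattern to endpoints is exactly the per-pattern source information that the use case would like to surface to the user.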

pfps commented 1 year ago

@rubensworks Take a look at https://github.com/w3c/rdf-ucr/wiki/Capturing-triple-origin-in-SPARQL-star and see whether it captures your use case.

TallTed commented 1 year ago

@rubensworks — "Extensively studied and ... well-defined within academic research" does not come close to what I meant by "adequately specify", which I would have hoped you would understand in this context to mean globally standardized, through W3C or similar; unencumbered by patents, license fees, etc.; and available for royalty-free use, interoperable implementation, and permissionless extension.

The single paper of academic research you pointed me to is not free to read, and even if it were, one paper is hardly enough for anything to be considered "extensively studied" nor "well-defined". (I did find other paths through which to download no-cost PDFs of that paper (provided here for others: [1], [2], [3]), but there are multiple dates in their footers, and I'm not certain which is actually the latest version, nor which version you intended.)

Similarly, FedX appears (after some but not exhaustive research, stifled in part by a lot of unintended collision with FedEx) to be a thing built into RDF4J, and not discussed much of anywhere not involving RDF4J.

rubensworks commented 1 year ago

Looks good to me, thanks @pfps!

lisp commented 1 year ago

one interpretation of this issue is that it concerns annotating sparql solutions rather than triples. in what sense is that not correct?
