Closed: Aklakan closed this issue 3 years ago
In general: any approach that creates rooted graph fragments should not conflict with correlated joins such as #100. For example, for the use case "for each vendor (that matches certain conditions) yield 5 products", the model of SPARQL result sets as multisets of bindings still seems adequate.
I don't get this use case TBH... Anything that pretends RDF is something other than just triples will eventually lead to some kind of mismatch IMO.
To overcome these limitations, people have used conventions such as naming a variable ?START, or assuming that the first variable in the CONSTRUCT query's template is the start node.
So it is not possible to save a SPARQL query to a file and later use it in an application as a template for operating on the set of resources it describes. Yes, I can add the metadata using any custom approach - but then it simply is not SPARQL anymore!
I have set up a working example to demonstrate the use case using the data in publications.ttl and the query in publications.sparql
The following code example extends the standard SPARQL query model so that the graph fragments and starting points are explicitly declared in the query, allowing an application to e.g. immediately start displaying publications and their author lists. Note that the returned graph fragments are plain RDF, so further statements can be added.
As a final remark, note that the SPARQL query uses variables in predicate positions (it's just SPARQL after all) - something GraphQL does not support - and yet with this approach a subsequent application-dependent schema view can be provided that immediately allows traversal of the returned graph fragment.
I hope my descriptions make the use case clearer.
I'm afraid your example is incomplete. The SPARQL query does not validate to start with. Can you please provide:
So that I can try this with Jena's CLI.
Isn't this related to #73?
The SPARQL query does not validate to start with.
My bad, the prefixes are added programmatically in the example code. I am trying to elaborate:
The DESCRIBE ?s
#73 is somewhat similar, as it designates a specific variable of the query to be of special interest, but DESCRIBE would match a set of triples that is not further specified.
So for the example data
:pub1 a :Publication ; :hasAuthor :Anna, :Bob .
:pub2 a :Publication . # No authors given so won't match in the query below
CONSTRUCT WHERE { ?x a :Publication ; :hasAuthor ?a }
With the response in e.g. Turtle:
:pub1 a :Publication ;
    :hasAuthor :Anna, :Bob .
CONSTRUCT WHERE { ?x a :Publication ; :hasAuthor ?a }
PARTITION BY ?x
ROOTED IN ?a
In XML (just for the sake of example) it might look something like this:
<sparql-partitions>
<partition>
<!-- The binding that acts as the key for the partition (may mention multiple variables) -->
<partition-key><binding name="x">:pub1</binding></partition-key>
<!-- The graph created by the construct template for all bindings that were compatible with the partition key-->
<graph>
<!-- triple content payload; actually in RDF/XML -->
:pub1 a :Publication ;
    :hasAuthor :Anna, :Bob .
</graph>
<!-- The set of RDF terms the ROOTED IN variable (or blank node) was mapped to during template instantiation of the graph -->
<roots><root>:Anna</root><root>:Bob</root></roots>
</partition>
</sparql-partitions>
In JSON:
{
"partitions": [{
"key": {"s": {"@id": ":pub1" } },
"graph": "the graph probably in JSON-LD",
"roots": [{ "@id": ":Anna"}, { "@id": ":Bob"}]
}]
}
Conceptually this is a post-processing of a set of solution bindings, just like CONSTRUCT - so I see it as an extension of it.
Applications that understand the result format are enabled to iterate the root nodes and access their properties within the graph fragment.
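To make the intended post-processing concrete, here is a minimal sketch in plain Python. None of this is a real SPARQL engine API: the binding dicts, the `partition` helper, and the template encoding (variable slots as bare names such as `"x"` and `"a"`) are all made up for illustration.

```python
# Illustrative sketch only: bindings are modelled as plain dicts and
# the CONSTRUCT template as a list of triple patterns whose variable
# slots are given by bare names such as "x" and "a".
from collections import OrderedDict

def partition(bindings, template, partition_vars, root_var):
    """Group bindings by the PARTITION BY key, instantiate the template
    for each binding, and collect the values of the ROOTED IN variable."""
    partitions = OrderedDict()
    for b in bindings:
        key = tuple(b[v] for v in partition_vars)
        p = partitions.setdefault(key, {
            "key": dict(zip(partition_vars, key)),
            "graph": set(),   # the instantiated graph fragment
            "roots": set(),   # RDF terms the root variable mapped to
        })
        for s, pred, o in template:
            # A slot that is a variable name is substituted from the
            # binding; constants pass through unchanged.
            p["graph"].add((b.get(s, s), b.get(pred, pred), b.get(o, o)))
        p["roots"].add(b[root_var])
    return list(partitions.values())

# Bindings for: CONSTRUCT WHERE { ?x a :Publication ; :hasAuthor ?a }
bindings = [{"x": ":pub1", "a": ":Anna"},
            {"x": ":pub1", "a": ":Bob"}]
template = [("x", "rdf:type", ":Publication"),
            ("x", ":hasAuthor", "a")]

result = partition(bindings, template, partition_vars=["x"], root_var="a")
# -> one partition keyed on :pub1 with roots :Anna and :Bob
```

The output corresponds to the XML/JSON serializations above: per partition key there is one graph fragment plus the set of root nodes from which an application would start traversal.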
I gave a bit more thought to how my proposal would interact with #73, because #73 would allow multiple CONSTRUCT WHERE clauses - and then one would want to partition across several WHERE clauses.
This would be possible by essentially stating "partition by the values for variables ?x ?y for this WHERE clause but interleave/merge them with all others if ?x was renamed to ?a and ?y to ?b".
:foobar
  :v "A" ;
  :w "B" ;
  :x "A" ;
  :y "B" .
CONSTRUCT {?r a :Foo }
ROOTED IN ?r
WHERE { ?r :v ?v ; :w ?w }
PARTITION BY ?v ?w INTERLEAVE ?a ?b
CONSTRUCT {?s a :Bar }
ROOTED IN ?s
WHERE { ?s :x ?x ; :y ?y }
PARTITION BY ?x ?y INTERLEAVE ?a ?b
The result is a single partition:
[{
"key": {"a": {"type": "literal", "value": "A"}, "b": {"type": "literal", "value": "B" }},
"graph": ":foobar a :Foo, :Bar",
"roots": [{"type": "uri", "value": ":foobar"}]
}]
Execution could then group all WHERE clauses that are interleaved on the same variables, apply ORDER BY on the partition variables, and perform a merge of the bindings (as in a merge sort) to create the partitions.
Without INTERLEAVE, each WHERE clause would produce its partitions independently (i.e. no interleaving of the bindings with any other WHERE clause).
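A sketch of how INTERLEAVE could be evaluated, under the assumed semantics above: each clause contributes its bindings together with a rename map from its PARTITION BY variables to the shared interleave variables. For brevity this groups in a hash map instead of performing the ordered merge; `interleave_partitions` and the rename maps are hypothetical names.

```python
# Hypothetical evaluation of INTERLEAVE: rename each clause's partition
# variables to the shared interleave variables, then collect compatible
# keys into a single partition (hash grouping instead of ordered merge).
def interleave_partitions(clauses, interleave_vars):
    merged = {}
    for bindings, rename in clauses:
        for b in bindings:
            rb = {rename.get(k, k): v for k, v in b.items()}
            key = tuple(rb[v] for v in interleave_vars)
            merged.setdefault(key, []).append(rb)
    return merged

# PARTITION BY ?v ?w INTERLEAVE ?a ?b
clause1 = ([{"r": ":foobar", "v": "A", "w": "B"}], {"v": "a", "w": "b"})
# PARTITION BY ?x ?y INTERLEAVE ?a ?b
clause2 = ([{"s": ":foobar", "x": "A", "y": "B"}], {"x": "a", "y": "b"})

parts = interleave_partitions([clause1, clause2], ["a", "b"])
# -> a single partition keyed ("A", "B") holding bindings of both clauses
```

After this grouping, each clause's CONSTRUCT template would be instantiated over its own bindings of the partition, yielding the merged graph fragment from the example.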
What is still missing is a specification of the order of partitions. For example: for each publication, give me the rooted partitions sorted by publication year. This could be accomplished by specifying a sort key for the partitions.
For a single CONSTRUCT query this is straightforward:
Given this data
PREFIX eg:  <http://www.example.org/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
eg:pub1 a dct:BibliographicResource ; dct:created "2015-10-10"^^xsd:date .
eg:pub2 a dct:BibliographicResource ; dct:created "2018-03-03"^^xsd:date .
and this query
CONSTRUCT WHERE { ?x a dct:BibliographicResource ; dct:created ?date }
ROOTED IN ?x
PARTITION BY ?x
ORDER PARTITIONS BY DESC(MAX(?date))
(an aggregation expression is needed to resolve multiple values for a sort key within a partition)
the translation would be as follows (extra variables inserted for clarity; the name ?partitionVar1 should indicate that in general there could be more than one partition variable):
SELECT DISTINCT ?sortKey ?partitionVar1 ?x ?date {
{ SELECT ?partitionVar1 (MAX(?date) AS ?sortKey) {
{ ?x a dct:BibliographicResource ; dct:created ?date }
BIND(?x AS ?partitionVar1)
} GROUP BY ?partitionVar1 }
# CORRELATED JOIN here perhaps?
{ ?x a dct:BibliographicResource ; dct:created ?date }
FILTER(?x = ?partitionVar1)
} ORDER BY DESC(?sortKey) ?partitionVar1
With the bindings
(?sortKey="2018-03-03"^^xsd:date, ?partitionVar1=:pub2, ?x=:pub2, ?date="2018-03-03"^^xsd:date)
(?sortKey="2015-10-10"^^xsd:date, ?partitionVar1=:pub1, ?x=:pub1, ?date="2015-10-10"^^xsd:date)
Now one just has to iterate all consecutive bindings with the same values for the partition variables for each sort key and instantiate the CONSTRUCT template.
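The grouping over consecutive ordered bindings can be sketched with `itertools.groupby`. The dict encoding of bindings is illustrative, not a concrete result-set API; the rows mirror the two bindings shown above and are assumed to arrive already ordered by the rewritten query.

```python
# Sketch of the streaming step: the rewritten SELECT query delivers the
# bindings ordered by sort key and partition variables, so each run of
# consecutive rows with equal partition values forms one partition.
from itertools import groupby

bindings = [  # already ordered by DESC(?sortKey), ?partitionVar1
    {"sortKey": "2018-03-03", "partitionVar1": ":pub2",
     "x": ":pub2", "date": "2018-03-03"},
    {"sortKey": "2015-10-10", "partitionVar1": ":pub1",
     "x": ":pub1", "date": "2015-10-10"},
]

partitions = []
for part_key, rows in groupby(bindings, key=lambda b: b["partitionVar1"]):
    rows = list(rows)
    # Instantiate the CONSTRUCT template for every row of the partition.
    graph = [(b["x"], "rdf:type", "dct:BibliographicResource") for b in rows]
    graph += [(b["x"], "dct:created", b["date"]) for b in rows]
    partitions.append({"key": part_key,
                       "roots": {b["x"] for b in rows},
                       "graph": graph})
# -> partitions arrive in sort-key order: :pub2 first, then :pub1
```

Because the input is ordered, the whole result set never needs to be materialized: partitions can be emitted as soon as the partition key changes.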
W.r.t. #73, the sort key declaration in the example above would have to change a bit so that the sort key can be stated independently for each WHERE clause. Also, in order to allow uniform ordering over multiple partitions, I think the sort key would have to be restricted to a single expression - i.e. the evaluation of the sort key yields a single concrete RDF term.
CONSTRUCT WHERE { ?x a dct:BibliographicResource ; dct:created ?date }
ROOTED IN ?x
PARTITION BY ?x
SORTKEY ?date
ORDER PARTITIONS BY DESC(MAX(SORTKEY))
Because then it would be possible to generalize partitioning over multiple involved result sets:
CONSTRUCT WHERE { ?x a dct:BibliographicResource ; dct:created ?date }
ROOTED IN ?x
PARTITION BY ?x INTERLEAVE ?int SORTKEY ?date
CONSTRUCT WHERE { ?foo ... ?bar }
ROOTED IN ?foo
PARTITION BY ?foo INTERLEAVE ?int SORTKEY ?bar
ORDER PARTITIONS BY DESC(MAX(SORTKEY))
This would expand to the following SPARQL 1.1 query:
SELECT DISTINCT ?sortKey ?int ?x ?date ?foo ?bar
WHERE {
{ SELECT ?int (MAX(?sortKeyContrib) AS ?sortKey) {
  { { ?x a dct:BibliographicResource ; dct:created ?date }
    BIND(?x AS ?int)
    BIND(?date AS ?sortKeyContrib) }
  UNION
  { { ?foo ... ?bar }
    BIND(?foo AS ?int)
    BIND(?bar AS ?sortKeyContrib) }
} GROUP BY ?int }
{
  { ?x a dct:BibliographicResource ; dct:created ?date }
  BIND(?x AS ?int)
}
UNION
{
  { ?foo ... ?bar }
  BIND(?foo AS ?int)
}
} ORDER BY DESC(?sortKey) ?int
This would now yield the bindings in an order such that the aggregation of partitions just has to consider consecutive bindings. Things get pretty ugly, however, when the partition keys have different lengths.
I removed SELECT from the title because, even though the partitioning and sorting happen on the binding level, the concept as a whole is only useful in the context of CONSTRUCT queries.
The use case that @Aklakan is trying to describe is quite simple, and I think it will be useful for CONSTRUCT queries. I will try to simplify it so everybody can understand.
Imagine you have a simple CONSTRUCT query with a very simple BGP.
CONSTRUCT { ?s :whatever ?o . ?o :something ?y } ...
The problem with the query above is that you cannot guarantee the order in which the matches for the second triple pattern will appear in the result set. They may come last (imagine a result set with 1M entries). With the specification we have today, a user may have to fetch the entire result set before being able to use the information from this graph pattern. In other words, a user may be interested in iterating through graph fragments built using CONSTRUCT queries. Notice that sorting by one of those variables will not solve the problem:
e.g. sorting by ?s may place the triples starting with ?o at the end of the result set, and vice versa.
This problem does not occur with SELECT queries, as you can control the binding variables in the projected view.
Apologies; I rephrased the idea in #128 which should be much clearer in presentation and a better base for discussion.
Why?
The SPARQL CONSTRUCT query form is triple-centric; SPARQL so far lacks a resource-centric query form. Resource-centric views provide an explicit starting point for traversal of an RDF graph fragment and ease the mapping to object graphs (assuming the starting resource corresponds to a starting object). A partitioning mechanism should thus act as a middle ground between a single binding and a complete result set, so that sets of bindings that belong together can be converted to a graph fragment.
Previous work
I am not aware of any systematic approach; many have developed custom workarounds for this problem, e.g. using the first variable in the subject position as the starting point, or providing the starting point externally.
Proposed solution
There are three concepts involved:
In syntax it might look like this:
The 'result set' would for each partition hold the graph fragment and the set of nodes in that fragment that serve as starting points. Possibly also the partition key as a binding:
In contrast to e.g. GraphQL this approach does allow for the use of variables in predicate positions when specifying the graph fragments to extract.
I have sketched the approach with more practical implications in this document.
Considerations for backward compatibility
The described approach introduces new keywords that extend the SPARQL 1.1 query model with control for creating rooted graph fragments - there is no change to existing semantics.