Extended CONSTRUCT query forms to declare graph fragments with a starting point

Aklakan commented 3 years ago

Why?

The SPARQL CONSTRUCT query form is triple-centric. SPARQL so far lacks a resource centric query form. Resource centric views provide an explicit starting point for traversal of an RDF graph fragment and ease the mapping to object graphs (assuming the starting resource corresponds to a starting object). Thereby a partitioning mechanism should act as a middle ground between a single binding and a complete result set so that sets of bindings that belong together can be converted to a graph fragment.

Previous work

I am not aware of any systematic approach; many have developed custom solutions as workarounds for this problem. E.g. use the first variable in the subject position as the starting point or provide the starting point externally.

Proposed solution

There are 3 concepts involved

partitioning of a select result sets by a given set of variables in order to group together all bindings/triples belonging to a resource's graph fragment.
designating an rdf term of a construct template as a root (a starting point for traversal of the graph fragmenet)
declaring composite keys for blank nodes in terms of variables in the WHERE clause. If no declaration is given, a fresh blank node is allocated for each binding just like in SPARQL 1.1

In syntax it might look like this:

CONSTRUCT {
  _:X a sosa:Observation ;
    :city ?city ;
    :day ?day ;
    :avgTemp ?avgTemp .
    .
  ?city rdfs:label ?cityLabel .
}
KEY _:X (?city ?day) # allocate the same bnode for the same composite key
WHERE {
 { SELECT ?city ?day (AVG(?temperature) AS ?avgTemp) { ... } }
 ?city rdfs:label ?cityLabel
}
PARTITION BY (?city ?day)
ROOTED IN _:X # single rdf term occurring in the template and/or the where pattern (template may be empty)

The 'result set' would for each partition hold the graph fragment and the set of nodes in that fragment that serve as starting points. Possibly also the partition key as a binding:

Key (?city day)	Graphs (1 per partition)	Roots (sequence of RDF terms)
(?city = :Leipzig, ?day = 30)	_:leipzig-30 :city :Leipzig; :day 30; :avgTemp 10	(_:leipzig-30)

In contrast to e.g. GraphQL this approach does allow for the use of variables in predicate positions when specifying the graph fragments to extract.

I have sketched the approach with more pratical implications in this document.

Considerations for backward compatibility

The described approach introduces new keywords that extend the SPARQL1.1 query model with control for creating rooted graph fragments - there is no change in existing semantics.

Aklakan commented 3 years ago

In general: Any approach that creates rooted graph fragments should not conflict with correlated joins such as #100 For example it seems that for the use case "for each vendor (that matches certain conditions) yield 5 products" the model for sparql result sets as multitsets of bindings is still adequate.

namedgraph commented 3 years ago

I don't get this use case TBH... Anything that pretends RDF is something else than just triples will eventually lead to some kind of mismatch IMO.

Aklakan commented 3 years ago

With OWL I can create a class expression that intensionally describes a set of resources.
With SPARQL I can create a construct query that describes a set of triples - but there is no way to express in the sparql query itself what the set of matching resources is, for which the graphs are returned.

To overcome this limitations, people have used conventions such as naming a variable ?START - or assuming the first variable in the construct query's template as the start node.

So it is not possible to save a SPARQL query to a file and later use it in an application as a template for operating on the set of resources it describes. Yes, I can add the metadata using any custom approach - but then it simply is not sparql!

I have set up a working example to demonstrate the use case using the data in publications.ttl and the query in publications.sparql

This following code example extends the standard SPARQL query model such that the graph fragments and starting points are explicitly declared in the query so that an application can immediately e.g. start displaying publications and their author lists. And note that the returned graph fragements are plain RDF and further statements can be added.

MainEntityQuery.java

As a final remark, note that the SPARQL query uses variables in the predicate positions (its just sparql after all) - this is something graphql does not support - and yet using this approach a subsequent application-dependent schema view can be provided that immediately allows traversal of the returned graph fragment.

I hope my descriptions make the use case clearer.

namedgraph commented 3 years ago

I'm afraid your example is incomplete. The SPARQL query does not validate to start with. Can you please provide:

a working query
the current result, if any
the desired result

So that I can try this with Jena's CLI.

namedgraph commented 3 years ago

Isn't this related to #73?

Aklakan commented 3 years ago

The SPARQL query does not validate to start with.

My bad, the prefixes are added programmatically in the example code. I am trying to elaborate:

The DESCRIBE ?s #73 is somewhat similar as it designates a specific variable of the query to be of special interest, but DESCRIBE would match a set of triples that is not furthere specified.

So for the example data

:pub1 a :Publication ; :hasAuthor :Anna, :Bob . 

:pub2 a :Publication . # No authors given so won't match in the query below

What works now is to do

CONSTRUCT WHERE { ?x a :Publication ; :hasAuthor ?a }

With the response in e.g. turtle:

:pub1 a Publication;
    :hasAuthor :Anna, :Bob

For a query such as the one below the response would have to be a set of graph partitions. Let b be a SPARQL binding, G an RDF graph and T a set of concrete RDF terms (iris U bnodes U literals) then a single partition is a tree tuple of (b, G, T) as shown below:

CONSTRUCT { ?x a :Publication ; hasAuthor ?a } }
PARTITION BY ?x
ROOTED IN ?a

In XML (just for the sake of example) it might look something like this:

<sparql-partitions>
  <partition>
    <!-- The binding that acts as the key for the partition (may mention multiple variables) -->
    <partition-key><binding name="x">:pub1</binding></partition-key>

    <!-- The graph created by the construct template for all bindings that were compatible with the partition key-->
    <graph>
      <!-- triple content  payload; actually in RDF/XML -->
        :pub1 a Publication;
           :hasAuthor :Anna, :Bob 
   </graph>

    <!-- The set of RDF terms the ROOTED IN variable (or blank node) was mapped to during template instantiation of the graph -->
    <roots><root>:Anna</root><root>:Bob</root></roots>
  </partition>
</sparql-partitions>

In JSON

{
  "partitions": [{
    "key": {"s": {"@id": ":pub1" } },
    "graph": "the graph probably in JSON-LD",
    "roots": [{ "@id": ":Anna"}, { "@id": ":Bob"}]
  }]
}

Conceptually this is a post processing of a set of solution bindings just like CONSTRUCT - so I see it as an extension of it.

Applications that understand the result format are enabled to iterate the root nodes and access their properties within the graph fragment.

Aklakan commented 3 years ago

I gave a bit more thought about how my proposal would work with #73 because it would allow multiple CONSTRUCT WHERE clauses. Because then one would want to partition across several WHERE clauses.

This would be possible by essentially stating "partition by the values for variables ?x ?y for this WHERE clause but interleave/merge them with all others if ?x was renamed to ?a and ?y to ?b".

:foobar
  :v "A"
  :w "B"
  :x "A"
  :y "B"

CONSTRUCT {?r a :Foo }
ROOTED IN ?r
WHERE { ?r :v ?v ; :w ?w }
PARTITION BY ?v ?w INTERLEAVE ?a ?b

CONSTRUCT {?s a :Bar }
ROOTED IN ?s
WHERE { ?s :x ?x ; :y ?y }
PARTITION BY ?x ?y INTERLEAVE ?a ?b

The result is a single partition

[{
  "key": {"a": {"type": "literal", "value": "A"}, "b": {"type": "literal", "value": "B" }},
  "graph": ":foobar a :Foo, :Bar",
  "roots": [{"type": "uri", "value": ":foobar"}]
}]

Execution could then group all where clauses that are interleaved on the same variables apply order by on the partition variables and perform a merge of the bindings (like in a merge sort) and create the partitions.

Without INTERLEAVE each WHERE clause would produce its partitions independently (i.e. no interleaving of the bindings with any other where clause)

Aklakan commented 3 years ago

What is still missing is specification of an order of partitions. For example give me for each publication the rooted partitions sorted by the publication year. This could be accomplished by specifying a sort key for the partitions.

So for a single construct query this straight forward:

Given this data

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
eg:pub1 a dct:BibliographicResource ; dct:created "2015-10-10"^^xsd:date .
eg:pub2 a dct:BibliographicResource ; dct:created "2018-03-03"^^xsd:date .

and this query

CONSTRUCT WHERE { ?x a dct:BibliographicResource ; dct:created ?date }
ROOTED IN ?x
PARTITION BY ?x
ORDER PARTITIONS BY DESC(MAX(?date))

(an aggregation expression is needed to resolve multiple values for a sort key within a partition)

the translation would be: (Extra variables inserted for clarity; the name ?partitionVar1 should indicate that in general there could more than 1 partition variable)


SELECT DISTINCT ?sortKey ?partitionVar1 ?x ?date {
  { SELECT ?partitionVar1 (MAX(?date) AS ?sortKey) {
    { ?x a dct:BibliographicResource ; dct:created ?date }
    BIND(?x AS ?partitionVar1)
  } GROUP BY ?partitionVar1 }

  # CORRELATED JOIN here perhaps?
  { ?x a dct:BibliographicResource ; dct:created ?date }
  FILTER(?x = ?partitionVar1) 

} ORDER BY DESC(?sortKey) ?partitionVar1

With the bindings

(?sortKey="2018-03-03"^^xsd:date, ?partitionVar1=:pub2, ?x=:pub2, ?date="2018-03-03"^^xsd:date)
(?sortKey="2015-10-10"^^xsd:date, ?partitionVar1=:pub2, ?x=:pub1, ?date="2015-10-10"^^xsd:date)

Now one just has to interate all consecutive bindings with the same values for the partition variables for each sortkey and instantiate the construct template.

W.r.t. #73 the sort key declaration in the example above would have to change a bit so that for each WHERE clause the sort key could be stated independently. Also, in order allow uniform ordering over multiple partitions I think the sort key would have to be restricted to a single expression - i.e. the evaluation of the sort key is a single concrete rdf term.

CONSTRUCT WHERE { ?x a dct:BibliographicResource ; dct:created ?date }
ROOTED IN ?x
PARTITION BY ?x
SORTKEY ?date
ORDER PARTITIONS BY DESC(MAX(SORTKEY))

Because then it would be possible to generalize partitioning over multiple involved result sets:

CONSTRUCT WHERE { ?x a dct:BibliographicResource ; dct:created ?date }
ROOTED IN ?a
PARTITION BY ?x INTERLEAVE ?int SORTKEY ?date

CONSTRUCT WHERE { ?foo ... ?bar }
ROOTED IN ?foo
PARTITION BY ?foo INTERLEAVE ?int SORTKEY ?bar
ORDER PARTITIONS DESC(MAX(SORTKEY))

This would expand to the following SPARQL1.1 query

SELECT DISTINCT ?sortKey ?int ?x ?data ?foo ?bar
WHERE {
  { SELECT ?int (MAX(?sortKeyContrib) AS ?sortKey)) {
      { ?x a dct:BibliographicResource ; dct:created ?date }
        BIND(?x AS ?int)
        BIND(?date AS ?sortKeyContrib) }
    UNION
      { { ?foo ... ?bar }
         BIND(?foo AS ?int)
         BIND(?bar AS ?sortKeyContrib) }
  } GROUP BY ?int }

  {
      { { ?x a dct:BibliographicResource ; dct:created ?date }
        BIND(?x AS ?int) }
      }
    UNION
      { { ?foo ... ?bar }
        BIND(?x AS ?int) }
  }
} ORDER BY DESC(?sortKey) ?int

This would now yield the bindings in an order that aggregation of partitions just has to consider consecutive bindings. Things however get pretty ugly when the partition keys have different lengths.

Aklakan commented 3 years ago

I removed SELECT from the title because even thought the partitioning and sorting happens on the binding level, the concept as a whole is only useful the context of CONSTRUCT queries.

edgardmarx commented 3 years ago

The use case that @Aklakan is trying to describe is quite a simple one and I think it will be useful for Construct queries. I will try to simplify so everybody can understand.

Imagine you have a simple Construct query with very simple BGP.

Construct {?s :whatever ?o. ?o :something ?y} ...

The problem in the query above is that you can not guarantee in which order the second triple pattern will appear in the result set. It may be in the last position (Imagine a result set with 1M entries). With the specification that we have today a user may have to fetch the entire result set to be able to use the information on this graph pattern. In other words, a user may be interested in iterate through graph fragments built using construct queries. Notice that sorting by one of those variables will not solve the problem.

e.g. sorting by ?s may place the triples starting with ?o at the end of the result set, and vice versa.

This problem does not occur with Select queries as you can control the binding variables in the projected view.

Aklakan commented 3 years ago

Apologies; I rephrased the idea in #128 which should be much clearer in presentation and a better base for discussion.

w3c / sparql-dev