Open VladimirAlexiev opened 5 years ago
Related: "Query hints" #71
https://www.stardog.com/labs/blog/stored-query-service/ explains how to call stored queries in Stardog (similar to subqueries but providing better modularity). It also includes ideas how to use a similar mechanism for correlated (LATERAL) joins and cites this issue.
@kasei, @afs, @jeenbroekstra and @JervenBolleman:
@rubensworks Do you agree with Onto that this issue is quite crucial for implementing GraphQL fully and correctly?
Any example where this would be relevant that is not related to GraphQL?..
An example how a correlated join is a useful addition to un-correlated ones? It's exactly the same situation as in SQL and there're lots of use cases, the classical one is something like "you have a database of products and a set of offers for each product, you want to get 5 highest price offers for each product".
GraphQL is one important use case here but this is a core SPARQL extension proposal which is perfectly valid outside of the GraphQL context.
I think this issue is interesting enough to do a SEP on. Reading through this LATERAL
proposal, I suspect that it overlaps with the PREFER
SEP (proposed for #13) and maybe that can be extended. The syntax/usecase is different, but execution wise I suspect the overlap is much higher. As in PREFER
is a LATERAL
query that must have an ORDER BY
and fixed LIMIT
of 1. Which I had not thought of, and considering I only had my first coffee I am willing to be corrected on ;)
Do you agree with Onto that this issue is quite crucial for implementing GraphQL fully and correctly?
I fully agree on the importance of this issue for SPARQL in general.
I wouldn't call it crucial for GraphQL per se, as different interpretations are possible on how to apply it to RDF data. But for the interpretation that Ontotext has picked, it definitely makes sense.
ignore sql machinations related to lateral and correlated subqueries, to concentrate just on the intent to produce from some dependent sub-expression some number of solutions for each solution of some dominant subexpression.
should not the general "sidewards information passing" mechanism suffice?
is this best accommodated with a union variant?
is it not more a matter of just declaring top-down rather than bottom-up reduction - analogous to the conflagration surrounding EXISTS? in which case an explicit SUBSELECT operator and/or something like an EACH modifier would suffice to declare intent and leave it to the processor to arrange for the s-i-p data/control flow in any and all join forms. as in (in simplified form)
select * {
{select * where { ?country x:city ?city. } limit 2}
{SUBSELECT * where { ?city x:company ?company. } limit 2}
}
or
select * {
{select * where { ?country x:city ?city. } limit 2}
{SELECT * where { ?city x:company ?company. } LIMIT (EACH 2) }
}
@lisp from the perspective of a SPARQL user (rather than implementer) it would seem more intuitive to have a somewhat bottom-up phrasing for this. Based on your example:
select *
where { ?city x:company ?company. }
limit 2
for each {
select *
where { ?country x:city ?city. }
limit 2
}
given the implicit control flow which applies to the rest of the language, this would be
for each {
select *
where { ?country x:city ?city. }
limit 2 }
{ select *
where { ?city x:company ?company. }
limit 2 }
https://medium.com/agnos-ai/sql-features-not-supported-in-sparql-1-1-eb34e3519077#7d71 AGNOS blog giving some examples, but in SQL using the "Northwind" MS SQL example database
@lisp We need a syntax showing the dependency (major-minor) explicitly (as in your https://github.com/w3c/sparql-12/issues/100#issuecomment-858171838).
A symmetric syntax like your https://github.com/w3c/sparql-12/issues/100#issuecomment-705485097 won't work because:
@jaw111 Please let's not spell it minor-major. (Do you write RTL by any chance? ;-)
Here's an example of deeper nesting, and each level has a dependency on the previous level:
@klinovp on twitter: We added correlated subqueries to SPARQL: https://docs.stardog.com/query-stardog/stored-query-service#correlated-subqueries It's similar to LATERAL in Postgres.
what are the circumstances that require the sqs:inputs
mechanism?
@klinovp no RTL experience here.
Looking back to the original proposed solution, I struggle with the UNION clause, but could that simply be omitted and use a syntax for LATERAL something of a mash-up of VALUES and FILTER EXISTS to indicate the variable(s) for which bindings are to be passed sideways/laterally to the inner pattern/subquery.
select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
{
select ?country ?city ?city_name ?population {
bind(<http://sws.geonames.org/732770/> as ?country)
?country x:city ?city.
?city x:name ?city_name.
?city x:population ?population.
}
order by desc(?population)
limit 2
}
LATERAL ?city {
select ?city ?company ?company_name ?revenue {
?city x:company ?company.
?company x:name ?company_name.
?company x:revenue ?revenue.
}
order by desc(?revenue)
limit 2
}
}
One could also pass in multiple variables and make the minor part optional. The outer part need not be a subquery:
select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
bind(<http://sws.geonames.org/732770/> as ?country)
?country x:city ?city.
?city x:name ?city_name.
?city x:population ?population.
optional {
LATERAL ?city ?country {
select ?city ?company ?company_name ?revenue {
?city x:company ?company.
?company x:name ?company_name.
?company x:revenue ?revenue.
?company x:headquarteredIn ?country .
}
order by desc(?revenue)
limit 2
}
}
}
Similarly, what is in the lateral might just be a graph pattern rather than a subquery. This could be used to give hints to the query processor whether to pass in solutions it already has when evaluating an optional clause (nest loop join) or use bottom up approach.
select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
bind(<http://sws.geonames.org/732770/> as ?country)
?country x:city ?city.
?city x:name ?city_name.
?city x:population ?population.
optional {
LATERAL ?city ?country {
?city x:company ?company.
?company x:name ?company_name.
?company x:revenue ?revenue.
?company x:headquarteredIn ?country .
}
}
}
@jaw111 thanks for your syntax proposals!
1) If the main is not a subquery, would we put order/limit at the bottom as for a normal query?
select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
bind(<http://sws.geonames.org/732770/> as ?country)
?country x:city ?city.
?city x:name ?city_name.
?city x:population ?population.
optional {
LATERAL ?city ?country {
select ?city ?company ?company_name ?revenue {
?city x:company ?company.
?company x:name ?company_name.
?company x:revenue ?revenue.
?company x:headquarteredIn ?country .
}
order by desc(?revenue)
limit 2
}
}
} order by desc(?population) limit 2
2) We should always think of the case of multiple nestings:
major
minor
sub-minor
So I'd favor syntaxes that reflect that nesting in the query structure.
the same question applies to the LATERAL proposal as to sqs:inputs: what are the circumstances which require the declaration for the corresponding variables?
@lisp minor should execute for each solution of major. If they are merely patterns (not subqueries) then I think you are right, it's not necessary to declare the lateral variables. If they are subqueries, they declare their exposed variables, and again it's not necessary to declare them in LATERAL.
For the just released Jena 4.6.0 I contributed the "service enhancer" plugin which has lateral joins as one of its features: https://jena.apache.org/documentation/query/service_enhancer.html It's the first release so consider it experimental.
Example: Fetch one English label for every entity:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?s ?l {
VALUES ?s { wd:Q1686799 wd:Q54837 wd:Q54872 wd:Q54871 wd:Q108379795 }
SERVICE <loop:https://query.wikidata.org/sparql> { # SERVICE <loop:> {...} lateral joins with active dataset
SELECT ?l {
?s rdfs:label ?l
FILTER(langMatches(lang(?l), 'en'))
} ORDER BY ?l LIMIT 1
}
}
right, it's not necessary to declare the lateral variables. If they are subqueries, they declare their exposed variables, and again it's not necessary to declare them in LATERAL.
That was also my finding that its not necessary to declare the variables. In principle, lateral joins require different scoping rules as described in the linked document.
Question: could someone comment on the semantics of multiple / consecutive lateral joins? Does lateral join n extend lateral join n-1? Or would this be based on shared variables?
An example use case could be to
The key here is that the two 'lateral clauses' are meant to be unrelated to each other and only be related to the top-level query.
@frensjan If I understand @afs proposal properly, LATERAL
works just like OPTIONAL
i.e. variables bound inside of the first LATERAL
clause will be considered when evaluating the second LATERAL
clause. To use SPARQL SSE notation, X LATERAL { Y } LATERAL { Z }
is parsed to (lateral (lateral X Y) Z)
@frensjan - interesting example.
In the current proposal, it is as @Tpt describes. The second lateral takes the results from the first as input. With no shared variables, that is in effect a cross-product.
What shape of result set are you trying to produce? there can be multiple lines for the same?thread
, ?title
pairs in each lateral.
The description feels like it is a an "details for A, details for B" (A and B themselves not connected) which suggests a UNION
somewhere - that does not get information from each branch on the same row. But if there are 5 rows from the first lateral, for one (thread,title) what is the desired outcome for the 3 rows from the second for that (thread,title)? There isn't a nested table feature in SPARQL - GROUP_CONCAT can be used but it isn't ideal.
Thanks for the responses @afs and @Tpt; understood and as I expected. For as long as the size of the 'legs' is small, the costs of such a cross-product could be acceptable.
The use case would be to show related information in an entity listing / table or in an entity detail view. I.e. the name and number of posts for the OP and the number of posts per thread of a listing of threads.
Using a pseudo GraphQL query this could be something like:
{
threads(limit:10) {
title
creationDate
postCount # lateral join to count(?post) over ?post :parent ?thread
creator { # lateral join to get ?creator :created ?thread
name # fetch ?creator :name ?name
postCount # count(?post) over ?creator :created ?post
}
}
}
Use cases where information is tracked over time it would be useful to show the latest value for a number of attributes; e.g. based on a timestamp associated with the graph such triple occurs in. If the application can assume these attributes change values atomically in this example in the same named graph, this could be solved by one lateral join to the named graph with the last changes. But reality can be stubborn in my experience.
GROUP_CONCAT
would only work for primitive values and not for more elaborate (sub-)solution. Ideally, this could be combined with the FRAMED ideas that are also discussed for SPARQL 1.2.
Note that while I'm using GraphQL here as examples, it is intended as pseudo code / a means to convey intent. I think these use-cases hold regardless of whether GraphQL is part of the technology stack.
with each re-reading, i am a more convinced that the operator at issue should be interpreted in combination with the same solution state as any other operator and affect just the binding environment. it should have no special effect on the algebra. this could simplify both its implementation and its comprehension.
any of the following should be possible
{ ... } optional { lateral x y z { ... } }
{ ... } lateral x y z { { ...} optional { ... } }
{ ... } union { lateral x y z { } }
the combinations are those of the existing sparql operators.
in the above, the second denotes a join, as that is sparql's implicit operator.
in a manner similar to how "select" describes just projection, "lateral" (or whatever it is called) describes just an initial environment.
@lisp that is very interesting, would you mind proposing this as a pull request on 0006 or even as an alternative SEP if you feel that would be easier?
@lisp Some examples would help and I may have understood your suggestion.
Lateral join is the name used in SQL but it isn't really a join as in "evaluate left and right to a table, then combine the tables". I find the name counter-intuitive, but the terminology is already out there.
Is that use of "lateral" is the substitute
operator?
https://github.com/w3c/sparql-12/blob/main/SEP/SEP-0006/sep-0006.md#evaluation-1
pattern2 = inject(pattern, μ1)
Ω1 = eval(D(G), pattern2)
What would happen to the variable in the RHS that are not in the "lateral xyz"? Is the substitution deep or shallow in "lateral x y z { { ...} optional { ... } }"?
Here are three queries that might be used in the GraphQL case (if it's one query for the whole GraphQL request) on the data
:s1 a :T .
:s1 :p "s1-p-1" , "s1-p-2" , "s1-p-3" .
:s1 :q "s1-q-1" , "s1-q-2" , "s1-q-3" .
:s2 a :T .
:s2 :p "s2-p-1" , "s2-p-2" , "s2-p-3" .
:s2 :q "s2-q-1" , "s2-q-2" ,"s2-q-3" .
The first is two laterals, each setting an independent variable p
and q
, and run one after another. The effect is repetition. For a value of p
, there two rows of q
.
SELECT * {
?s a :T .
LATERAL { SELECT * { ?s :p ?p } ORDER BY ?p LIMIT 2 }
LATERAL { SELECT * { ?s :q ?q } ORDER BY ?q LIMIT 2 }
}
-----------------------------
| s | p | q |
=============================
| :s2 | "s2-p-1" | "s2-q-1" |
| :s2 | "s2-p-1" | "s2-q-2" |
| :s2 | "s2-p-2" | "s2-q-1" |
| :s2 | "s2-p-2" | "s2-q-2" |
| :s1 | "s1-p-1" | "s1-q-1" |
| :s1 | "s1-p-1" | "s1-q-2" |
| :s1 | "s1-p-2" | "s1-q-1" |
| :s1 | "s1-p-2" | "s1-q-2" |
-----------------------------
The second has a UNION
SELECT * {
?s a :T .
{ LATERAL { SELECT * { ?s :p ?p } ORDER BY ?p LIMIT 2 } }
UNION
{ LATERAL { SELECT * { ?s :q ?q } ORDER BY ?q LIMIT 2 } }
}
-----------------------------
| s | p | q |
=============================
| :s2 | "s2-p-1" | |
| :s2 | "s2-p-2" | |
| :s2 | | "s2-q-1" |
| :s2 | | "s2-q-2" |
| :s1 | "s1-p-1" | |
| :s1 | "s1-p-2" | |
| :s1 | | "s1-q-1" |
| :s1 | | "s1-q-2" |
-----------------------------
The third is similar to second except it is the same variable being set in each LATERAL and the two LATERAL blocks put in a marker value.
SELECT * {
?s a :T .
{ LATERAL { SELECT * { ?s :p ?o } ORDER BY ?o LIMIT 2 } BIND("A" AS ?x) }
UNION
{ LATERAL { SELECT * { ?s :q ?o } ORDER BY ?o LIMIT 2 } BIND("B" AS ?x) }
}
------------------------
| s | o | x |
========================
| :s2 | "s2-p-1" | "A" |
| :s2 | "s2-p-2" | "A" |
| :s2 | "s2-q-1" | "B" |
| :s2 | "s2-q-2" | "B" |
| :s1 | "s1-p-1" | "A" |
| :s1 | "s1-p-2" | "A" |
| :s1 | "s1-q-1" | "B" |
| :s1 | "s1-q-2" | "B" |
------------------------
I get the intuition behind the use of UNION here. Would this be valid syntax given incorporation of SEP 0006? Or this to be taken as a proposal to extend SEP 0006?
However, I can't grasp how this would translate to the SPARQL algebra. I get the impression - from SPARQL 1.1. spec, the examples in the SPARQL spec on combinations of UNION and OPTIONAL as analogy and the changes from SEP 0006 - that the second lateral would be joined with the empty graph pattern instead of with the ?s a :T
. I.e.:
(union
(lateral
(bgp (triple ?s a :T))
( ... subselect ))
(lateral
(Z) # the empty BGP
( ... subselect )))
I don't see how the ?s a :T
pattern would be distributed over the lateral clauses.
The following expression doesn't have this issue, but this would yield really large expressions/queries if the queries to be laterally joined to would be of any real-world interest, this isn't very user friendly, the redundancy in the query would be an obvious point for errors, and it is perhaps harder for query optimisers to hoist the repeated GroupGraphPattern out of the union.
(union
(lateral
(bgp (triple ?s a :T))
( ... subselect ))
(lateral
(bgp (triple ?s a :T))
( ... subselect )))
From an algebra point of view I guess it would be ideal if to define lateral as an expression with one or more arguments, where arguments 2 and beyond would be evaluated for solutions from the evaluation of the first argument. E.g. to construct expressions such as:
(lateral
(bgp (triple ?s a :T)) # basic graph pattern to laterally join to
( ... subselect 1 ) # laterally joined with arg 1
( ... subselect 2 )) # laterally joined with arg 1
However, having LATERAL as an infix keyword, I don't see a straightforward grammar for this. Or is that your intent for the union of two lateral graph patterns to result in just that?
Hi @frensjan
Yes - you're correct. There must be a bug at my end.
The second query with the union should be
SELECT * {
?s a :T .
LATERAL {
{ SELECT * { ?s :p ?p } ORDER BY ?p LIMIT 2 }
UNION
{ SELECT * { ?s :q ?q } ORDER BY ?q LIMIT 2 }
}
}
The algebra is:
(lateral
(bgp (triple ?s rdf:type :T)) # LHS of LATERAL
(union # RHS of LATERAL
(slice _ 2
(order (?p) (bgp (triple ?s :p ?p))))
(slice _ 2
(order (?q) (bgp (triple ?s :q ?q))))))
to give the shown results for query 2.
It does avoid cross product growth.
It would need another ORDER BY ?p ?q ?s
on the overall query to group the ?p
and ?q
together.
-----------------------------
| s | p | q |
=============================
| :s1 | | "s1-q-1" |
| :s1 | | "s1-q-2" |
| :s2 | | "s2-q-1" |
| :s2 | | "s2-q-2" |
| :s1 | "s1-p-1" | |
| :s1 | "s1-p-2" | |
| :s2 | "s2-p-1" | |
| :s2 | "s2-p-2" | |
-----------------------------
example2.zip has fixed the version.
Good example to add to a test suite.
What would happen to the variable in the RHS that are not in the "lateral xyz"? Is the substitution deep or shallow in "lateral x y z { { ...} optional { ... } }"?
"lateral" is a situation similar to "exists".
with this approach - contrary to my earlier belief, the "lateral" form should declare the environment's variables. these provide initial bindings for variables in any bgp forms and any otherwise free variables in its scope/extent.
if one feels that the tests in
https://github.com/apache/jena/tree/main/jena-arq/testing/ARQ/Lateral
are adequate, i will go ahead and implement "lateral" based on the same mechanism which we use for "exists" and report results for those tests.
"dynamic binding mechanism" and "injection" can be viewed as implementations of each other.
"dynamic binding mechanism" could be an implementation of SEP-0007 - lookup a variable name in the evaluated LHS. The difference with SEP-0007 is only that SEP-0007 formalizes the outcome as algebra by the injection step that puts the value immediately next to the variable use.
"dynamic binding mechanism" and "injection" can be viewed as implementations of each other.
this is not true.
the former concerns environments while the latter - even in the later formulations such as sep0007, concerns forms.
its problems are inherent in that approach.
as discussed, at length, during the correspondence months ago regarding exists, a mechanism based on environments does not introduce the issues which sep0007 endeavours to resolve.
Then I don't understand what you mean by "dynamic binding mechanism" if it is not as I described. "injection" is applied per row, not statically during algebra generation, so in that sense it is dynamic.
Could you give an example of the difference please?
one example is the way blank nodes are manipulated. one aspect of the "injection" discussion concerns how to manage blank nodes which are introduced into a bgp. where the operation occurs in the expression domain, one has to realize that that blank nodes are not to be treated as variables. where the operation is performed by binding variables in an environment, the issues can not arise as the domain is that of values, not lexical expressions.
Injection in SEP-0007 leaves variables in-place. It does not introduce a blank node into a BGP. The original substitute
did do so and that was a problem.
Per row: μ, binding ?x = <uri>
and BGP (?x ?p ?o)
becomes (join (?x ?p ?o) Table(μ))
.
The restriction on ?x
is achieved by an algebra join involving values. This is done for any occurrence of the 3 places variables can be matched in SPARQL (it excludes AS ?x
).
join
isn't the only possible choice. It's advantage is that it copes with an empty solution mapping to become the join-identity and it uses an existing operator. A new (inject μ bgp)
would be possible.
Restriction after binding does work for the three places needed. It does not work for a general pattern.
The same concern arises for MINUS where replacement semantics can change the domain and can cause unexpected changes.
SEP-0007 would benefit from some examples.
the problem arises because, when i read
Definition: Values Insertion
For solution mapping μ, define Table(μ) to be the multiset formed from μ.
Table(μ) = { μ }
Card[μ] = 1
Define the Values Insertion function Replace(X, μ) to replace each occurrence Y of a
Basic Graph Pattern, Property Path Expression, Graph(Var, pattern) in X
with join(Y, Table(μ)).
in sep-0007, it is not evident how and when this is to be accomplished.
the mechanism suggested by text there regarding addressing issue 3, "the possible solutions are restricted by the current row", is underdefined.
the text, above, gives a better idea.
@lisp - thank you for highlighting what is not clear. I'll rework that definition text and add examples.
if one feels that the tests in
https://github.com/apache/jena/tree/main/jena-arq/testing/ARQ/Lateral
are adequate, i will go ahead and implement "lateral" based on the same mechanism which we use for "exists" and report results for those tests.
before i discuss the results, there is a question about the tests.
as i read them, i would expect that the dependent clauses in the lateral-4 and lateral-5 tests would use the :label
predicate, rather than the :p
predicate which appears in the query texts.
@lisp I agree with your reading based on the variable names, but the presented .srj
results are consistent with the <http://example/p>
predicate used in the queries and the dataset.
Seems like something for @afs to give a comment on.
@lisp Thanks.
The use of ?label
is a bit of a distraction - renamed.
i looked at (some of) the literature on this issue, thought about how to integrate an approach into dydra, and completed an initial implementation.
the current mechanism distinguishes those cases which must employ dynamic environments from those for which an alternative aggregation operator suffices.
the discussion and examples are available for review in a notebook.
LATERAL
keyword as per @afs's https://github.com/w3c/sparql-12/blob/main/SEP/SEP-0006/sep-0006.md. (our ref https://ontotext.atlassian.net/browse/GDB-8319)
- Ontotext GraphDB currently supports this with an empty VALUES clause, see https://www.ontotext.com/blog/graphdb-users-ask-does-graphdb-support-reusing-values-in-sub-selects/. This is similar to Multivalued query params #161 (comment) proposed by @lisp
i understand the connection to #161, but, if you elect to introduce an operator, is it still necessary? if so, please describe the cases which require it.
With regards to LATERAL
enhancement and its companion enhancement for inject, how should the following query be interpreted?
SELECT * {
?s :value ?v
LATERAL {
SELECT DISTINCT ?o {
?o :value ?w .
FILTER(?w > ?v)
}
}
}
The intent of the query is to select the ?o
s that have a :value
?w
larger than the :value
?v
of ?s
.
However, based on SEP-0007 it would seem that ?v
isn't bound in the evaluation of the sub-select because it isn't in its projection. This can obviously be fixed:
SELECT * {
?s :value ?v
LATERAL {
SELECT DISTINCT ?v ?o {
?o :value ?w .
FILTER(?w > ?v)
}
}
}
However, this is a bit awkward, as ?v can never have the output role of the sub-select to use the wording from Correlation and Substitution in SPARQL.
@frensjan - nice - this brings out the difference between LATERAL and parameterised queries by variable substitution.
There are two different ?v
variable that never meet (i.e. get joined).
As a query, the inner ?v
can be replaced by any variable not used elsewhere and the query results are the same.
(Jena renames the inner variable so it does not have to have code for scoping issues in other query execution code.)
This is a difference between LATERAL and parameterised queries by variable substitution. I believe they are both are valid use cases - the same machinery applies but there are slight differences in how it is used, here how to treat scope-hidden variables.
It doesn't even have to involve LATERAL:
As a parameterized query:
SELECT ?o ?w {
?o :value ?w .
FILTER(?w > ?v)
}
makes sense. As a query, it rejects everything because the FILTER is error.
https://github.com/w3c/sparql-dev/issues/57 and others.
the change to scoping rules is one of the reasons for LATERAL. see https://observablehq.com/d/dba772c1bf5ed10f
Thanks for the reactions. So would you @afs and @lisp say my second query should work as intended?
If so, don't you agree that it is awkward to select a variable ?v
(which is typically used to expressed the variable is in the output role of the (sub-)select) to convey that it is in the input role?
@afs, what did you mean by
It doesn't even have to involve LATERAL:
The query that I gave as an example translates to something like:
SELECT * {
?s :value ?v
?o :value ?w .
FILTER(?w > ?v)
}
But that's not really the issue I wanted to address.
A more involved example:
PREFIX : <http://example.org>
SELECT * WHERE {
# messages and when they were created
?message1 a :Message .
?message1 :created ?referenceTimestamp .
# for each message, the last ten messages that were created before it
LATERAL {
SELECT ?referenceTimestamp ?message2 {
?message2 a :Message .
?message2 :created ?created .
FILTER( ?created < ?referenceTimestamp)
}
ORDER BY DESC( ?created )
LIMIT 10
}
}
Is the consensus that (given SEP-0006 and 0007) this would work as intended (as per the comments)?
my second query should work as intended
Yes. ?v
has to be in-scope at the top of the LHS (left hand side) of LATERAL.
# for each message, the last ten messages that were created before it
Yes! That's a very important application. The "up to 10 of" for each item is very hard to do otherwise.
?/created
FWIW That's a renamed variable.
We're fleshing out the implementation of LATERAL
for RDF4J; another question to make sure we get the intended semantics right:
In the Oxigraph tests there are two queries:
PREFIX ex: <http://example.org/>
SELECT ?s ?o WHERE {
?s a ex:T.
OPTIONAL { LATERAL {SELECT ?s ?o WHERE { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}
and
PREFIX ex: <http://example.org/>
SELECT ?s ?o WHERE {
?s a ex:T.
LATERAL { OPTIONAL {SELECT ?s ?o WHERE { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}
From the expected results, it seems that in the first query ?s
is considered out of scope for evaluation for each ?s a ex:T.
while in the second query it is in scope.
For me it's surprising that the semantics of these two queries differ. I consider both OPTIONAL
and LATERAL
to have 'nested loop' like semantics; i.e., for each solution / binding set from the left-hand side, variable bindings are 'injected into' the right-hand side graph pattern.
I would consider { P } OPTIONAL { { } LATERAL Q }
to be equivalent to { P } LATERAL { { } OPTIONAL Q }
.
I did noticed a subtle difference in the algebra for Jena:
(project (?s ?o)
(conditional
(bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
(lateral
(table unit)
(project (?s ?o)
(top (2 ?o)
(bgp (triple ?s <http://example.org/p> ?o)))))))
vs
(project (?s ?o)
(lateral
(bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
(leftjoin
(table unit)
(project (?s ?o)
(top (2 ?o)
(bgp (triple ?s <http://example.org/p> ?o)))))))
Note the conditional
vs leftjoin
operators.
I haven't tested the actual behaviour of Jena yet. But as Jena uses variable renaming, I would expect ?s
to be renamed to ?/s
if it were out of scope.
We're fleshing out the implementation of LATERAL for RDF4J
Amazing!
For me it's surprising that the semantics of these two queries differ. I consider both OPTIONAL and LATERAL to have 'nested loop' like semantics; i.e., for each solution / binding set from the left-hand side, variable bindings are 'injected into' the right-hand side graph pattern.
This is the trap. OPTIONAL
do not have "nested loop" semantic but a "left join" semantic. So, in A OPTIONAL { B }
, A
and B
are supposed to be evaluated independently then joined. See this definition in the SPARQL spec. For example the query
SELECT * WHERE { BIND(1 AS ?a } OPTIONAL { BIND(2 AS ?b) FILTER(BOUND(?a)) } }
can be rewritten
SELECT * WHERE { BIND(1 AS ?a } OPTIONAL { BIND(2 AS ?b) FILTER(false) } }
because ?a
is not bound inside of the OPTIONAL
and so will return only a=1
wheras
SELECT * WHERE { BIND(1 AS ?a } LATERAL { BIND(2 AS ?b) FILTER(BOUND(?a)) } }
can be rewritten
SELECT * WHERE { BIND(1 AS ?a } LATERAL { BIND(2 AS ?b) FILTER(true) } }
because ?a
is now bound inside of the LATERAL
and so will return a=1 b=2
.
This is why rewriting OPTIONAL
into nested loops is hard because one needs to take care of not changing the results.
Why?
GraphQL is a hot topic amongst developers and tool vendors. There are several implementations of GraphQL over RDF (HyperGraphQL, Comunica, TopQuadrant, StarDog), and we at Onto are also working on an implementation.
GraphQL queries are hierarchical and quite regular. Following the logic of GraphQL resolvers, you do the parent-level query then turn to the child-level.
Assuming a simple companies graph structure and some reasonable
order
andlimit
syntax (GraphQL "input objects"), a query "Give me the top 2 BG cities, and the top 2 companies in each" could be expressed like this in GraphQL:If you try to implement this with a SPARQL subquery, you'll run into what I call the "distributed limit" problem.
limit
in the company subquery will apply globally, so even if you use a limit of 2*2 or even 50k, the first city (Sofia) will gobble up all companies, leaving none for the other city.We at Onto believe that to implement this efficiently, you need the subquery to run in a loop for every row of the parent query.
Previous work
This is a common problem in databases. Eg see StackOverflow: Grouped LIMIT in PostgreSQL: show the first N rows for each group?, which gives the following solutions:
<child-order> OVER (PARTITION BY <parent-id> ORDER BY <parent-order>)
aka using Windowing functions<parent-query> JOIN LATERAL (<child-query>)
, see PostgreSQL (FROM, Lateral, SELECT, heap.io blog Dec 2014), SQL Server (cross apply)WITH
Common Table ExpressionJOIN (COUNT...GROUP BY) WHERE <=5
A Correlated Subquery is like
LEFT JOIN LATERAL ... ON true
, see StackOverflow: Difference between Lateral and Subquery:FROM x1,x2
wherex2
is a table function (egunnest
)Somewhat related SPARQL issues were posted before: #47 (windowing), #9 (partitioning). However, LATERAL is usually faster than Windowing functions (depending on data and indexing).
Two of the leading GraphQL implementations for RDMBS use LATERAL: Hasura and Join Monster
Proposed solution
A key question is how to return the results, to ensure that child rows don't mess up the limit on parent rows.
rdf:List
(Edited Sep 2020 to interleave the Company rows with the City rows, which makes reconstructing the nested objects easier and enables potential streaming. Previously I had all city rows, then all company rows)
Assume this construct (other syntax suggestions are welcome!):
?var
must be bound byfoo
(and exported in itsselect
if it's a subquery)?var
must also be exported bybar
bar
is iterated for every binding of?var
bar
are appended to the results offoo
Then we could use it to implement the query in question:
(It's more likely to have inverse links
?city x:country ?country
and?company x:city ?city
but for simplicity we use straight links)Considerations for backward compatibility
None?