Open VladimirAlexiev opened 5 years ago
Thanks @Tpt. This maybe a bit broader of an issue in RDF4J perhaps ... I'l look into that.
This may also be an issue with the Jena implementation then @afs ? I've tested the Oxigraph queries and they both yield the same results given the test data; see below.
(I could have perhaps created an issue for Jena, but as this concerns the semantics of LATERAL somewhat I reported it here)
$ arq -version
Apache Jena version 4.10.0
$ cat data.ttl
@prefix ex: <http://example.org/> .
ex:s1 a ex:T ; ex:p 11 , 12 , 13 .
ex:s2 a ex:T ; ex:p 21 , 22 , 23 .
ex:s3 a ex:T .
$ arq --explain --data data.ttl --query optional-in-lateral.rq
12:44:39 INFO exec :: QUERY
PREFIX ex: <http://example.org/>
SELECT ?s ?o
WHERE
{ ?s a ex:T
LATERAL
{ OPTIONAL
{ SELECT ?s ?o
WHERE
{ ?s ex:p ?o }
ORDER BY ?o
LIMIT 2
}
}
}
12:44:39 INFO exec :: ALGEBRA
(project (?s ?o)
(lateral
(bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
(leftjoin
(table unit)
(project (?s ?o)
(top (2 ?o)
(bgp (triple ?s <http://example.org/p> ?o)))))))
12:44:39 INFO exec :: BGP :: ?s rdf:type <http://example.org/T>
12:44:39 INFO exec :: Reorder/generic :: ?s rdf:type <http://example.org/T>
12:44:39 INFO exec :: BGP :: <http://example.org/s1> <http://example.org/p> ?o
12:44:39 INFO exec :: Reorder/generic :: <http://example.org/s1> <http://example.org/p> ?o
12:44:39 INFO exec :: BGP :: <http://example.org/s3> <http://example.org/p> ?o
12:44:39 INFO exec :: Reorder/generic :: <http://example.org/s3> <http://example.org/p> ?o
12:44:39 INFO exec :: BGP :: <http://example.org/s2> <http://example.org/p> ?o
12:44:39 INFO exec :: Reorder/generic :: <http://example.org/s2> <http://example.org/p> ?o
--------------
| s | o |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s3 | |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------
$ arq --explain --data data.ttl --query lateral-in-optional.rq
12:44:45 INFO exec :: QUERY
PREFIX ex: <http://example.org/>
SELECT ?s ?o
WHERE
{ ?s a ex:T
OPTIONAL
{ LATERAL
{ SELECT ?s ?o
WHERE
{ ?s ex:p ?o }
ORDER BY ?o
LIMIT 2
}
}
}
12:44:45 INFO exec :: ALGEBRA
(project (?s ?o)
(conditional
(bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
(lateral
(table unit)
(project (?s ?o)
(top (2 ?o)
(bgp (triple ?s <http://example.org/p> ?o)))))))
12:44:45 INFO exec :: BGP :: ?s rdf:type <http://example.org/T>
12:44:45 INFO exec :: Reorder/generic :: ?s rdf:type <http://example.org/T>
12:44:45 INFO exec :: BGP :: <http://example.org/s1> <http://example.org/p> ?o
12:44:45 INFO exec :: Reorder/generic :: <http://example.org/s1> <http://example.org/p> ?o
12:44:45 INFO exec :: BGP :: <http://example.org/s3> <http://example.org/p> ?o
12:44:45 INFO exec :: Reorder/generic :: <http://example.org/s3> <http://example.org/p> ?o
12:44:45 INFO exec :: BGP :: <http://example.org/s2> <http://example.org/p> ?o
12:44:45 INFO exec :: Reorder/generic :: <http://example.org/s2> <http://example.org/p> ?o
--------------
| s | o |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s3 | |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------
Hi @frensjan -- thanks for providing the clear details.
The results look like the same rows to me. And I get the same results locally.
What's the issue you've noticed?
conditional
is a particular way to execute leftjoin
.
Renaming only happens if a variable does not remain in-scope up the evaluation - ?s
does remain in-scope in this case.
What's the issue you've noticed?
Well, the issue is that it's a different result than when you run these queries with Oxigraph apparently!
conditional is a particular way to execute leftjoin.
That's what I've read in the javadoc. I could imagine that it's backed by a different evaluation algorithm and would explain differences.
Renaming only happens if a variable does not remain in-scope up the evaluation - ?s does remain in-scope in this case.
That's also my initial 'stance'. But apparently in Oxigraph this is different?
Please could could record the Oxigraph results here for the record?
Jena5 gets the same results, different order:
--------------
| s | o |
==============
| ex:s3 | |
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------
(there is a different default in-memory graph implementation)
The Oxigraph tests are on Github. The relevant files are:
tested the Oxigraph queries and they both yield the same results given the test data; see below.
I misread that as both queries give equal results
Those two are different results:
subselect_inside_optional.srx :
--------------
| s | o |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 | |
| ex:s3 | |
--------------
subselect_outside_optional.srx:
--------------
| s | o |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 | 21 |
| ex:s2 | 22 |
| ex:s3 | |
--------------
Jena gets the same as Oxigraph those if you use the "ref" query engine (--engine=ref
) or if you switch off the optimizer (--optimize=off
) so Jena is wrong for the normal engine. It's likely the conditional
.
The "ref" engine is very simple, runs exactly the algebra and materializes each intermediate step.
Thanks for investigating this further @afs. I assume choosing the conditional
is an incorrect optimisation by Jena.
RDF4J evaluates all left-joins in this manner (bindings from the LHS are used for variables with the same name on the LHS).
I've raised the issue at RDF4J (which is larger than just optional lateral).
I must say that this does strike me as odd ... I need to explain this to my users:
PREFIX : <http://example.org/>
SELECT * WHERE {
VALUES ?x { :x }
OPTIONAL {
FILTER( BOUND(?x) )
BIND( :y as ?y )
BIND( ?x as ?z )
}
}
yields
---------------
| x | y | z |
===============
| :x | :y | |
---------------
(I intentionally wrote the FILTER clause first in the optional block for dramatic effect)
I understand that this is as per the specification (filters in a group are applied to the group as a whole and if a left join has a group with a filter on the RHS, this filter is applied for the merged solutions from left and right that are compatible). But it's not that intuitive (for me, n=1).
Note that querying the Wikidata SPARQL endpoint (which is still backed by Blazegraph if I'm not mistaken) gives yet another result:
--------------
| x | y | z |
==============
| :x | | |
--------------
Virtuoso is somewhat more aligned with the specification it seems when querying dbpedia. But instead of ?x
not being bound it 'just' throws an error:
Virtuoso 37000 Error SP031: SPARQL compiler: Variable 'x' is used in the query result set but not assigned
Thanks for investigating this further @afs. I assume choosing the
conditional
is an incorrect optimisation by Jena.
Looking at subselect_inside_optional.srx - it is symmetric in :s1 and :s2 but the answers are different. The only asymmetry is the top 2 subselect.
I think that LATERAL rules apply inside the subselect because ?s
is not a hidden variable. I can only explain the results
--------------
| s | o |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 | |
| ex:s3 | |
--------------
if the subselect executes with free ?s
(no lateral injection) when 11 and 12 are first in the ORDER BY
What am I missing?
When it's ?s=:s2
, LATERAL should making all execution within it's right hand side act like ?s
is :s2
only.
SELECT ?s ?o { BIND(ex:s2 AS ?s) ?s ex:p ?o .} ORDER BY ?o LIMIT 2
is 21 and 22 which I'd expect to be leftjoin'ed into the answers.
SELECT * WHERE {
?s a ex:T.
OPTIONAL { LATERAL {SELECT ?s ?o { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}
This is not to argue Jena results for subselect_inside_optional are correct. If they are, it's an accident of conditional
.
The unoptimized form is definitive and it agrees with Oxigraph.
Unoptimized evaluation does fails to propagate ?s
back leading to the effect that the subselect is not having the LATERAL binding any more. That is a bug. (@tpt will understand where public work time has gone.) The optimization, by accident, puts it back in this case by something unrelated to LATERAL.
@frensjan, in
But it's not that intuitive (for me, n=1).
"it's" is to mean the recommendation's algebra or the result which it produces in this instance?
@frensjan wrote --
Virtuoso is somewhat more aligned with the specification it seems when querying dbpedia. But instead of ?x not being bound it 'just' throws an error:
Virtuoso 37000 Error SP031: SPARQL compiler: Variable 'x' is used in the query result set but not assigned
It should perhaps be noted that this error is not mandatory behavior in Virtuoso. There's a checkbox on the /sparql
query form, "Strict checking of void variables", which is ticked by default (resulting in your error). Unticking that box (or adding &signal_unconnected=on
to a results URI) gets this --
----------------
| x | y | z |
================
| :x | | |
----------------
I confess to not yet fully understanding this issue. As it seem that the latter solution set is an error, I would appreciate it if someone who does fully understand this issue (and so can evaluate any patch for it) would log an issue on the Virtuoso project.
@frensjan, in But it's not that intuitive (for me, n=1). "it's" is to mean the recommendation's algebra or the result which it produces in this instance?
I meant the algebra, the semantics of the query (and thus the results produced). This may very well be biased by the fact that I do a lot of imperative programming. But the curly braces around the RHS of OPTIONAL for a lot of people probably provide some kind of hint that what's in it has the same 'data' available (or not).
Filters in OPTIONAL are special, as they have results from both LHS and RHS 'in scope'.
Variables in statement patterns on the RHS shared with the LHS form the join key, so they feel like they're 'in scope'. E.g. in SELECT * { ?s :p ?x OPTIONAL { ?s :q ?y } }
, ?s
will always have the same values on the LHS as on the RHS (if any compatible solution exists of course).
But for e.g. BIND this isn't the case. So you can write:
SELECT ?price ?discount {
VALUES ?price { 10 }
OPTIONAL {
VALUES ?discount { 0.10 }
FILTER( ?price * (1 - ?discount) < 10 )
}
}
But this query has different semantics:
SELECT ?price ?discount {
VALUES ?price { 10 }
OPTIONAL {
VALUES ?discount { 0.10 }
BIND( ?price * (1 - ?discount) AS ?effectivePrice )
FILTER( ?effectivePrice < 10 )
}
}
But perhaps I digress too much. This issue is about LATERAL, not OPTIONAL.
If RDF4J would be 'fixed' though, I would need to train my users in the difference between
P { LATERAL { OPTIONAL Q } }
and P { OPTIONAL { LATERAL Q }
.
E.g., this query gives, for every ?movie
, the average rating of all movies:
SELECT * {
?movie a :Movie
OPTIONAL { LATERAL {
{ SELECT (AVG(?rating) as ?averageRating) {
?movie :rating ?rating
} }
} }
}
That's just hard to grok for me.
What am I (@afs) missing?
That the OPTIONAL is has normal evaluation so the RHS is
{ LATERAL {SELECT ?s ?o { ?s ex:p ?o } ORDER BY ?o LIMIT 2 }
No LHS ?s
until after LATERAL
is evaluated. The LHS of the LATERAL is the unit { }
.
https://ceur-ws.org/Vol-3471/paper8.pdf by @Aklakan Scaling RML and SPARQL-based Knowledge Graph Construction with Apache Spark sec 4.1. Merging CONSTRUCT Queries using LATERAL
Why?
GraphQL is a hot topic amongst developers and tool vendors. There are several implementations of GraphQL over RDF (HyperGraphQL, Comunica, TopQuadrant, StarDog), and we at Onto are also working on an implementation.
GraphQL queries are hierarchical and quite regular. Following the logic of GraphQL resolvers, you do the parent-level query then turn to the child-level.
Assuming a simple companies graph structure and some reasonable
order
andlimit
syntax (GraphQL "input objects"), a query "Give me the top 2 BG cities, and the top 2 companies in each" could be expressed like this in GraphQL:If you try to implement this with a SPARQL subquery, you'll run into what I call the "distributed limit" problem.
limit
in the company subquery will apply globally, so even if you use a limit of 2*2 or even 50k, the first city (Sofia) will gobble up all companies, leaving none for the other city.We at Onto believe that to implement this efficiently, you need the subquery to run in a loop for every row of the parent query.
Previous work
This is a common problem in databases. Eg see StackOverflow: Grouped LIMIT in PostgreSQL: show the first N rows for each group?, which gives the following solutions:
<child-order> OVER (PARTITION BY <parent-id> ORDER BY <parent-order>)
aka using Windowing functions<parent-query> JOIN LATERAL (<child-query>)
, see PostgreSQL (FROM, Lateral, SELECT, heap.io blog Dec 2014), SQL Server (cross apply)WITH
Common Table ExpressionJOIN (COUNT...GROUP BY) WHERE <=5
A Correlated Subquery is like
LEFT JOIN LATERAL ... ON true
, see StackOverflow: Difference between Lateral and Subquery:FROM x1,x2
wherex2
is a table function (egunnest
)Somewhat related SPARQL issues were posted before: #47 (windowing), #9 (partitioning). However, LATERAL is usually faster than Windowing functions (depending on data and indexing).
Two of the leading GraphQL implementations for RDMBS use LATERAL: Hasura and Join Monster
Proposed solution
A key question is how to return the results, to ensure that child rows don't mess up the limit on parent rows.
rdf:List
(Edited Sep 2020 to interleave the Company rows with the City rows, which makes reconstructing the nested objects easier and enables potential streaming. Previously I had all city rows, then all company rows)
Assume this construct (other syntax suggestions are welcome!):
?var
must be bound byfoo
(and exported in itsselect
if it's a subquery)?var
must also be exported bybar
bar
is iterated for every binding of?var
bar
are appended to the results offoo
Then we could use it to implement the query in question:
(It's more likely to have inverse links
?city x:country ?country
and?company x:city ?city
but for simplicity we use straight links)Considerations for backward compatibility
None?