JOIN LATERAL or Correlated Subquery

VladimirAlexiev commented 5 years ago

Why?

GraphQL is a hot topic amongst developers and tool vendors. There are several implementations of GraphQL over RDF (HyperGraphQL, Comunica, TopQuadrant, StarDog), and we at Onto are also working on an implementation.

GraphQL queries are hierarchical and quite regular. Following the logic of GraphQL resolvers, you do the parent-level query then turn to the child-level.

Assuming a simple companies graph structure and some reasonable order and limit syntax (GraphQL "input objects"), a query "Give me the top 2 BG cities, and the top 2 companies in each" could be expressed like this in GraphQL:

{
  country(id:"...bulgaria") {
    city (order: {population:DESC}, limit: 2) {
      id name population
      company (order: {revenue:DESC}, limit: 2) {
        id name revenue
      }
    }
  }
}

If you try to implement this with a SPARQL subquery, you'll run into what I call the "distributed limit" problem. limit in the company subquery will apply globally, so even if you use a limit of 2*2 or even 50k, the first city (Sofia) will gobble up all companies, leaving none for the other city.

We at Onto believe that to implement this efficiently, you need the subquery to run in a loop for every row of the parent query.

Previous work

This is a common problem in databases. Eg see StackOverflow: Grouped LIMIT in PostgreSQL: show the first N rows for each group?, which gives the following solutions:

<child-order> OVER (PARTITION BY <parent-id> ORDER BY <parent-order>) aka using Windowing functions
<parent-query> JOIN LATERAL (<child-query>), see PostgreSQL (FROM, Lateral, SELECT, heap.io blog Dec 2014), SQL Server (cross apply)
WITH Common Table Expression
JOIN (COUNT...GROUP BY) WHERE <=5

A Correlated Subquery is like LEFT JOIN LATERAL ... ON true, see StackOverflow: Difference between Lateral and Subquery:

Correlated subquery has limitations: it cannot return multiple columns and multiple rows
LATERAL is made for that: it's assumed automatically for cross-joins FROM x1,x2 where x2 is a table function (eg unnest)

Somewhat related SPARQL issues were posted before: #47 (windowing), #9 (partitioning). However, LATERAL is usually faster than Windowing functions (depending on data and indexing).

Two of the leading GraphQL implementations for RDMBS use LATERAL: Hasura and Join Monster

Proposed solution

A key question is how to return the results, to ensure that child rows don't mess up the limit on parent rows.

We've tried structuring with CONSTRUCT but it doesn't support ordering and we don't want to mess with rdf:List
StarDog's Extending the Solution with ARRAY may be relevant, but will require too large extension of SPARQL
So we're thinking of using SELECT with a UNION-style distribution of parent and child values

(Edited Sep 2020 to interleave the Company rows with the City rows, which makes reconstructing the nested objects easier and enables potential streaming. Previously I had all city rows, then all company rows)

country	city	city_name	population	company	company_name	revenue	COMMENT
geo:732800/	geo:727011/	Sofia	1152556
	geo:727011/			co:123	Sofia Foo Co	987	Sofia company 1
	geo:727011/			co:456	Sofia Bar Co	123	Sofia company 2
geo:732800/	geo:728193/	Plovdiv	340494
	geo:728193/			co:789	Plovdiv Foo Co	987	Plovdiv company 1
	geo:728193/			co:012	Plovdiv Bar Co	123	Plovdiv company 2

Assume this construct (other syntax suggestions are welcome!):

{foo} UNION LATERAL(?var) {bar}

?var must be bound by foo (and exported in its select if it's a subquery)
?var must also be exported by bar
bar is iterated for every binding of ?var
the results of bar are appended to the results of foo

Then we could use it to implement the query in question:

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  {select ?country ?city ?city_name ?population {
    bind(<http://sws.geonames.org/732770/> as ?country)
    ?country x:city ?city.
    ?city x:name ?city_name.
    ?city x:population ?population.
  } order by desc(?population) limit 2}
  UNION LATERAL(?city)
  {select ?city ?company ?company_name ?revenue {
    ?city x:company ?company.
    ?company x:name ?company_name.
    ?company x:revenue ?revenue
  } order by desc(?revenue) limit 2}
}

(It's more likely to have inverse links ?city x:country ?country and ?company x:city ?city but for simplicity we use straight links)

Considerations for backward compatibility

None?

frensjan commented 8 months ago

Thanks @Tpt. This maybe a bit broader of an issue in RDF4J perhaps ... I'l look into that.

frensjan commented 8 months ago

This may also be an issue with the Jena implementation then @afs ? I've tested the Oxigraph queries and they both yield the same results given the test data; see below.

(I could have perhaps created an issue for Jena, but as this concerns the semantics of LATERAL somewhat I reported it here)

$ arq -version
Apache Jena version 4.10.0

$ cat data.ttl
@prefix ex: <http://example.org/> .

ex:s1 a ex:T ; ex:p 11 , 12 , 13 .
ex:s2 a ex:T ; ex:p 21 , 22 , 23 .
ex:s3 a ex:T .

$ arq --explain --data data.ttl --query optional-in-lateral.rq
12:44:39 INFO  exec            :: QUERY
  PREFIX  ex:   <http://example.org/>

  SELECT  ?s ?o
  WHERE
    { ?s  a  ex:T
      LATERAL
        { OPTIONAL
            { SELECT  ?s ?o
              WHERE
                { ?s  ex:p  ?o }
              ORDER BY ?o
              LIMIT   2
            }
        }
    }
12:44:39 INFO  exec            :: ALGEBRA
  (project (?s ?o)
    (lateral
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (leftjoin
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))
12:44:39 INFO  exec            :: BGP ::   ?s rdf:type <http://example.org/T>
12:44:39 INFO  exec            :: Reorder/generic ::   ?s rdf:type <http://example.org/T>
12:44:39 INFO  exec            :: BGP ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:39 INFO  exec            :: Reorder/generic ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:39 INFO  exec            :: BGP ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:39 INFO  exec            :: Reorder/generic ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:39 INFO  exec            :: BGP ::   <http://example.org/s2> <http://example.org/p> ?o
12:44:39 INFO  exec            :: Reorder/generic ::   <http://example.org/s2> <http://example.org/p> ?o
--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s3 |    |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------

$ arq --explain --data data.ttl --query lateral-in-optional.rq
12:44:45 INFO  exec            :: QUERY
  PREFIX  ex:   <http://example.org/>

  SELECT  ?s ?o
  WHERE
    { ?s  a  ex:T
      OPTIONAL
        { LATERAL
            { SELECT  ?s ?o
              WHERE
                { ?s  ex:p  ?o }
              ORDER BY ?o
              LIMIT   2
            }
        }
    }
12:44:45 INFO  exec            :: ALGEBRA
  (project (?s ?o)
    (conditional
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (lateral
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))
12:44:45 INFO  exec            :: BGP ::   ?s rdf:type <http://example.org/T>
12:44:45 INFO  exec            :: Reorder/generic ::   ?s rdf:type <http://example.org/T>
12:44:45 INFO  exec            :: BGP ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:45 INFO  exec            :: Reorder/generic ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:45 INFO  exec            :: BGP ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:45 INFO  exec            :: Reorder/generic ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:45 INFO  exec            :: BGP ::   <http://example.org/s2> <http://example.org/p> ?o
12:44:45 INFO  exec            :: Reorder/generic ::   <http://example.org/s2> <http://example.org/p> ?o
--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s3 |    |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------

afs commented 8 months ago

Hi @frensjan -- thanks for providing the clear details.

The results look like the same rows to me. And I get the same results locally.

What's the issue you've noticed?

conditional is a particular way to execute leftjoin.

Renaming only happens if a variable does not remain in-scope up the evaluation - ?s does remain in-scope in this case.

frensjan commented 8 months ago

What's the issue you've noticed?

Well, the issue is that it's a different result than when you run these queries with Oxigraph apparently!

conditional is a particular way to execute leftjoin.

That's what I've read in the javadoc. I could imagine that it's backed by a different evaluation algorithm and would explain differences.

Renaming only happens if a variable does not remain in-scope up the evaluation - ?s does remain in-scope in this case.

That's also my initial 'stance'. But apparently in Oxigraph this is different?

afs commented 8 months ago

Please could could record the Oxigraph results here for the record?

Jena5 gets the same results, different order:

--------------
| s     | o  |
==============
| ex:s3 |    |
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------

(there is a different default in-memory graph implementation)

frensjan commented 8 months ago

The Oxigraph tests are on Github. The relevant files are:

subselect_inside_optional.rq
subselect_inside_optional.srx
subselect_outside_optional.rq
subselect_outside_optional.srx

afs commented 8 months ago

tested the Oxigraph queries and they both yield the same results given the test data; see below.

I misread that as both queries give equal results

Those two are different results:

subselect_inside_optional.srx :

--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 |    |
| ex:s3 |    |
--------------

subselect_outside_optional.srx:

--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 | 21 |
| ex:s2 | 22 |
| ex:s3 |    |
--------------

afs commented 8 months ago

Jena gets the same as Oxigraph those if you use the "ref" query engine (--engine=ref) or if you switch off the optimizer (--optimize=off) so Jena is wrong for the normal engine. It's likely the conditional.

The "ref" engine is very simple, runs exactly the algebra and materializes each intermediate step.

https://github.com/apache/jena/issues/2218

frensjan commented 8 months ago

Thanks for investigating this further @afs. I assume choosing the conditional is an incorrect optimisation by Jena.

RDF4J evaluates all left-joins in this manner (bindings from the LHS are used for variables with the same name on the LHS).

I've raised the issue at RDF4J (which is larger than just optional lateral).

Maybe not the easiest to explain to users (n=1)

I must say that this does strike me as odd ... I need to explain this to my users:

PREFIX : <http://example.org/>

SELECT * WHERE {
    VALUES ?x { :x }
    OPTIONAL {
        FILTER( BOUND(?x) )
        BIND( :y as ?y )
        BIND( ?x as ?z )
    }
}

yields

---------------
| x  | y  | z |
===============
| :x | :y |   |
---------------

(I intentionally wrote the FILTER clause first in the optional block for dramatic effect)

I understand that this is as per the specification (filters in a group are applied to the group as a whole and if a left join has a group with a filter on the RHS, this filter is applied for the merged solutions from left and right that are compatible). But it's not that intuitive (for me, n=1).

Blazegraph

Note that querying the Wikidata SPARQL endpoint (which is still backed by Blazegraph if I'm not mistaken) gives yet another result:

--------------
| x  | y | z |
==============
| :x |   |   |
--------------

Virtuoso

Virtuoso is somewhat more aligned with the specification it seems when querying dbpedia. But instead of ?x not being bound it 'just' throws an error:

Virtuoso 37000 Error SP031: SPARQL compiler: Variable 'x' is used in the query result set but not assigned

afs commented 8 months ago

Thanks for investigating this further @afs. I assume choosing the conditional is an incorrect optimisation by Jena.

Looking at subselect_inside_optional.srx - it is symmetric in :s1 and :s2 but the answers are different. The only asymmetry is the top 2 subselect.

I think that LATERAL rules apply inside the subselect because ?s is not a hidden variable. I can only explain the results

--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 |    |
| ex:s3 |    |
--------------

if the subselect executes with free ?s (no lateral injection) when 11 and 12 are first in the ORDER BY

What am I missing? When it's ?s=:s2, LATERAL should making all execution within it's right hand side act like ?s is :s2 only.

SELECT ?s ?o { BIND(ex:s2 AS ?s) ?s ex:p ?o .} ORDER BY ?o LIMIT 2 is 21 and 22 which I'd expect to be leftjoin'ed into the answers.

SELECT * WHERE {
    ?s a ex:T.
    OPTIONAL { LATERAL {SELECT ?s ?o { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}

This is not to argue Jena results for subselect_inside_optional are correct. If they are, it's an accident of conditional. The unoptimized form is definitive and it agrees with Oxigraph.

Unoptimized evaluation does fails to propagate ?s back leading to the effect that the subselect is not having the LATERAL binding any more. That is a bug. (@tpt will understand where public work time has gone.) The optimization, by accident, puts it back in this case by something unrelated to LATERAL.

lisp commented 8 months ago

@frensjan, in

But it's not that intuitive (for me, n=1).

"it's" is to mean the recommendation's algebra or the result which it produces in this instance?

TallTed commented 8 months ago

@frensjan wrote --

Virtuoso is somewhat more aligned with the specification it seems when querying dbpedia. But instead of ?x not being bound it 'just' throws an error:

Virtuoso 37000 Error SP031: SPARQL compiler: Variable 'x' is used in the query result set but not assigned

It should perhaps be noted that this error is not mandatory behavior in Virtuoso. There's a checkbox on the /sparql query form, "Strict checking of void variables", which is ticked by default (resulting in your error). Unticking that box (or adding &signal_unconnected=on to a results URI) gets this --

----------------
| x  | y  | z  |
================
| :x |    |    |
----------------

I confess to not yet fully understanding this issue. As it seem that the latter solution set is an error, I would appreciate it if someone who does fully understand this issue (and so can evaluate any patch for it) would log an issue on the Virtuoso project.

frensjan commented 8 months ago

@frensjan, in But it's not that intuitive (for me, n=1). "it's" is to mean the recommendation's algebra or the result which it produces in this instance?

I meant the algebra, the semantics of the query (and thus the results produced). This may very well be biased by the fact that I do a lot of imperative programming. But the curly braces around the RHS of OPTIONAL for a lot of people probably provide some kind of hint that what's in it has the same 'data' available (or not).

Filters in OPTIONAL are special, as they have results from both LHS and RHS 'in scope'.

Variables in statement patterns on the RHS shared with the LHS form the join key, so they feel like they're 'in scope'. E.g. in SELECT * { ?s :p ?x OPTIONAL { ?s :q ?y } }, ?s will always have the same values on the LHS as on the RHS (if any compatible solution exists of course).

But for e.g. BIND this isn't the case. So you can write:

SELECT ?price ?discount {
    VALUES ?price { 10 }
    OPTIONAL {
        VALUES ?discount { 0.10 }
        FILTER( ?price * (1 - ?discount) < 10 )
    }
}

But this query has different semantics:

SELECT ?price ?discount {
    VALUES ?price { 10 }
    OPTIONAL {
        VALUES ?discount { 0.10 }
        BIND( ?price * (1 - ?discount) AS ?effectivePrice )
        FILTER( ?effectivePrice < 10 )
    }
}

frensjan commented 8 months ago

But perhaps I digress too much. This issue is about LATERAL, not OPTIONAL.

If RDF4J would be 'fixed' though, I would need to train my users in the difference between P { LATERAL { OPTIONAL Q } } and P { OPTIONAL { LATERAL Q }.

E.g., this query gives, for every ?movie, the average rating of all movies:

SELECT * {
    ?movie a :Movie
    OPTIONAL { LATERAL {
        { SELECT (AVG(?rating) as ?averageRating) {
            ?movie :rating ?rating
        } }
    } }
}

That's just hard to grok for me.

afs commented 8 months ago

What am I (@afs) missing?

That the OPTIONAL is has normal evaluation so the RHS is

{ LATERAL {SELECT ?s ?o { ?s ex:p ?o } ORDER BY ?o LIMIT 2 }

No LHS ?s until after LATERAL is evaluated. The LHS of the LATERAL is the unit { }.

VladimirAlexiev commented 7 months ago

https://ceur-ws.org/Vol-3471/paper8.pdf by @Aklakan Scaling RML and SPARQL-based Knowledge Graph Construction with Apache Spark sec 4.1. Merging CONSTRUCT Queries using LATERAL

w3c / sparql-dev