w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/
Other
123 stars 19 forks source link

JOIN LATERAL or Correlated Subquery #100

Open VladimirAlexiev opened 5 years ago

VladimirAlexiev commented 5 years ago

Why?

GraphQL is a hot topic amongst developers and tool vendors. There are several implementations of GraphQL over RDF (HyperGraphQL, Comunica, TopQuadrant, StarDog), and we at Onto are also working on an implementation.

GraphQL queries are hierarchical and quite regular. Following the logic of GraphQL resolvers, you do the parent-level query then turn to the child-level.

Assuming a simple companies graph structure and some reasonable order and limit syntax (GraphQL "input objects"), a query "Give me the top 2 BG cities, and the top 2 companies in each" could be expressed like this in GraphQL:

{
  country(id:"...bulgaria") {
    city (order: {population:DESC}, limit: 2) {
      id name population
      company (order: {revenue:DESC}, limit: 2) {
        id name revenue
      }
    }
  }
}

If you try to implement this with a SPARQL subquery, you'll run into what I call the "distributed limit" problem. limit in the company subquery will apply globally, so even if you use a limit of 2*2 or even 50k, the first city (Sofia) will gobble up all companies, leaving none for the other city.

We at Onto believe that to implement this efficiently, you need the subquery to run in a loop for every row of the parent query.

Previous work

This is a common problem in databases. Eg see StackOverflow: Grouped LIMIT in PostgreSQL: show the first N rows for each group?, which gives the following solutions:

  1. <child-order> OVER (PARTITION BY <parent-id> ORDER BY <parent-order>) aka using Windowing functions
  2. <parent-query> JOIN LATERAL (<child-query>), see PostgreSQL (FROM, Lateral, SELECT, heap.io blog Dec 2014), SQL Server (cross apply)
  3. WITH Common Table Expression
  4. JOIN (COUNT...GROUP BY) WHERE <=5

A Correlated Subquery is like LEFT JOIN LATERAL ... ON true, see StackOverflow: Difference between Lateral and Subquery:

Somewhat related SPARQL issues were posted before: #47 (windowing), #9 (partitioning). However, LATERAL is usually faster than Windowing functions (depending on data and indexing).

Two of the leading GraphQL implementations for RDMBS use LATERAL: Hasura and Join Monster

Proposed solution

A key question is how to return the results, to ensure that child rows don't mess up the limit on parent rows.

(Edited Sep 2020 to interleave the Company rows with the City rows, which makes reconstructing the nested objects easier and enables potential streaming. Previously I had all city rows, then all company rows)

country city city_name population company company_name revenue COMMENT
geo:732800/ geo:727011/ Sofia 1152556
geo:727011/ co:123 Sofia Foo Co 987 Sofia company 1
geo:727011/ co:456 Sofia Bar Co 123 Sofia company 2
geo:732800/ geo:728193/ Plovdiv 340494
geo:728193/ co:789 Plovdiv Foo Co 987 Plovdiv company 1
geo:728193/ co:012 Plovdiv Bar Co 123 Plovdiv company 2

Assume this construct (other syntax suggestions are welcome!):

{foo} UNION LATERAL(?var) {bar}

Then we could use it to implement the query in question:

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  {select ?country ?city ?city_name ?population {
    bind(<http://sws.geonames.org/732770/> as ?country)
    ?country x:city ?city.
    ?city x:name ?city_name.
    ?city x:population ?population.
  } order by desc(?population) limit 2}
  UNION LATERAL(?city)
  {select ?city ?company ?company_name ?revenue {
    ?city x:company ?company.
    ?company x:name ?company_name.
    ?company x:revenue ?revenue
  } order by desc(?revenue) limit 2}
}

(It's more likely to have inverse links ?city x:country ?country and ?company x:city ?city but for simplicity we use straight links)

Considerations for backward compatibility

None?

frensjan commented 8 months ago

Thanks @Tpt. This maybe a bit broader of an issue in RDF4J perhaps ... I'l look into that.

frensjan commented 8 months ago

This may also be an issue with the Jena implementation then @afs ? I've tested the Oxigraph queries and they both yield the same results given the test data; see below.

(I could have perhaps created an issue for Jena, but as this concerns the semantics of LATERAL somewhat I reported it here)

$ arq -version
Apache Jena version 4.10.0
$ cat data.ttl
@prefix ex: <http://example.org/> .

ex:s1 a ex:T ; ex:p 11 , 12 , 13 .
ex:s2 a ex:T ; ex:p 21 , 22 , 23 .
ex:s3 a ex:T .
$ arq --explain --data data.ttl --query optional-in-lateral.rq
12:44:39 INFO  exec            :: QUERY
  PREFIX  ex:   <http://example.org/>

  SELECT  ?s ?o
  WHERE
    { ?s  a  ex:T
      LATERAL
        { OPTIONAL
            { SELECT  ?s ?o
              WHERE
                { ?s  ex:p  ?o }
              ORDER BY ?o
              LIMIT   2
            }
        }
    }
12:44:39 INFO  exec            :: ALGEBRA
  (project (?s ?o)
    (lateral
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (leftjoin
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))
12:44:39 INFO  exec            :: BGP ::   ?s rdf:type <http://example.org/T>
12:44:39 INFO  exec            :: Reorder/generic ::   ?s rdf:type <http://example.org/T>
12:44:39 INFO  exec            :: BGP ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:39 INFO  exec            :: Reorder/generic ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:39 INFO  exec            :: BGP ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:39 INFO  exec            :: Reorder/generic ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:39 INFO  exec            :: BGP ::   <http://example.org/s2> <http://example.org/p> ?o
12:44:39 INFO  exec            :: Reorder/generic ::   <http://example.org/s2> <http://example.org/p> ?o
--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s3 |    |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------
$ arq --explain --data data.ttl --query lateral-in-optional.rq
12:44:45 INFO  exec            :: QUERY
  PREFIX  ex:   <http://example.org/>

  SELECT  ?s ?o
  WHERE
    { ?s  a  ex:T
      OPTIONAL
        { LATERAL
            { SELECT  ?s ?o
              WHERE
                { ?s  ex:p  ?o }
              ORDER BY ?o
              LIMIT   2
            }
        }
    }
12:44:45 INFO  exec            :: ALGEBRA
  (project (?s ?o)
    (conditional
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (lateral
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))
12:44:45 INFO  exec            :: BGP ::   ?s rdf:type <http://example.org/T>
12:44:45 INFO  exec            :: Reorder/generic ::   ?s rdf:type <http://example.org/T>
12:44:45 INFO  exec            :: BGP ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:45 INFO  exec            :: Reorder/generic ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:45 INFO  exec            :: BGP ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:45 INFO  exec            :: Reorder/generic ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:45 INFO  exec            :: BGP ::   <http://example.org/s2> <http://example.org/p> ?o
12:44:45 INFO  exec            :: Reorder/generic ::   <http://example.org/s2> <http://example.org/p> ?o
--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s3 |    |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------
afs commented 8 months ago

Hi @frensjan -- thanks for providing the clear details.

The results look like the same rows to me. And I get the same results locally.

What's the issue you've noticed?

conditional is a particular way to execute leftjoin.

Renaming only happens if a variable does not remain in-scope up the evaluation - ?s does remain in-scope in this case.

frensjan commented 8 months ago

What's the issue you've noticed?

Well, the issue is that it's a different result than when you run these queries with Oxigraph apparently!

conditional is a particular way to execute leftjoin.

That's what I've read in the javadoc. I could imagine that it's backed by a different evaluation algorithm and would explain differences.

Renaming only happens if a variable does not remain in-scope up the evaluation - ?s does remain in-scope in this case.

That's also my initial 'stance'. But apparently in Oxigraph this is different?

afs commented 8 months ago

Please could could record the Oxigraph results here for the record?

Jena5 gets the same results, different order:

--------------
| s     | o  |
==============
| ex:s3 |    |
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------

(there is a different default in-memory graph implementation)

frensjan commented 8 months ago

The Oxigraph tests are on Github. The relevant files are:

afs commented 8 months ago

tested the Oxigraph queries and they both yield the same results given the test data; see below.

I misread that as both queries give equal results

Those two are different results:

subselect_inside_optional.srx :

--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 |    |
| ex:s3 |    |
--------------

subselect_outside_optional.srx:

--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 | 21 |
| ex:s2 | 22 |
| ex:s3 |    |
--------------
afs commented 8 months ago

Jena gets the same as Oxigraph those if you use the "ref" query engine (--engine=ref) or if you switch off the optimizer (--optimize=off) so Jena is wrong for the normal engine. It's likely the conditional.

The "ref" engine is very simple, runs exactly the algebra and materializes each intermediate step.

https://github.com/apache/jena/issues/2218

frensjan commented 8 months ago

Thanks for investigating this further @afs. I assume choosing the conditional is an incorrect optimisation by Jena.

RDF4J evaluates all left-joins in this manner (bindings from the LHS are used for variables with the same name on the LHS).

I've raised the issue at RDF4J (which is larger than just optional lateral).


Maybe not the easiest to explain to users (n=1)

I must say that this does strike me as odd ... I need to explain this to my users:

PREFIX : <http://example.org/>

SELECT * WHERE {
    VALUES ?x { :x }
    OPTIONAL {
        FILTER( BOUND(?x) )
        BIND( :y as ?y )
        BIND( ?x as ?z )
    }
}

yields

---------------
| x  | y  | z |
===============
| :x | :y |   |
---------------

(I intentionally wrote the FILTER clause first in the optional block for dramatic effect)

I understand that this is as per the specification (filters in a group are applied to the group as a whole and if a left join has a group with a filter on the RHS, this filter is applied for the merged solutions from left and right that are compatible). But it's not that intuitive (for me, n=1).

Blazegraph

Note that querying the Wikidata SPARQL endpoint (which is still backed by Blazegraph if I'm not mistaken) gives yet another result:

--------------
| x  | y | z |
==============
| :x |   |   |
--------------

Virtuoso

Virtuoso is somewhat more aligned with the specification it seems when querying dbpedia. But instead of ?x not being bound it 'just' throws an error:

Virtuoso 37000 Error SP031: SPARQL compiler: Variable 'x' is used in the query result set but not assigned

afs commented 8 months ago

Thanks for investigating this further @afs. I assume choosing the conditional is an incorrect optimisation by Jena.

Looking at subselect_inside_optional.srx - it is symmetric in :s1 and :s2 but the answers are different. The only asymmetry is the top 2 subselect.

I think that LATERAL rules apply inside the subselect because ?s is not a hidden variable. I can only explain the results

--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 |    |
| ex:s3 |    |
--------------

if the subselect executes with free ?s (no lateral injection) when 11 and 12 are first in the ORDER BY

What am I missing? When it's ?s=:s2, LATERAL should making all execution within it's right hand side act like ?s is :s2 only.

SELECT ?s ?o { BIND(ex:s2 AS ?s) ?s ex:p ?o .} ORDER BY ?o LIMIT 2 is 21 and 22 which I'd expect to be leftjoin'ed into the answers.

SELECT * WHERE {
    ?s a ex:T.
    OPTIONAL { LATERAL {SELECT ?s ?o { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}

This is not to argue Jena results for subselect_inside_optional are correct. If they are, it's an accident of conditional. The unoptimized form is definitive and it agrees with Oxigraph.

Unoptimized evaluation does fails to propagate ?s back leading to the effect that the subselect is not having the LATERAL binding any more. That is a bug. (@tpt will understand where public work time has gone.) The optimization, by accident, puts it back in this case by something unrelated to LATERAL.

lisp commented 8 months ago

@frensjan, in

But it's not that intuitive (for me, n=1).

"it's" is to mean the recommendation's algebra or the result which it produces in this instance?

TallTed commented 8 months ago

@frensjan wrote --

Virtuoso is somewhat more aligned with the specification it seems when querying dbpedia. But instead of ?x not being bound it 'just' throws an error:

Virtuoso 37000 Error SP031: SPARQL compiler: Variable 'x' is used in the query result set but not assigned

It should perhaps be noted that this error is not mandatory behavior in Virtuoso. There's a checkbox on the /sparql query form, "Strict checking of void variables", which is ticked by default (resulting in your error). Unticking that box (or adding &signal_unconnected=on to a results URI) gets this --

----------------
| x  | y  | z  |
================
| :x |    |    |
----------------

I confess to not yet fully understanding this issue. As it seem that the latter solution set is an error, I would appreciate it if someone who does fully understand this issue (and so can evaluate any patch for it) would log an issue on the Virtuoso project.

frensjan commented 8 months ago

@frensjan, in But it's not that intuitive (for me, n=1). "it's" is to mean the recommendation's algebra or the result which it produces in this instance?

I meant the algebra, the semantics of the query (and thus the results produced). This may very well be biased by the fact that I do a lot of imperative programming. But the curly braces around the RHS of OPTIONAL for a lot of people probably provide some kind of hint that what's in it has the same 'data' available (or not).

Filters in OPTIONAL are special, as they have results from both LHS and RHS 'in scope'.

Variables in statement patterns on the RHS shared with the LHS form the join key, so they feel like they're 'in scope'. E.g. in SELECT * { ?s :p ?x OPTIONAL { ?s :q ?y } }, ?s will always have the same values on the LHS as on the RHS (if any compatible solution exists of course).

But for e.g. BIND this isn't the case. So you can write:

SELECT ?price ?discount {
    VALUES ?price { 10 }
    OPTIONAL {
        VALUES ?discount { 0.10 }
        FILTER( ?price * (1 - ?discount) < 10 )
    }
}

But this query has different semantics:

SELECT ?price ?discount {
    VALUES ?price { 10 }
    OPTIONAL {
        VALUES ?discount { 0.10 }
        BIND( ?price * (1 - ?discount) AS ?effectivePrice )
        FILTER( ?effectivePrice < 10 )
    }
}
frensjan commented 8 months ago

But perhaps I digress too much. This issue is about LATERAL, not OPTIONAL.

If RDF4J would be 'fixed' though, I would need to train my users in the difference between P { LATERAL { OPTIONAL Q } } and P { OPTIONAL { LATERAL Q }.

E.g., this query gives, for every ?movie, the average rating of all movies:

SELECT * {
    ?movie a :Movie
    OPTIONAL { LATERAL {
        { SELECT (AVG(?rating) as ?averageRating) {
            ?movie :rating ?rating
        } }
    } }
}

That's just hard to grok for me.

afs commented 8 months ago

What am I (@afs) missing?

That the OPTIONAL is has normal evaluation so the RHS is

{ LATERAL {SELECT ?s ?o { ?s ex:p ?o } ORDER BY ?o LIMIT 2 }

No LHS ?s until after LATERAL is evaluated. The LHS of the LATERAL is the unit { }.

VladimirAlexiev commented 7 months ago

https://ceur-ws.org/Vol-3471/paper8.pdf by @Aklakan Scaling RML and SPARQL-based Knowledge Graph Construction with Apache Spark sec 4.1. Merging CONSTRUCT Queries using LATERAL