w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/
Other
121 stars 19 forks source link

JOIN LATERAL or Correlated Subquery #100

Open VladimirAlexiev opened 5 years ago

VladimirAlexiev commented 5 years ago

Why?

GraphQL is a hot topic amongst developers and tool vendors. There are several implementations of GraphQL over RDF (HyperGraphQL, Comunica, TopQuadrant, StarDog), and we at Onto are also working on an implementation.

GraphQL queries are hierarchical and quite regular. Following the logic of GraphQL resolvers, you do the parent-level query then turn to the child-level.

Assuming a simple companies graph structure and some reasonable order and limit syntax (GraphQL "input objects"), a query "Give me the top 2 BG cities, and the top 2 companies in each" could be expressed like this in GraphQL:

{
  country(id:"...bulgaria") {
    city (order: {population:DESC}, limit: 2) {
      id name population
      company (order: {revenue:DESC}, limit: 2) {
        id name revenue
      }
    }
  }
}

If you try to implement this with a SPARQL subquery, you'll run into what I call the "distributed limit" problem. limit in the company subquery will apply globally, so even if you use a limit of 2*2 or even 50k, the first city (Sofia) will gobble up all companies, leaving none for the other city.

We at Onto believe that to implement this efficiently, you need the subquery to run in a loop for every row of the parent query.

Previous work

This is a common problem in databases. Eg see StackOverflow: Grouped LIMIT in PostgreSQL: show the first N rows for each group?, which gives the following solutions:

  1. <child-order> OVER (PARTITION BY <parent-id> ORDER BY <parent-order>) aka using Windowing functions
  2. <parent-query> JOIN LATERAL (<child-query>), see PostgreSQL (FROM, Lateral, SELECT, heap.io blog Dec 2014), SQL Server (cross apply)
  3. WITH Common Table Expression
  4. JOIN (COUNT...GROUP BY) WHERE <=5

A Correlated Subquery is like LEFT JOIN LATERAL ... ON true, see StackOverflow: Difference between Lateral and Subquery:

Somewhat related SPARQL issues were posted before: #47 (windowing), #9 (partitioning). However, LATERAL is usually faster than Windowing functions (depending on data and indexing).

Two of the leading GraphQL implementations for RDMBS use LATERAL: Hasura and Join Monster

Proposed solution

A key question is how to return the results, to ensure that child rows don't mess up the limit on parent rows.

(Edited Sep 2020 to interleave the Company rows with the City rows, which makes reconstructing the nested objects easier and enables potential streaming. Previously I had all city rows, then all company rows)

country city city_name population company company_name revenue COMMENT
geo:732800/ geo:727011/ Sofia 1152556
geo:727011/ co:123 Sofia Foo Co 987 Sofia company 1
geo:727011/ co:456 Sofia Bar Co 123 Sofia company 2
geo:732800/ geo:728193/ Plovdiv 340494
geo:728193/ co:789 Plovdiv Foo Co 987 Plovdiv company 1
geo:728193/ co:012 Plovdiv Bar Co 123 Plovdiv company 2

Assume this construct (other syntax suggestions are welcome!):

{foo} UNION LATERAL(?var) {bar}

Then we could use it to implement the query in question:

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  {select ?country ?city ?city_name ?population {
    bind(<http://sws.geonames.org/732770/> as ?country)
    ?country x:city ?city.
    ?city x:name ?city_name.
    ?city x:population ?population.
  } order by desc(?population) limit 2}
  UNION LATERAL(?city)
  {select ?city ?company ?company_name ?revenue {
    ?city x:company ?company.
    ?company x:name ?company_name.
    ?company x:revenue ?revenue
  } order by desc(?revenue) limit 2}
}

(It's more likely to have inverse links ?city x:country ?country and ?company x:city ?city but for simplicity we use straight links)

Considerations for backward compatibility

None?

afs commented 5 years ago

Related: "Query hints" #71

VladimirAlexiev commented 3 years ago

https://www.stardog.com/labs/blog/stored-query-service/ explains how to call stored queries in Stardog (similar to subqueries but providing better modularity). It also includes ideas how to use a similar mechanism for correlated (LATERAL) joins and cites this issue.

@kasei, @afs, @jeenbroekstra and @JervenBolleman:

@rubensworks Do you agree with Onto that this issue is quite crucial for implementing GraphQL fully and correctly?

namedgraph commented 3 years ago

Any example where this would be relevant that is not related to GraphQL?..

klinovp commented 3 years ago

An example how a correlated join is a useful addition to un-correlated ones? It's exactly the same situation as in SQL and there're lots of use cases, the classical one is something like "you have a database of products and a set of offers for each product, you want to get 5 highest price offers for each product".

GraphQL is one important use case here but this is a core SPARQL extension proposal which is perfectly valid outside of the GraphQL context.

JervenBolleman commented 3 years ago

I think this issue is interesting enough to do a SEP on. Reading through this LATERAL proposal, I suspect that it overlaps with the PREFER SEP (proposed for #13) and maybe that can be extended. The syntax/usecase is different, but execution wise I suspect the overlap is much higher. As in PREFER is a LATERAL query that must have an ORDER BY and fixed LIMIT of 1. Which I had not thought of, and considering I only had my first coffee I am willing to be corrected on ;)

rubensworks commented 3 years ago

Do you agree with Onto that this issue is quite crucial for implementing GraphQL fully and correctly?

I fully agree on the importance of this issue for SPARQL in general.

I wouldn't call it crucial for GraphQL per se, as different interpretations are possible on how to apply it to RDF data. But for the interpretation that Ontotext has picked, it definitely makes sense.

lisp commented 3 years ago

ignore sql machinations related to lateral and correlated subqueries, to concentrate just on the intent to produce from some dependent sub-expression some number of solutions for each solution of some dominant subexpression.

should not the general "sidewards information passing" mechanism suffice?

is this best accommodated with a union variant?

is it not more a matter of just declaring top-down rather than bottom-up reduction - analogous to the conflagration surrounding EXISTS? in which case an explicit SUBSELECT operator and/or something like an EACH modifier would suffice to declare intent and leave it to the processor to arrange for the s-i-p data/control flow in any and all join forms. as in (in simplified form)

select * {
  {select * where { ?country x:city ?city. } limit 2}
  {SUBSELECT * where { ?city x:company ?company. } limit 2}
}

or

select * {
  {select * where { ?country x:city ?city. } limit 2}
  {SELECT * where { ?city x:company ?company. } LIMIT (EACH 2) }
}
jaw111 commented 3 years ago

@lisp from the perspective of a SPARQL user (rather than implementer) it would seem more intuitive to have a somewhat bottom-up phrasing for this. Based on your example:

select *
where { ?city x:company ?company. }
limit 2
for each {
  select *
  where { ?country x:city ?city. }
  limit 2
}
lisp commented 3 years ago

given the implicit control flow which applies to the rest of the language, this would be

for each {
  select *
  where { ?country x:city ?city. }
  limit 2 }
{ select *
  where { ?city x:company ?company. }
  limit 2 }
VladimirAlexiev commented 2 years ago

https://medium.com/agnos-ai/sql-features-not-supported-in-sparql-1-1-eb34e3519077#7d71 AGNOS blog giving some examples, but in SQL using the "Northwind" MS SQL example database

VladimirAlexiev commented 2 years ago

@lisp We need a syntax showing the dependency (major-minor) explicitly (as in your https://github.com/w3c/sparql-12/issues/100#issuecomment-858171838).

A symmetric syntax like your https://github.com/w3c/sparql-12/issues/100#issuecomment-705485097 won't work because:

@jaw111 Please let's not spell it minor-major. (Do you write RTL by any chance? ;-)

Here's an example of deeper nesting, and each level has a dependency on the previous level:

VladimirAlexiev commented 2 years ago

@klinovp on twitter: We added correlated subqueries to SPARQL: https://docs.stardog.com/query-stardog/stored-query-service#correlated-subqueries It's similar to LATERAL in Postgres.

lisp commented 2 years ago

what are the circumstances that require the sqs:inputs mechanism?

jaw111 commented 2 years ago

@klinovp no RTL experience here.

Looking back to the original proposed solution, I struggle with the UNION clause, but could that simply be omitted and use a syntax for LATERAL something of a mash-up of VALUES and FILTER EXISTS to indicate the variable(s) for which bindings are to be passed sideways/laterally to the inner pattern/subquery.

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  {
    select ?country ?city ?city_name ?population {
      bind(<http://sws.geonames.org/732770/> as ?country)
      ?country x:city ?city.
      ?city x:name ?city_name.
      ?city x:population ?population.
    }
    order by desc(?population)
    limit 2
  }
  LATERAL ?city {
    select ?city ?company ?company_name ?revenue {
      ?city x:company ?company.
      ?company x:name ?company_name.
      ?company x:revenue ?revenue.
    }
    order by desc(?revenue)
    limit 2
  }
}

One could also pass in multiple variables and make the minor part optional. The outer part need not be a subquery:

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  bind(<http://sws.geonames.org/732770/> as ?country)
  ?country x:city ?city.
  ?city x:name ?city_name.
  ?city x:population ?population.
  optional {
    LATERAL ?city ?country {
      select ?city ?company ?company_name ?revenue {
        ?city x:company ?company.
        ?company x:name ?company_name.
        ?company x:revenue ?revenue.
        ?company x:headquarteredIn ?country .
      }
      order by desc(?revenue)
      limit 2
    }
  }
}

Similarly, what is in the lateral might just be a graph pattern rather than a subquery. This could be used to give hints to the query processor whether to pass in solutions it already has when evaluating an optional clause (nest loop join) or use bottom up approach.

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  bind(<http://sws.geonames.org/732770/> as ?country)
  ?country x:city ?city.
  ?city x:name ?city_name.
  ?city x:population ?population.
  optional {
    LATERAL ?city ?country {
      ?city x:company ?company.
      ?company x:name ?company_name.
      ?company x:revenue ?revenue.
      ?company x:headquarteredIn ?country .
    }
  }
}
VladimirAlexiev commented 2 years ago

@jaw111 thanks for your syntax proposals!

1) If the main is not a subquery, would we put order/limit at the bottom as for a normal query?

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  bind(<http://sws.geonames.org/732770/> as ?country)
  ?country x:city ?city.
  ?city x:name ?city_name.
  ?city x:population ?population.
  optional {
    LATERAL ?city ?country {
      select ?city ?company ?company_name ?revenue {
        ?city x:company ?company.
        ?company x:name ?company_name.
        ?company x:revenue ?revenue.
        ?company x:headquarteredIn ?country .
      }
      order by desc(?revenue)
      limit 2
    }
  }
} order by desc(?population) limit 2

2) We should always think of the case of multiple nestings:

     major
       minor
         sub-minor

So I'd favor syntaxes that reflect that nesting in the query structure.

lisp commented 2 years ago

the same question applies to the LATERAL proposal as to sqs:inputs: what are the circumstances which require the declaration for the corresponding variables?

VladimirAlexiev commented 2 years ago

@lisp minor should execute for each solution of major. If they are merely patterns (not subqueries) then I think you are right, it's not necessary to declare the lateral variables. If they are subqueries, they declare their exposed variables, and again it's not necessary to declare them in LATERAL.

Aklakan commented 1 year ago

For the just released Jena 4.6.0 I contributed the "service enhancer" plugin which has lateral joins as one of its features: https://jena.apache.org/documentation/query/service_enhancer.html It's the first release so consider it experimental.

Example: Fetch one English label for every entity:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?s ?l {
  VALUES ?s { wd:Q1686799 wd:Q54837 wd:Q54872 wd:Q54871 wd:Q108379795 } 
  SERVICE <loop:https://query.wikidata.org/sparql> { # SERVICE <loop:> {...} lateral joins with active dataset
    SELECT ?l {
      ?s rdfs:label ?l
      FILTER(langMatches(lang(?l), 'en'))
    } ORDER BY ?l LIMIT 1
  }
}

right, it's not necessary to declare the lateral variables. If they are subqueries, they declare their exposed variables, and again it's not necessary to declare them in LATERAL.

That was also my finding that its not necessary to declare the variables. In principle, lateral joins require different scoping rules as described in the linked document.

frensjan commented 1 year ago

Question: could someone comment on the semantics of multiple / consecutive lateral joins? Does lateral join n extend lateral join n-1? Or would this be based on shared variables?

An example use case could be to

Example query ```sparql select ?thread ?title ?postTitle ?postAuthor ?topAuthor { ?thread a :Thread ; :title ?title . lateral { select (?title as ?postTitle) (?author as ?postAuthor) { ?post :postedIn ?thread ; :title ?title ; :author ?author ; :postedOn ?postedOn . } order by desc(?postedOn) limit 5 } lateral { select (?author as ?topAuthor) { ?post :postedIn ?thread ; :author ?author . } group by ?author order by desc(count(*)) limit 3 } } ```

The key here is that the two 'lateral clauses' are meant to be unrelated to each other and only be related to the top-level query.

Tpt commented 1 year ago

@frensjan If I understand @afs proposal properly, LATERAL works just like OPTIONAL i.e. variables bound inside of the first LATERAL clause will be considered when evaluating the second LATERAL clause. To use SPARQL SSE notation, X LATERAL { Y } LATERAL { Z } is parsed to (lateral (lateral X Y) Z)

afs commented 1 year ago

@frensjan - interesting example.

In the current proposal, it is as @Tpt describes. The second lateral takes the results from the first as input. With no shared variables, that is in effect a cross-product.

What shape of result set are you trying to produce? there can be multiple lines for the same?thread, ?title pairs in each lateral.

The description feels like it is a an "details for A, details for B" (A and B themselves not connected) which suggests a UNION somewhere - that does not get information from each branch on the same row. But if there are 5 rows from the first lateral, for one (thread,title) what is the desired outcome for the 3 rows from the second for that (thread,title)? There isn't a nested table feature in SPARQL - GROUP_CONCAT can be used but it isn't ideal.

frensjan commented 1 year ago

Thanks for the responses @afs and @Tpt; understood and as I expected. For as long as the size of the 'legs' is small, the costs of such a cross-product could be acceptable.

The use case would be to show related information in an entity listing / table or in an entity detail view. I.e. the name and number of posts for the OP and the number of posts per thread of a listing of threads.

Using a pseudo GraphQL query this could be something like:

{
    threads(limit:10) {
        title
        creationDate
        postCount       # lateral join to count(?post) over ?post :parent ?thread
        creator {       # lateral join to get ?creator :created ?thread
            name        # fetch ?creator :name ?name
            postCount   # count(?post) over ?creator :created ?post
        }
    }
}
A more elaborate example for vulnerabilities ```graphql { cve(id:"https://nvd.nist.gov/vuln/detail/CVE-2022-42003") { title description ... link { url tags { label } } weaknesses { id name source } affected { cpe { pattern version { from to } instances(limit:10) { part vendor product version update edition ... } } } } } ```

Use cases where information is tracked over time it would be useful to show the latest value for a number of attributes; e.g. based on a timestamp associated with the graph such triple occurs in. If the application can assume these attributes change values atomically in this example in the same named graph, this could be solved by one lateral join to the named graph with the last changes. But reality can be stubborn in my experience.

GROUP_CONCAT would only work for primitive values and not for more elaborate (sub-)solution. Ideally, this could be combined with the FRAMED ideas that are also discussed for SPARQL 1.2.

Note that while I'm using GraphQL here as examples, it is intended as pseudo code / a means to convey intent. I think these use-cases hold regardless of whether GraphQL is part of the technology stack.

lisp commented 1 year ago

with each re-reading, i am a more convinced that the operator at issue should be interpreted in combination with the same solution state as any other operator and affect just the binding environment. it should have no special effect on the algebra. this could simplify both its implementation and its comprehension.

any of the following should be possible

{ ... } optional {  lateral x y z { ... } }
{ ... } lateral x y z { { ...} optional { ... } }
{ ... } union { lateral x y z { } }

the combinations are those of the existing sparql operators. in the above, the second denotes a join, as that is sparql's implicit operator.
in a manner similar to how "select" describes just projection, "lateral" (or whatever it is called) describes just an initial environment.

JervenBolleman commented 1 year ago

@lisp that is very interesting, would you mind proposing this as a pull request on 0006 or even as an alternative SEP if you feel that would be easier?

afs commented 1 year ago

@lisp Some examples would help and I may have understood your suggestion.

Lateral join is the name used in SQL but it isn't really a join as in "evaluate left and right to a table, then combine the tables". I find the name counter-intuitive, but the terminology is already out there.

Is that use of "lateral" is the substitute operator?

https://github.com/w3c/sparql-12/blob/main/SEP/SEP-0006/sep-0006.md#evaluation-1

            pattern2 = inject(pattern, μ1)
            Ω1 = eval(D(G), pattern2)

What would happen to the variable in the RHS that are not in the "lateral xyz"? Is the substitution deep or shallow in "lateral x y z { { ...} optional { ... } }"?

afs commented 1 year ago

example1.zip

Here are three queries that might be used in the GraphQL case (if it's one query for the whole GraphQL request) on the data

:s1 a :T .
:s1 :p "s1-p-1" , "s1-p-2" , "s1-p-3" .
:s1 :q "s1-q-1" , "s1-q-2" , "s1-q-3" .

:s2 a :T .
:s2 :p "s2-p-1" , "s2-p-2" , "s2-p-3" .
:s2 :q "s2-q-1" , "s2-q-2" ,"s2-q-3" .

The first is two laterals, each setting an independent variable p and q, and run one after another. The effect is repetition. For a value of p, there two rows of q.

SELECT * {
    ?s a :T .
    LATERAL { SELECT * { ?s :p ?p } ORDER BY ?p LIMIT 2 }
    LATERAL { SELECT * { ?s :q ?q } ORDER BY ?q LIMIT 2 }
}
-----------------------------
| s   | p        | q        |
=============================
| :s2 | "s2-p-1" | "s2-q-1" |
| :s2 | "s2-p-1" | "s2-q-2" |
| :s2 | "s2-p-2" | "s2-q-1" |
| :s2 | "s2-p-2" | "s2-q-2" |
| :s1 | "s1-p-1" | "s1-q-1" |
| :s1 | "s1-p-1" | "s1-q-2" |
| :s1 | "s1-p-2" | "s1-q-1" |
| :s1 | "s1-p-2" | "s1-q-2" |
-----------------------------

The second has a UNION

SELECT * {
    ?s a :T .
    { LATERAL { SELECT * { ?s :p ?p } ORDER BY ?p LIMIT 2 } }
    UNION 
    { LATERAL { SELECT * { ?s :q ?q } ORDER BY ?q LIMIT 2 } }
}
-----------------------------
| s   | p        | q        |
=============================
| :s2 | "s2-p-1" |          |
| :s2 | "s2-p-2" |          |
| :s2 |          | "s2-q-1" |
| :s2 |          | "s2-q-2" |
| :s1 | "s1-p-1" |          |
| :s1 | "s1-p-2" |          |
| :s1 |          | "s1-q-1" |
| :s1 |          | "s1-q-2" |
-----------------------------

The third is similar to second except it is the same variable being set in each LATERAL and the two LATERAL blocks put in a marker value.

SELECT * {
    ?s a :T .
    { LATERAL { SELECT * { ?s :p ?o } ORDER BY ?o LIMIT 2 } BIND("A" AS ?x) }
    UNION 
    { LATERAL { SELECT * { ?s :q ?o } ORDER BY ?o LIMIT 2 } BIND("B" AS ?x) }
}
------------------------
| s   | o        | x   |
========================
| :s2 | "s2-p-1" | "A" |
| :s2 | "s2-p-2" | "A" |
| :s2 | "s2-q-1" | "B" |
| :s2 | "s2-q-2" | "B" |
| :s1 | "s1-p-1" | "A" |
| :s1 | "s1-p-2" | "A" |
| :s1 | "s1-q-1" | "B" |
| :s1 | "s1-q-2" | "B" |
------------------------
frensjan commented 1 year ago

I get the intuition behind the use of UNION here. Would this be valid syntax given incorporation of SEP 0006? Or this to be taken as a proposal to extend SEP 0006?

However, I can't grasp how this would translate to the SPARQL algebra. I get the impression - from SPARQL 1.1. spec, the examples in the SPARQL spec on combinations of UNION and OPTIONAL as analogy and the changes from SEP 0006 - that the second lateral would be joined with the empty graph pattern instead of with the ?s a :T. I.e.:

(union
    (lateral
        (bgp (triple ?s a :T))
        ( ... subselect ))

    (lateral
        (Z)                         # the empty BGP
        ( ... subselect )))

I don't see how the ?s a :T pattern would be distributed over the lateral clauses.

The following expression doesn't have this issue, but this would yield really large expressions/queries if the queries to be laterally joined to would be of any real-world interest, this isn't very user friendly, the redundancy in the query would be an obvious point for errors, and it is perhaps harder for query optimisers to hoist the repeated GroupGraphPattern out of the union.

(union
    (lateral
        (bgp (triple ?s a :T))
        ( ... subselect ))

    (lateral
        (bgp (triple ?s a :T))
        ( ... subselect )))

From an algebra point of view I guess it would be ideal if to define lateral as an expression with one or more arguments, where arguments 2 and beyond would be evaluated for solutions from the evaluation of the first argument. E.g. to construct expressions such as:

(lateral
    (bgp (triple ?s a :T))  # basic graph pattern to laterally join to
    ( ... subselect 1 )     # laterally joined with arg 1
    ( ... subselect 2 ))    # laterally joined with arg 1

However, having LATERAL as an infix keyword, I don't see a straightforward grammar for this. Or is that your intent for the union of two lateral graph patterns to result in just that?

afs commented 1 year ago

Hi @frensjan

Yes - you're correct. There must be a bug at my end.

The second query with the union should be

SELECT * {
    ?s a :T .
    LATERAL {
      { SELECT * { ?s :p ?p } ORDER BY ?p LIMIT 2 }
      UNION
      { SELECT * { ?s :q ?q } ORDER BY ?q LIMIT 2 }
    }
}

The algebra is:

  (lateral 
    (bgp (triple ?s rdf:type :T))    # LHS of LATERAL
    (union                           # RHS of LATERAL
      (slice _ 2
        (order (?p) (bgp (triple ?s :p ?p))))
      (slice _ 2
        (order (?q) (bgp (triple ?s :q ?q))))))

to give the shown results for query 2.

It does avoid cross product growth.

It would need another ORDER BY ?p ?q ?s on the overall query to group the ?p and ?q together.

-----------------------------
| s   | p        | q        |
=============================
| :s1 |          | "s1-q-1" |
| :s1 |          | "s1-q-2" |
| :s2 |          | "s2-q-1" |
| :s2 |          | "s2-q-2" |
| :s1 | "s1-p-1" |          |
| :s1 | "s1-p-2" |          |
| :s2 | "s2-p-1" |          |
| :s2 | "s2-p-2" |          |
-----------------------------

example2.zip has fixed the version.

Good example to add to a test suite.

lisp commented 1 year ago

What would happen to the variable in the RHS that are not in the "lateral xyz"? Is the substitution deep or shallow in "lateral x y z { { ...} optional { ... } }"?

"lateral" is a situation similar to "exists".

with this approach - contrary to my earlier belief, the "lateral" form should declare the environment's variables. these provide initial bindings for variables in any bgp forms and any otherwise free variables in its scope/extent.

if one feels that the tests in

https://github.com/apache/jena/tree/main/jena-arq/testing/ARQ/Lateral

are adequate, i will go ahead and implement "lateral" based on the same mechanism which we use for "exists" and report results for those tests.

afs commented 1 year ago

"dynamic binding mechanism" and "injection" can be viewed as implementations of each other.

"dynamic binding mechanism" could be an implementation of SEP-0007 - lookup a variable name in the evaluated LHS. The difference with SEP-0007 is only that SEP-0007 formalizes the outcome as algebra by the injection step that puts the value immediately next to the variable use.

lisp commented 1 year ago

"dynamic binding mechanism" and "injection" can be viewed as implementations of each other.

this is not true.
the former concerns environments while the latter - even in the later formulations such as sep0007, concerns forms.
its problems are inherent in that approach.
as discussed, at length, during the correspondence months ago regarding exists, a mechanism based on environments does not introduce the issues which sep0007 endeavours to resolve.

afs commented 1 year ago

Then I don't understand what you mean by "dynamic binding mechanism" if it is not as I described. "injection" is applied per row, not statically during algebra generation, so in that sense it is dynamic.

Could you give an example of the difference please?

lisp commented 1 year ago

one example is the way blank nodes are manipulated. one aspect of the "injection" discussion concerns how to manage blank nodes which are introduced into a bgp. where the operation occurs in the expression domain, one has to realize that that blank nodes are not to be treated as variables. where the operation is performed by binding variables in an environment, the issues can not arise as the domain is that of values, not lexical expressions.

afs commented 1 year ago

Injection in SEP-0007 leaves variables in-place. It does not introduce a blank node into a BGP. The original substitute did do so and that was a problem.

Per row: μ, binding ?x = <uri> and BGP (?x ?p ?o) becomes (join (?x ?p ?o) Table(μ)).

The restriction on ?x is achieved by an algebra join involving values. This is done for any occurrence of the 3 places variables can be matched in SPARQL (it excludes AS ?x).

join isn't the only possible choice. It's advantage is that it copes with an empty solution mapping to become the join-identity and it uses an existing operator. A new (inject μ bgp) would be possible.

Restriction after binding does work for the three places needed. It does not work for a general pattern.

The same concern arises for MINUS where replacement semantics can change the domain and can cause unexpected changes.

SEP-0007 would benefit from some examples.

lisp commented 1 year ago

the problem arises because, when i read

Definition: Values Insertion
For solution mapping μ, define Table(μ) to be the multiset formed from μ.
    Table(μ) = { μ }
    Card[μ] = 1
Define the Values Insertion function Replace(X, μ) to replace each occurrence Y of a
  Basic Graph Pattern, Property Path Expression,  Graph(Var, pattern) in X
  with join(Y, Table(μ)).

in sep-0007, it is not evident how and when this is to be accomplished. the mechanism suggested by text there regarding addressing issue 3, "the possible solutions are restricted by the current row", is underdefined.
the text, above, gives a better idea.

afs commented 1 year ago

@lisp - thank you for highlighting what is not clear. I'll rework that definition text and add examples.

lisp commented 1 year ago

if one feels that the tests in

https://github.com/apache/jena/tree/main/jena-arq/testing/ARQ/Lateral

are adequate, i will go ahead and implement "lateral" based on the same mechanism which we use for "exists" and report results for those tests.

before i discuss the results, there is a question about the tests. as i read them, i would expect that the dependent clauses in the lateral-4 and lateral-5 tests would use the :label predicate, rather than the :p predicate which appears in the query texts.

jaw111 commented 1 year ago

@lisp I agree with your reading based on the variable names, but the presented .srj results are consistent with the <http://example/p> predicate used in the queries and the dataset.

Seems like something for @afs to give a comment on.

afs commented 1 year ago

@lisp Thanks.

The use of ?label is a bit of a distraction - renamed.

lisp commented 1 year ago

i looked at (some of) the literature on this issue, thought about how to integrate an approach into dydra, and completed an initial implementation.
the current mechanism distinguishes those cases which must employ dynamic environments from those for which an alternative aggregation operator suffices.

the discussion and examples are available for review in a notebook.

VladimirAlexiev commented 1 year ago
lisp commented 1 year ago

i understand the connection to #161, but, if you elect to introduce an operator, is it still necessary? if so, please describe the cases which require it.

frensjan commented 6 months ago

With regards to LATERAL enhancement and its companion enhancement for inject, how should the following query be interpreted?

SELECT * {
    ?s :value ?v
    LATERAL {
        SELECT DISTINCT ?o {
            ?o :value ?w .
            FILTER(?w > ?v)
        }
    }
}

The intent of the query is to select the ?os that have a :value ?w larger than the :value ?v of ?s.

However, based on SEP-0007 it would seem that ?v isn't bound in the evaluation of the sub-select because it isn't in its projection. This can obviously be fixed:

SELECT * {
    ?s :value ?v
    LATERAL {
        SELECT DISTINCT ?v ?o {
            ?o :value ?w .
            FILTER(?w > ?v)
        }
    }
}

However, this is a bit awkward, as ?v can never have the output role of the sub-select to use the wording from Correlation and Substitution in SPARQL.

afs commented 6 months ago

@frensjan - nice - this brings out the difference between LATERAL and parameterised queries by variable substitution.

There are two different ?v variable that never meet (i.e. get joined). As a query, the inner ?v can be replaced by any variable not used elsewhere and the query results are the same. (Jena renames the inner variable so it does not have to have code for scoping issues in other query execution code.)

This is a difference between LATERAL and parameterised queries by variable substitution. I believe they are both are valid use cases - the same machinery applies but there are slight differences in how it is used, here how to treat scope-hidden variables.

It doesn't even have to involve LATERAL:

As a parameterized query:

    SELECT ?o ?w {
         ?o :value ?w .
         FILTER(?w > ?v)
     }

makes sense. As a query, it rejects everything because the FILTER is error.

https://github.com/w3c/sparql-dev/issues/57 and others.

lisp commented 6 months ago

the change to scoping rules is one of the reasons for LATERAL. see https://observablehq.com/d/dba772c1bf5ed10f

frensjan commented 6 months ago

Thanks for the reactions. So would you @afs and @lisp say my second query should work as intended?

If so, don't you agree that it is awkward to select a variable ?v (which is typically used to expressed the variable is in the output role of the (sub-)select) to convey that it is in the input role?


@afs, what did you mean by

It doesn't even have to involve LATERAL:

The query that I gave as an example translates to something like:

SELECT * {
    ?s :value ?v
    ?o :value ?w .
    FILTER(?w > ?v)
}

But that's not really the issue I wanted to address.

A more involved example:

PREFIX : <http://example.org>

SELECT * WHERE {

    # messages and when they were created
    ?message1 a :Message .
    ?message1 :created ?referenceTimestamp .

    # for each message, the last ten messages that were created before it
    LATERAL {
        SELECT ?referenceTimestamp ?message2 {
            ?message2 a :Message .
            ?message2 :created ?created .
            FILTER( ?created < ?referenceTimestamp)
        }
        ORDER BY DESC( ?created )
        LIMIT 10
    }

}

Is the consensus that (given SEP-0006 and 0007) this would work as intended (as per the comments)?

The algebra according to Arq (click to expand) ``` (lateral (bgp (triple ?message1 ) (triple ?message1 ?referenceTimestamp) ) (project (?referenceTimestamp ?message2) (top (10 (desc ?/created)) (filter (< ?/created ?referenceTimestamp) (bgp (triple ?message2 ) (triple ?message2 ?/created) ))))) ```
I assume that the query doesn't work as intended if ?referenceTimestamp isn't selected in the sub-select. Although I would like this to work as I think this would be a lot easier to explain to my users! (click to expand) ```sparql PREFIX : SELECT * WHERE { # messages and when they were created ?message1 a :Message . ?message1 :created ?referenceTimestamp . # for each message, the last ten messages that were created before it LATERAL { SELECT ?message2 { ?message2 a :Message . ?message2 :created ?created . FILTER( ?created < ?referenceTimestamp) } ORDER BY DESC( ?created ) LIMIT 10 } } ```
afs commented 6 months ago

my second query should work as intended

Yes. ?v has to be in-scope at the top of the LHS (left hand side) of LATERAL.

afs commented 6 months ago

# for each message, the last ten messages that were created before it

Yes! That's a very important application. The "up to 10 of" for each item is very hard to do otherwise.

?/created

FWIW That's a renamed variable.

frensjan commented 6 months ago

We're fleshing out the implementation of LATERAL for RDF4J; another question to make sure we get the intended semantics right:

In the Oxigraph tests there are two queries:

PREFIX ex: <http://example.org/>

SELECT ?s ?o WHERE {
    ?s a ex:T.
    OPTIONAL { LATERAL {SELECT ?s ?o WHERE { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}

and

PREFIX ex: <http://example.org/>

SELECT ?s ?o WHERE {
    ?s a ex:T.
    LATERAL { OPTIONAL {SELECT ?s ?o WHERE { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}

From the expected results, it seems that in the first query ?s is considered out of scope for evaluation for each ?s a ex:T. while in the second query it is in scope.

For me it's surprising that the semantics of these two queries differ. I consider both OPTIONAL and LATERAL to have 'nested loop' like semantics; i.e., for each solution / binding set from the left-hand side, variable bindings are 'injected into' the right-hand side graph pattern.

I would consider { P } OPTIONAL { { } LATERAL Q } to be equivalent to { P } LATERAL { { } OPTIONAL Q }.


I did noticed a subtle difference in the algebra for Jena:

  (project (?s ?o)
    (conditional
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (lateral
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))

vs

  (project (?s ?o)
    (lateral
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (leftjoin
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))

Note the conditional vs leftjoin operators.

I haven't tested the actual behaviour of Jena yet. But as Jena uses variable renaming, I would expect ?s to be renamed to ?/s if it were out of scope.

Tpt commented 6 months ago

We're fleshing out the implementation of LATERAL for RDF4J

Amazing!

For me it's surprising that the semantics of these two queries differ. I consider both OPTIONAL and LATERAL to have 'nested loop' like semantics; i.e., for each solution / binding set from the left-hand side, variable bindings are 'injected into' the right-hand side graph pattern.

This is the trap. OPTIONAL do not have "nested loop" semantic but a "left join" semantic. So, in A OPTIONAL { B }, A and B are supposed to be evaluated independently then joined. See this definition in the SPARQL spec. For example the query

SELECT * WHERE { BIND(1 AS ?a } OPTIONAL { BIND(2 AS ?b) FILTER(BOUND(?a)) } }

can be rewritten

SELECT * WHERE { BIND(1 AS ?a } OPTIONAL { BIND(2 AS ?b) FILTER(false) } }

because ?a is not bound inside of the OPTIONAL and so will return only a=1 wheras

SELECT * WHERE { BIND(1 AS ?a } LATERAL { BIND(2 AS ?b) FILTER(BOUND(?a)) } }

can be rewritten

SELECT * WHERE { BIND(1 AS ?a } LATERAL { BIND(2 AS ?b) FILTER(true) } }

because ?a is now bound inside of the LATERAL and so will return a=1 b=2.

This is why rewriting OPTIONAL into nested loops is hard because one needs to take care of not changing the results.