Addressing SPARQL EXISTS errata

Recap

TPAC 2023 presentation

Issues: sparql-query/issues for EXISTS

After TPAC 2023, an email was sent to interested parties.

Proposals

1:: Improved substitution SEP-0007/Substitution

2:: SemiJoin/Antijoin https://w3c.github.io/sparql-exists/docs/sparql-exists.html#proposal-a

Proposal 1

Proposal 1 is based on errata for the "Substitution" operation.

Full details including relationship to errata: SEP-0007/Substitution.

Proposal 2

Proposal 2 is SemiJoin/AntiJoin.

SPARQL already has MINUS which is an antijoin with a special condition for the case of disjoint domains (a decision of the SPARQL 1.1 working group).

A way forward.

A compromise way forward:

Replace "Substitute" with the errata-derived fix SEP-0007/Substitution
Plan for adding LATERAL, SEMIJOIN and ANTIJOIN (both pure forms) to the SPARQL language. This may have to be additional features added in the "new features" phase due to timing.

This was discussed during the rdf-star meeting on 26 September 2024.

View the transcript

Addressing SPARQL EXISTS errata 4

ora: Are there people fine with the current syntax?

ora: In any case, chairs will discuss this, let's move on

AndyS: [about SPARQL EXISTS] There are two proposals

AndyS: 1. substitution based on various existing errata

AndyS: 2. an other one based on ANTIJOIN. We already have MINUS. Except the behavior with disjoin domain. But outside of it it's ANTIJOIN

AndyS: On an other note, there are other things that might go to SPARQL like LATERAL that can be based on substitution. And pure form of anti join and semi join

AndyS: It's a possibility to move these additions (LATERAL, anti join...) to sparql dev

pchampin: we would add more subtly differences between operators like FILTER NOT EXISTS vs MINUS

pchampin: Your point of having multiple ways might create problems

ora: SPARQL spec spends a bit of time presenting this difference

AndyS: It was quite contentious in SPARQL 1.1

<pchampin> I'm more than happy to let the editors decide on that

AndyS: I am not aware of any outgoing opinion, I think it ends up to a choice on which way to go

tl: is it related to triple terms in any way of is it a SPARQL errata

AndyS: it has nothing to do with triple terms

tl: what is the criteria of SPARQL errata to discuss now?

tl: it's a central issue, is that the argument?

pfps: There are a bunch of problems with SPARQL, the ones with EXIST are the biggies

pfps: They end up splitting the SPARQL implementation space

pfps: The decision that has to be made is to move SPARQL EXIST toward a more database-like implementation and keep it more consistent with the existing

AndyS: The current implementation is present in SQL with correlated subqueries

pfps: if you use the semi/anti join interepretation of EXISTS you change SPARQL more than the other option

pfps: In the end people who will see and understand the differences are very few

ora: I would like to know preferences

AndyS: My preference is for substitution and applying errata (option 1)

pfps: I don't have much of a horse in this race

pfps: Idealy I would love to get more SPARQL developers on board

ora: we could talk outside of the group

ktk: I reached out to stardog but not got an input

gtw: I am not sure much value to reach out to more developers. sparql-dev has been opened for a long time

<pchampin> Tpt: I have a signicant preference for option 1; option 2 is basically equivalent to MINUS

pfps: One way to check the issue would be to pull some tests

<pfps> which PR?

<gkellogg> w3c/rdf-tests#42

<gb> Issue 42 tests to document current definition of EXISTS in SPARQL (by pfps) [SPARQL]

<gkellogg> w3c/rdf-tests#43

<gb> CLOSED Pull Request 43 Add tests to document current definition of EXISTS (by pfps)

ora: Whatever solutions we pick, someone will ask why we pick it

AndyS: picking sustitution breaks the least queries

ora: That seems to me a as good reason as any, let's make a decision

tl: I would like to ask james about it

ora: Let's vote on it next Thursday

ora: Let's do it

Here's my 2c as a query engine tech lead at Stardog: I prefer fixing the substitution semantics, making it a part of the spec, and keeping EXISTS as a form of correlated semi-join based on substitution. Two main reasons:

fixing substitution is important beyond EXISTS. It's naturally used, for example, for parameterised SPARQL queries (not standardised in 1.1). It has uses in SHACL (SPARQL constraints). It's important that its semantics is clear and understandable. However, it tends to be much, much trickier than most SPARQL users tend to think and there're multiple, non-equivalent ways how it can be defined.
if [NOT] EXISTS becomes an uncorrelated semi[anti]-join with the standard bottom-up evaluation semantics, I don't see how we can still support cases like
```
SELECT * {
?s :p ?o
FILTER NOT EXISTS { [] :q ?t . FILTER (?t > ?o)  }
}
```
I have seen lots of queries like this over the years where the bottom-up eval would produce zero results (I know this because Stardog, like many relational query engines, has a de-correlation optimiser and that has to carefully analyse for this sort of cases).

Eventually, I would really like to see LATERAL as a part of a future SPARQL standard and that also will require substitution (Stardog implements it using the SERVICE syntax). I think it's important that the next release of the spec actually uses the term "correlated" in the text when it articulates the differences between EXISTS and MINUS, so that introduction of general correlated subqueries through LATERAL then looks like a natural next step in the process.

@domel sent us this "questionnaire" by email. I'll paste it below, because I find it a useful roadmap to this topic.

The RDF-star Working Group is currently addressing issues related to updating the semantics of SPARQL EXISTS, specifically regarding:

Certain uses of EXISTS being undefined during evaluation,
Substitution occurring where definitions apply only to variables,
Blank nodes being substituted into BGPs and acting as variables,
Substitution potentially flipping MINUS into its disjoint-domain case,
Substitution impacting disconnected variables.

Would you be able to provide your insights on these matters

@domel sent us this "questionnaire" by email. I'll paste it below, because I find it a useful roadmap to this topic.

The RDF-star Working Group is currently addressing issues related to updating the semantics of SPARQL EXISTS, specifically regarding:

Certain uses of EXISTS being undefined during evaluation,

Substitution occurring where definitions apply only to variables,

Blank nodes being substituted into BGPs and acting as variables,

Substitution potentially flipping MINUS into its disjoint-domain case,

Substitution impacting disconnected variables.

Would you be able to provide your insights on these matters

This section of SEP-0007 expands on these points: https://github.com/w3c/sparql-dev/blob/main/SEP/SEP-0007/sep-0007.md#identified-issues

Dear all, Hannah (@hannahbast) and Johannes (@joka921) here from the University of Freiburg and developers of QLever. Fascinating and important question. There is a lot to untangle here, so first

TLDR: In border cases regarding the semantics of a complex query, we recommend to follow the inner logic of the query language and not what a user thinks the query should do. Following that, we recommend to define EXISTS via a standard semijoin. That being said, the SPARQL standard currently misses some important features, but that is a different topic and should not be conflated with the topic under discussion here.

Here is the long version:

There are two principle ways to determine what the semantics of a certain query should be. One is based on the internal logic of the query language. The other is based on what users think a certain query does or should do. The two should typically coincide, otherwise something would be very wrong with the design of the query language. But the definition of anything non-trivial is bound to have some unintuitive border bases.
The definition of the semantics of EXISTS has several such border cases, notably when the right side uses filters on variables that only occur on the left side. It is aggravated by the fact that most users don't really think about EXISTS as the functional form it is, but about its usage in combination with a filter, as in FILTER NOT EXISTS, and then wondering why on earth SPARQL has both MINUS and FILTER NOT EXISTS, which are very similar but not equal. But that is really just a side-effect of EXISTS. So better think about EXISTS in isolation, like in this simple example query: three people with their name and whether they have a birth date in Wikidata or not.
Most SPARQL features translate to some form of join operation when processing the query. For example, two graph patterns with joint variables can be processed by a standard join operation: the result is all pairs of solutions from the left and from the right, where the joint variables have the same value. One graph pattern MINUS another can be processed by an antijoin: the result is only those solutions from the left, which don't match one from the right. Similarly, OPTIONAL translates to a left outer join.
All these join operations have an important property: the left and right side can be evaluated independently, and the computational complexity is linear in the size of the input plus the size of the output (modulo some details related to UNDEF values omitted here).
Following that logic, it would make a lot of sense to define EXISTS via a semijoin: That is, each solution on the left evaluates to true if and only if it matches a solution on the right. This includes the special case, where there are no variables in common: then all solutions on the right match, just like the would in a normal join, and each solution on the left evaluates to true (and this also explains the difference between FILTER NOT EXISTS and MINUS).
Unfortunately, the standard didn't define EXISTS that way, but chose to define it via substitution. Substitution is conceptually different from a join-like operation because it is a quadratic algorithm: for each solution on the left, perform the corresponding substitution on the right and evaluate it. It is true that one can reduce this to ordinary joins in many cases (a practice called de-correlation or unnesting, here is a good paper about it). But the current definition does go against the inner logic of all the other SPARQL constructs, as sketched above. And you can already tell that something is fishy by looking at the standard, where "substitute" is defined solely for the purpose of defining EXISTS and used nowhere else: https://www.w3.org/TR/sparql11-query/#sparqlAlgebraEval . The many issues with EXISTS are a direct consequence of this special treatment.
All that being said, there are important features missing from SPARQL. One notable example is queries like the highest peak of each country, where you have a GROUP BY (for each country, all the peaks), an aggregation (find the maximum height per country), but then want the value from another column corresponding to the result of the aggregation (the peak which has the maximum height). SPARQL currently forces you to write a very unnatural query for this, where parts of the query have to be repeated in a subquery. Another example are parametrized queries like all streets in region X (click on "Map view" to see the result on a map), where you want to define one or more parameters at the top, which are then used in the rest of the query. This works for simple queries, but with current SPARQL breaks as soon as you have subqueries or nested group graph patterns using the query parameters, which is confusing for many users.
These missing features are related to the "substitute" semantics and to nested queries and should definitely be addressed and we do have recommendations for that. But we strongly advise against conflating this with the current discussion, which is about EXISTS and not about all kinds of missing features. When implementing new features that depart from the join logic inherent to the current version of the SPARQL standard, we would advise to use new keywords (LATERAL being an example) in order to make the departure from the standard logic explicit.

This was discussed in today's meeting

w3c / sparql-query

Addressing SPARQL EXISTS errata #156