SPARQL binding paradoxes

VladimirAlexiev commented 8 months ago

@frensjan @afs @TallTed @lisp @JervenBolleman (I don't even know how to define this issue: feel free to edit the title!)

@frensjan in https://github.com/w3c/sparql-dev/issues/100#issuecomment-1911693306 started a discussion on which bindings are passed between which SPARQL clauses and formulated some nice queries to exercise these questions.

I posted similar things in https://github.com/w3c/sparql-dev/issues/103 (but they are not yet reflected below).

Different SPARQL processors return different results on such basic queries :-(

Jena on https://sparql.org/sparql
Blazegraph on Wikidata endpoint
Virtuoso with Strict checking of void variables off (&signal_unconnected=on) on dbpedia.org endpoint
GraphDB up to 10.6.1 (RDF4J 4.3.9)

I don't know SPARQL algebra very well, but I guess it all comes from the bottom-up execution semantics of SPARQL.

"Canonical" below is my understanding (or what someone else said) should happen according to spec
"Intuition" is what I think should happen, or what is most "useful". Coincidentally, rdf4j's behavior matches that intuition. I do have a vested interest, so I won't claim this is the "right" behavior.

Now: I have no illusions that the group will change SPARQL semantics to fit my intuitions. But maybe some option/flag/"mode" can be added to change the treatment of bindings. At the least, this issue will serve as a big warning for the unwary.

Brackets

Intuition: adding brackets should not change the results
Canonical: brackets make a new sub-clause that is executed independently of the others (bottom-up), so in this case should return no results
```
PREFIX : <http://example.org/>
SELECT * WHERE {
VALUES ?x { :x }
{
    FILTER( BOUND(?x) )
    BIND( :y as ?y )
    BIND( ?x as ?z )
}
}
```
Jena, Blazegraph, Virtuoso: no results
rdf4j: one result :x :y :z
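The "canonical" reading can be sketched with a toy model of bottom-up evaluation. This is only an illustration in Python, not a real engine: the helper names (`compatible`, `join`, `extend`, `filter_`) are made up for this sketch, a solution is modelled as a dict, and an unbound variable is modelled as a `KeyError`.

```python
# Toy model of SPARQL bottom-up evaluation; not a real engine.
# A solution is a dict {variable: value}; a pattern evaluates to a list of solutions.

def compatible(a, b):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(a[v] == b[v] for v in a.keys() & b.keys())

def join(left, right):
    return [{**a, **b} for a in left for b in right if compatible(a, b)]

def extend(solutions, var, expr):
    """BIND: add var when expr evaluates; leave it unbound when expr errors."""
    out = []
    for mu in solutions:
        try:
            out.append({**mu, var: expr(mu)})
        except KeyError:  # unbound variable, i.e. an expression error
            out.append(dict(mu))
    return out

def filter_(solutions, cond):
    return [mu for mu in solutions if cond(mu)]

outer = [{"x": ":x"}]                            # VALUES ?x { :x }

# The bracketed group is evaluated on its own, starting from one empty solution.
inner = extend([{}], "y", lambda mu: ":y")       # BIND( :y AS ?y )
inner = extend(inner, "z", lambda mu: mu["x"])   # BIND( ?x AS ?z ), ?x unbound here
inner = filter_(inner, lambda mu: "x" in mu)     # FILTER( BOUND(?x) )

brackets_result = join(outer, inner)
print(brackets_result)  # []
```

Under these assumptions the inner group is already empty before it is joined with the VALUES row, which is why the bottom-up reading gives no results.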

Optional

Intuition: OPTIONAL can only "enlarge" the result, it should not remove rows nor bindings

Canonical: in LHS optional {RHS}, LHS bindings should not be passed to RHS

```
PREFIX : <http://example.org/>
SELECT * WHERE {
VALUES ?x { :x }
OPTIONAL {
    FILTER( BOUND(?x) )
    BIND( :y as ?y )
    BIND( ?x as ?z )
}
}
```

Blazegraph, Virtuoso: one result :x
Jena: one result :x :y but why is y bound?
Rdf4j: one result :x :y :z (reported by @frensjan as https://github.com/eclipse-rdf4j/rdf4j/issues/4882)

Union

Intuition: values "before" (outside) Union clauses are used in the union. In particular, this is crucial for fetching multivalued props of a subject.

Canonical: no outside bindings are used in Union clauses

```
PREFIX : <http://example.org/>
SELECT * WHERE {
VALUES ?x { :x }
{} union {
    FILTER( BOUND(?x) )
    BIND( :y as ?y )
    BIND( ?x as ?z )
}
}
```

Jena, Blazegraph, Virtuoso: one result :x
Rdf4j: two results :x and :x :y :z
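A rough Python sketch of the UNION case, under the bottom-up reading, is given here. It is a toy model rather than an engine; `second_branch` is a hypothetical helper that hard-codes the three clauses of the second UNION branch, and `repeated` models a rewrite in which `VALUES ?x { :x }` is repeated inside that branch.

```python
# Toy model of UNION under bottom-up evaluation; not a real engine.
# A solution is a dict {variable: value}.

def compatible(a, b):
    return all(a[v] == b[v] for v in a.keys() & b.keys())

def join(left, right):
    return [{**a, **b} for a in left for b in right if compatible(a, b)]

def second_branch(seed):
    """FILTER( BOUND(?x) ) BIND( :y AS ?y ) BIND( ?x AS ?z ), evaluated over seed."""
    out = []
    for mu in seed:
        mu = {**mu, "y": ":y"}      # BIND( :y AS ?y )
        if "x" in mu:
            mu["z"] = mu["x"]       # BIND( ?x AS ?z ) binds only when ?x is bound
        if "x" in mu:               # FILTER( BOUND(?x) ) applies to the whole group
            out.append(mu)
    return out

outer = [{"x": ":x"}]               # VALUES ?x { :x }
empty_branch = [{}]                 # {} yields one empty solution

# Query as written: the branch does not see the outer ?x.
as_written = join(outer, empty_branch + second_branch([{}]))

# Assumed rewrite: VALUES ?x { :x } repeated inside the second branch.
repeated = join(outer, empty_branch + second_branch(outer))

print(as_written)  # [{'x': ':x'}]
print(repeated)    # [{'x': ':x'}, {'x': ':x', 'y': ':y', 'z': ':x'}]
```

The second result illustrates the cost of the canonical semantics: to get both rows, whatever produces ?x has to be restated inside the branch.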

Referential Transparency

Intuition: binding a sub-expression to a variable then using it in bigger expressions should give the same result. So by any reasonable referential transparency principle, these should give the same result (except the latter should also bind ?effectivePrice)

Canonical: @frensjan writes these should give different results, but I don't know why

```
SELECT * {
VALUES ?price { 10 }
OPTIONAL {
    VALUES ?discount { 0.10 }
    FILTER( ?price * (1 - ?discount) < 10 )
}
}
```

has different semantics from:

```
SELECT * {
VALUES ?price { 10 }
OPTIONAL {
    VALUES ?discount { 0.10 }
    BIND( ?price * (1 - ?discount) AS ?effectivePrice )
    FILTER( ?effectivePrice < 10 )
}
}
```

Jena and Virtuoso: 10 0.1 and 10
Blazegraph: 10 and 10
rdf4j: 10 0.1 and 10 0.1 9.0

Tpt commented 8 months ago

The last one ("Referential Transparency") is because filter clauses in the RIGHT part of LEFT OPTIONAL { RIGHT } also have access to the variables bound by LEFT (see the "If A is of the form Filter(F, A2)" special case of the graph pattern translation). So, in the first query ?price is bound because it is used in the FILTER, whereas in the second query ?price is unbound in the BIND because BIND is not a FILTER and so the usual bottom-up evaluation semantics applies. The placement of FILTER is a nasty part of the SPARQL spec.
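The difference between the two "Referential Transparency" queries can be traced with a small Python model. This is a sketch under stated assumptions, not an implementation of any engine: `left_join`, `extend` and `holds` are illustrative names, numbers are plain Python floats rather than xsd:decimal, and an unbound variable is modelled as a `KeyError`.

```python
# Toy model of LeftJoin and Extend; not a real engine.
# A solution is a dict {variable: value}.

def compatible(a, b):
    return all(a[v] == b[v] for v in a.keys() & b.keys())

def extend(solutions, var, expr):
    """BIND: add var when expr evaluates; leave it unbound when expr errors."""
    out = []
    for mu in solutions:
        try:
            out.append({**mu, var: expr(mu)})
        except KeyError:
            out.append(dict(mu))
    return out

def holds(cond, mu):
    """A FILTER condition that raises an error counts as false."""
    try:
        return bool(cond(mu))
    except KeyError:
        return False

def left_join(left, right, cond):
    """OPTIONAL: cond is tested on the merged solution of both sides."""
    out = []
    for a in left:
        kept = [{**a, **b} for b in right
                if compatible(a, b) and holds(cond, {**a, **b})]
        out.extend(kept if kept else [a])
    return out

prices = [{"price": 10}]            # VALUES ?price { 10 }
discounts = [{"discount": 0.10}]    # VALUES ?discount { 0.10 }

def effective(mu):
    return mu["price"] * (1 - mu["discount"])

# First query: the FILTER is the join condition, so it sees ?price.
filter_only = left_join(prices, discounts, lambda mu: effective(mu) < 10)

# Second query: the BIND runs on the right side alone, where ?price is unbound,
# so ?effectivePrice is never bound and the FILTER errors on the merged solution.
rhs = extend(discounts, "effectivePrice", effective)
bind_then_filter = left_join(prices, rhs, lambda mu: mu["effectivePrice"] < 10)

print(filter_only)       # [{'price': 10, 'discount': 0.1}]
print(bind_then_filter)  # [{'price': 10}]
```

These two outputs correspond to the "10 0.1 and 10" results listed above for Jena and Virtuoso.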

klinovp commented 8 months ago

I believe I can explain the "Jena: one result :x :y but why is y bound?" thing. The LHS and RHS of the outer join (the OPTIONAL) are single solutions: ?x = :x on the left, and ?y = :y on the right. Since the right does NOT bind ?x, the solutions are compatible (as per 18.3). Thus the join solution is ?x = :x, ?y = :y. That solution binds ?x, thus the FILTER's expression passes and thus the join condition succeeds (that is, the FILTER is a part of the join condition). Therefore, the joined solution is returned, as per the OPTIONAL semantics, not the LHS solution.

Others look a bit more straightforward to me.

afs commented 8 months ago

As @klinovp says.

OPTIONAL would have been better with a syntax like OPTIONAL(left join filter expression) { pattern }.

Hindsight.

lisp commented 8 months ago

@VladimirAlexiev , you would have a stronger case if you were

- not to qualify your discourse with "I don't know SPARQL algebra very well, but this is [what the algebra says]."
- to provide a basis for intuition - for example that the interpretation follows strictly from local interpretation of lexically apparent forms.
- to provide some valuable use case which the current language definition precludes.
- to define the semantics which the "switch" is to effect.

as the issue is expressed, the likely responses will be along the lines of that from @Tpt , which are not likely to lead to what might be a useful discussion about valuable changes to the language.

VladimirAlexiev commented 8 months ago

@klinovp

> single solution: ?y = :y on the right

But there's FILTER( BOUND(?x) ) so why is that solution not discarded?

@lisp I should have written "Most SPARQL users haven't heard about SPARQL algebra" (and should not have to!). I consider myself a competent SPARQL user (eg see https://gist.github.com/VladimirAlexiev/cf2de89b692bbc2ae70917aae021ec07) but I don't care to learn or try to understand these peculiarities.

If what @Tpt wrote is true, then ?price is visible in FILTER( ?price... but invisible in BIND( ?price...: I think this defies logic or explanation.

lisp commented 8 months ago

@VladimirAlexiev, given the range of your contributions to the issues in this community group, your assessment, that the sparql optional filter semantics in relation to variable scope "defies logic or explanation", seems out of place. if you do not care to "learn or try to understand" its algebra, how do you propose to articulate an alternative sparql semantics which realizes your variant variable scoping rules?

Tpt commented 8 months ago

A bold proposal (probably more for SPARQL 2.0 rather than 1.x):

If we state that it is a syntax error to use in an expression a variable that is not in-scope, i.e. prevent expressions from using variables that will always be unbound, I believe we have a way to keep users from falling into the listed "traps".

"brackets" query will be a syntax error: ?x not in-scope in the BIND( ?x as ?z )
"optional" query will be a syntax error: ?x not in-scope in the BIND( ?x as ?z )
"union" query will be a syntax error: ?x not in-scope in the BIND( ?x as ?z )
"Referential Transparency" second query will be a syntax error: ?price not in-scope in the BIND( ?price * (1 - ?discount) AS ?effectivePrice )

Note that we already have in-scope constraints in the SPARQL grammar (see note 12).

Such a change would not restrict SPARQL expressivity (not in-scope variable in expression can always be simplified).
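A static check of this kind could look roughly like the following Python sketch. Everything here is hypothetical: the tuple-based query representation, `ScopeError`, `check_group` and `is_valid` are invented for illustration, triple patterns are not modelled, and FILTER expressions are deliberately left unchecked because only the BIND cases are listed above.

```python
# Hypothetical static scope check for BIND expressions; a simplified sketch only.
# A group is a list of elements:
#   ("values", [vars])            ("bind", [vars used in expression], target)
#   ("filter", [vars])            ("group", [elements])
#   ("optional", [elements])      ("union", [elements], [elements])

class ScopeError(Exception):
    pass

def check_group(elements):
    """Return the variables the group may bind; raise ScopeError when a BIND
    expression uses a variable that is not yet in-scope inside its own group."""
    in_scope = set()
    for kind, *args in elements:
        if kind == "values":
            in_scope |= set(args[0])
        elif kind == "bind":
            expr_vars, target = args
            missing = sorted(set(expr_vars) - in_scope)
            if missing:
                raise ScopeError(f"?{missing[0]} not in-scope in BIND(... AS ?{target})")
            in_scope.add(target)
        elif kind in ("group", "optional"):
            in_scope |= check_group(args[0])
        elif kind == "union":
            in_scope |= check_group(args[0]) | check_group(args[1])
        # "filter" elements are not checked in this sketch
    return in_scope

def is_valid(query):
    try:
        check_group(query)
        return True
    except ScopeError:
        return False

body = [("filter", ["x"]), ("bind", [], "y"), ("bind", ["x"], "z")]
brackets = [("values", ["x"]), ("group", body)]
optional = [("values", ["x"]), ("optional", body)]
union = [("values", ["x"]), ("union", [], body)]
rt_first = [("values", ["price"]),
            ("optional", [("values", ["discount"]),
                          ("filter", ["price", "discount"])])]
rt_second = [("values", ["price"]),
             ("optional", [("values", ["discount"]),
                           ("bind", ["price", "discount"], "effectivePrice"),
                           ("filter", ["effectivePrice"])])]

for name, q in [("brackets", brackets), ("optional", optional), ("union", union),
                ("rt_first", rt_first), ("rt_second", rt_second)]:
    print(name, "ok" if is_valid(q) else "syntax error")
```

With this simplified rule the three BIND( ?x as ?z ) queries and the second "Referential Transparency" query are rejected, while the first one is accepted.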

lisp commented 8 months ago

are there any cases in sparql, where the scope of a variable is not statically apparent? if not, then it would not change any result to use that determination to classify queries syntactically. in which case, is it necessary to change the major version number?

klinovp commented 8 months ago

@VladimirAlexiev

> But there's FILTER( BOUND(?x) ) so why is that solution not discarded?

Have a look at the SPARQL algebra. The FILTER is a part of the join condition, it's not just a filter sitting on top of the RHS only. It's evaluated over the join solution, not over the RHS solution.

Your query:

```
PREFIX : <http://example.org/>
SELECT * WHERE {
    VALUES ?x { :x }
    OPTIONAL {
        FILTER( BOUND(?x) )
        BIND( :y as ?y )
        BIND( ?x as ?z )
    }
}
```

is NOT the same as this query:

```
PREFIX : <http://example.org/>
SELECT * WHERE {
    VALUES ?x { :x }
    OPTIONAL {
        {
        FILTER( BOUND(?x) )
        BIND( :y as ?y )
        BIND( ?x as ?z )
        }
    }
}
```

The latter would return ?x = :x, as you expect. Again, the algebra should make the difference fairly obvious.
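To make the difference concrete, here is a minimal Python sketch of the two queries. It is a toy model with made-up helper names (`left_join`, `bound_x`), and the right-hand side is written out by hand: BIND( :y as ?y ) succeeds, while BIND( ?x as ?z ) errors inside the group because ?x is unbound there.

```python
# Toy model of the SPARQL algebra, not a real engine.
# A solution is a dict {variable: value}.

def compatible(a, b):
    return all(a[v] == b[v] for v in a.keys() & b.keys())

def left_join(left, right, cond):
    """LeftJoin: cond is tested on the merged solution, not on the right side alone."""
    out = []
    for a in left:
        kept = [{**a, **b} for b in right
                if compatible(a, b) and cond({**a, **b})]
        out.extend(kept if kept else [a])
    return out

def bound_x(mu):
    return "x" in mu           # FILTER( BOUND(?x) )

lhs = [{"x": ":x"}]            # VALUES ?x { :x }
rhs = [{"y": ":y"}]            # result of the two BINDs; ?z stays unbound

# First query: the FILTER becomes the LeftJoin condition.
first = left_join(lhs, rhs, bound_x)

# Second query: the extra braces make the FILTER an ordinary filter on the
# inner group, which removes its only solution before the LeftJoin happens.
inner = [mu for mu in rhs if bound_x(mu)]
second = left_join(lhs, inner, lambda mu: True)

print(first)   # [{'x': ':x', 'y': ':y'}]
print(second)  # [{'x': ':x'}]
```

The first output is the ":x :y" row reported for Jena, and the second is the ?x = :x row described for the nested-braces variant.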

klinovp commented 8 months ago

> If what @Tpt wrote is true, then ?price is visible in FILTER( ?price... but invisible in BIND( ?price...: I think this defies logic or explanation.

@Tpt is correct and the semantics makes perfect sense. What is confusing here is the syntax. As Andy noted above, a better syntax would make it obvious that the FILTER is a part of the join, not a post-processor of the OPTIONAL scope. It'd look like this:

```
SELECT * {
    VALUES ?price { 10 }
    OPTIONAL ( ?price * (1 - ?discount) < 10  ) {
        VALUES ?discount { 0.10 }
    }
}
```

or

```
SELECT * {
    VALUES ?price { 10 }
    OPTIONAL ( ?effectivePrice < 10 ) { # <-- is evaluated over joined solutions
        VALUES ?discount { 0.10 }
        BIND( ?price * (1 - ?discount) AS ?effectivePrice )  # <-- is evaluated over RHS solutions (and thus raises errors)
    }
}
```

In the current SPARQL syntax the FILTER and the BIND are syntactically close to each other which obscures the fact that they are positioned in very different places in the algebra and process different binding sets.

Tpt commented 8 months ago

> are there any cases in sparql, where the scope of a variable is not statically apparent?

@lisp No, the scope is defined from the syntax tree by the spec.

> if not, then it would not change any result to use that determination to classify queries syntactically.

Yes! Exactly!

> in which case, is it necessary to change the major version number?

This change makes invalid some queries that were valid and well defined according to SPARQL 1.0/1.1. So it looks kind of breaking to me. But it's only my personal opinion.

lisp commented 8 months ago

under the premise, that "the scope is [completely and correctly] defined from the syntax tree by the spec", if that definition is used to identify invalid queries,

what class of queries is valid and well defined, which comprises expressions which include variable references outside of the scope of some definition?
what class of queries is no longer valid and well defined, which comprises expressions in which all variable references are in the scope of some definition?

Tpt commented 8 months ago

> under the premise, that "the scope is [completely and correctly] defined from the syntax tree by the spec", what class of queries is "valid and well defined" which comprises expressions which include variable references outside of the scope of some definition?

All queries that contain variables that are not in-scope in expressions, like the 4 queries I listed in this answer. They are all valid SPARQL queries.

lisp commented 8 months ago

@tpt, how can it be true that,

> [...] queries that contain variables not in-scope in expressions like the 4 queries I listed https://github.com/w3c/sparql-dev/issues/195#issuecomment-2002585506 [...] are all valid SPARQL queries.

is it not correct, that the expression in a bind form must include only variables in some scope in order for the containing query to be valid? this, independent of whether the variables happen to be bound in a given solution.

Tpt commented 8 months ago

> @Tpt, how can it be true that,
>
> [...] queries that contain variables not in-scope in expressions like the 4 queries I listed https://github.com/w3c/sparql-dev/issues/195#issuecomment-2002585506 [...] are all valid SPARQL queries.
>
> is it not correct, that the expression in a bind form must produce a value in order for the containing query to be valid?

Extend (the algebra operation behind BIND) is defined even if the expression returns an error. And the SPARQL grammar only states that "The variable assigned in a BIND clause must not be already in-use within the immediately preceding TriplesBlock within a GroupGraphPattern." but does not add any restriction on the expression.

To my knowledge, there is no syntactic way in SPARQL to ensure that an arbitrary expression never fails. For example 1 + ?x can error if ?x is an IRI...

klinovp commented 8 months ago

> To my knowledge, there is no syntactic way in SPARQL to ensure that an arbitrary expression never fails. For example 1 + ?x can error if ?x is an IRI...

Actually, there is: bind(coalesce(expr, "error") as ?x) will never fail. If expr raises an error, ?x will be bound to "error". Query engines can use this fact to reason about (lack of) NULLs.
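The COALESCE behaviour described here can be mimicked in a few lines of Python. This is only an analogy, not SPARQL: `coalesce` and `safe` are illustrative names, an unbound variable is modelled as a `KeyError`, and adding a number to an IRI is modelled as a `TypeError` on a string.

```python
# Toy analogue of COALESCE(expr, "error"); an analogy, not SPARQL itself.
# A solution is a dict {variable: value}; expressions are functions of a solution.

def coalesce(*exprs):
    """Return a function giving the value of the first expression that
    evaluates without error for a given solution."""
    def evaluate(mu):
        for expr in exprs:
            try:
                return expr(mu)
            except (KeyError, TypeError):   # unbound variable or type error
                continue
        raise ValueError("all COALESCE arguments raised an error")
    return evaluate

def one_plus_x(mu):
    return 1 + mu["x"]                      # 1 + ?x

safe = coalesce(one_plus_x, lambda mu: "error")

print(safe({"x": 2}))                       # 3
print(safe({"x": "http://example.org/a"}))  # error  (IRI modelled as a string)
print(safe({}))                             # error  (?x unbound)
```

Because the last argument is a constant, `safe` always produces a value, which is the property an engine could rely on when reasoning about the absence of unbound results.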

Otherwise, I agree. A query can have a BIND which refers to variables out of scope and be perfectly valid. Moreover, that BIND, which refers to variables out of scope, may not even raise errors at runtime.

lisp commented 8 months ago

from the point of view of interoperability, are there defined results for these queries which include forms which reduce to a reference to an undefined (not just unbound) variable?

klinovp commented 8 months ago

I don't know what you mean by "from the point of view of interoperability" but yes, the spec does define results, incl. for queries that use BIND referring to out-of-scope variables. It's spec's job to define results for each syntactically valid query given the data.

lisp commented 8 months ago

> I don't know what you mean by "from the point of view of interoperability"

the point is that, while extend is defined such that a static analysis of variable definitions would not change the class of a query expression from undefined to invalid based on the constitution of the respective expression, it is not clear to this recommendation reader that this is the case for all expressions which include references to undefined variables.
this relates to the matter, whether to apply the results of such analysis would require a 2.* revision.

from this perspective, as reclassifying those expressions described in @tpt's list would require to change the definition for extend, the suggestion would require at least a 2.0 jump, while for any other expressions which would change class from an undefined result to an invalid, a 1.* revision should suffice.

Tpt commented 8 months ago

> from this perspective, as reclassifying those expressions described in @Tpt's https://github.com/w3c/sparql-dev/issues/195#issuecomment-2002585506 would require to change the definition for extend, the suggestion would require at least a 2.0 jump, while for any other expressions which would change class from an undefined result to an invalid, a 1.* revision should suffice.

Nit: my proposal would not change the Extend operator definition but add a syntactic restriction to the SPARQL grammar, just like the existing one that prevents ?x in BIND(... AS ?x) from being already in-scope.

VladimirAlexiev commented 8 months ago

@lisp

> to provide some valuable use case which the current language definition precludes.

I think I have one: fetching multi-valued fields of a subject, each of which needs to be in its own UNION clause to avoid Cartesian Product. If the bindings before/outside are not available inside the UNION, then you need to refetch that subject in each clause.

Eg see https://vocab.getty.edu/doc/queries/#All_Data_For_Subject and imagine that:

BIND (ulan:500115493 as ?s) is instead a heavy subquery (that's why I think that finding something using a complex search, and then returning its data that has a complex shape, makes for a difficult query)
Consider nested UNIONs, eg the part ?s ^iso:superOrdinate ?ar FILTER NOT EXISTS {?ar xl:prefLabel ?t1}: do I need to repeat it in each further sub-clause?

Can you rewrite this query?

@Tpt

> how do you propose to articulate an alternative sparql semantics

I'm not competent enough to articulate an alternative. I'm just shocked at these "features" of SPARQL.

@klinovp

> the semantics makes perfect sense. What is confusing here is the syntax.

Ok, this clarification is important for this forum, but it will be lost on any SPARQL user.

If the effective use of SPARQL requires learning the intricacies of an Algebra then that's a bad thing. Note that different repositories give different answers to (at least some of) the puzzles above. Hopefully these are borderline cases that users won't encounter often...

A clarification: I have the utmost respect for the members of this group (and all other creators of SPARQL), and similar for XQuery and XSPARQL... Devising a good query language is a difficult task, and passing it through the W3C standardization process is more difficult yet. And I hope I haven't embarrassed myself too badly :-)

lisp commented 8 months ago

> If the effective use of SPARQL requires learning the intricacies of an Algebra then that's a bad thing.

you may not be willing to articulate it, but you imply that your domain would benefit from ways to manipulate its datasets which more directly correspond to its concepts than sparql does - or even, likely should.

> Note that different repositories give different answers to (at least some of) the puzzles above.

from the discourse in this thread it appears clear that the recommendation is not ambiguous and that any differences are a consequence of implementation "variations". one could endeavour to increase the interoperability. this argues to include appropriate tests in a 1.2 test suite, but that would not, itself, bring you closer to your goal.
