qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/
Other
28 stars 15 forks source link

[XSLT] Use of multiple predicates: order of evaluation #71

Closed michaelhkay closed 1 year ago

michaelhkay commented 3 years ago

I notice I added an example pattern to the draft XSLT4 spec match=".[. castable as xs:date][xs:date(.) le current-date()]" which is incorrect because processors are allowed to change the order of predicates, so you can't use the first predicate as a guard to stop the second predicate throwing an error. I've seen users fall over this (Saxon does sometimes reorder predicates). My instinct is to ban reordering of predicates; if you want to allow it, you can use the "and" operator. An alternative would be an "and" operator (say "and-also") with explicit ordering semantics, as in XPath 1.0.

benibela commented 3 years ago

I would ban any reordering that would lead to throwing errors

ChristianGruen commented 3 years ago

Banning sounds good.

yamahito commented 3 years ago

I have been thinking a while about this, and agree that banning is correct; consider:

preceding-sibling::*[1][@type='footnote']
preceding-sibling::*[@type='footnote'][1]

Even though no type error would be thrown, reordering of the predicates here significantly changes the meaning and result of the XPath statement.

michaelhkay commented 3 years ago

Obviously you're only allowed to reorder predicates where it doesn't affect the semantics of a non-error query. The debate here is about case where the only effect of reordering is on the detection or non-detection of errors, that is, the scenario within the scope of the infamous "Errors and Optimization" section of the spec.

dnovatchev commented 3 years ago

The debate here is about case where the only effect of reordering is on the detection or non-detection of errors

But we still can have (note the parens below):

(.[. castable as xs:date]) [xs:date(.) le current-date()]

and this enforces the wanted order of evaluation, right?

For me the problem is whether or not there should be "shortcutting". That is, should evaluation continue even after it has been determined that the value of the sub-expression, on which the predicate must be applied, is ()

Not sure if the Spec says anything about shortcutting in this case.

BaseX doesn't raise an error on this:

for $d in 10,
    $sub in $d[. castable as xs:date]
  return
     $sub[xs:date(.) le current-date()]

But throws an error ("[XPTY0004] Cannot convert xs:integer to xs:date : 10") on this:

for $d in  10
  return
    ($d[. castable as xs:date])[xs:date(.) le current-date()]
ChristianGruen commented 3 years ago

BaseX doesn't raise an error on this:

I noticed it also raises an error with the latest release of BaseX (because both queries are now rewritten to the same representation). The error results from a compilation step, which determines that the evaluation of the second predicate will never succeed for the given input.

michaelhkay commented 3 years ago

Parens don't enforce order of evaluation. The infamous rules in "Errors and Optimization" allow all expressions (with a few exceptions such as conditionals) to be rewritten to equivalent expressions without regard to the effect on error behaviour. Stripping parens is an obvious example.

As for your example, type errors are subject to different rules. A processor can report a type error statically for any expression if it can determine that execution of that expression would always fail; it doesn't have to establish that the expression will actually be evaluated.

dnovatchev commented 3 years ago

Parens don't enforce order of evaluation

Then what are they for?

michaelhkay commented 3 years ago

Parens override the default precedence of operators, they don't control order of evaluation. (a+b)+c can be rewritten as a+(b+c) because in the absence of errors the results are the same.

dnovatchev commented 3 years ago

Parens override the default precedence of operators, they don't control order of evaluation.

Yes. Then we probably need a special operator or function that would instruct the XPath processor that a sub-expression should be evaluated lazily. Such as:

.[. castable as xs:date] => lazy( [xs:date($z) le current-date()] )

It seems that (at least in this case) we can use the ! operator to achieve evaluation in the wanted order. For example, instead of:

for $d in  10
  return
    $d[. castable as xs:date]![xs:date(.) le current-date()]

which produces an error in Saxon but is evaluated successfully in BaseX 9.5

we can have:

for $d in  10
  return
    $d[. castable as xs:date]  !  function($x) {$x[xs:date(.) le current-date()]}

And the latter expression is evaluated without raising an error, both in Saxon and BaseX.

So,

lazy(expression($arg1, $arg2, ..., $argN))

could be an abbreviation for:

! function($arg1, $arg2, ..., $argN) {expression($arg1, $arg2, ..., $argN)}
ChristianGruen commented 3 years ago

But throws an error ("[XPTY0004] Cannot convert xs:integer to xs:date : 10") on this:

The query yielded an error because the input was supplied statically. While that should be a rather uncommon use case, I decided to change the behavior of BaseX: If values are inlined at compile time, no error will be raised anymore, but the predicate expression will be replaced with an fn:error function, which will only be raised later on at runtime if the predicate will actually be evaluated.

ChristianGruen commented 3 years ago

@michaelhkay Thanks for my answer to https://github.com/qt4cg/qtspecs/issues/78#issuecomment-851046681. I’m continuing in the original issue:

Because some XQuery vendors translate operators such as and/or to other languages such as SQL, and reordering of terms in a WHERE clause to take maximum advantage of indexes is a key feature of SQL optimisers.

I see how this was a substantial argument for XQuery 1.0. I wonder if it’s still relevant for XQuery 4? XQuery 3/3.1 has so many constructs that cannot be expressed with standard relational algebra anymore. Do any SQL implementations for XQuery 3.1 exist at all?

Multiple predicates can easily be rewritten to a sequence of and operands, and vice versa. The same applies to multiple subsequent where clauses, so I always regarded all those as interchangeable:

(: currently equivalent if the tests are not positional :)
SEQ [ TEST1 and TEST2 ]

SEQ [ TEST1 ] [ TEST2 ]

for ITEM in SEQ
where TEST1 and TEST2
return RESULT

for ITEM in SEQ
where TEST1
where TEST2
return RESULT

If we restricted the evaluation order of predicates, those queries would not be equivalent anymore, and potential SQL implementations would also need to suppress these kinds of rewritings in the future.

From the user perspective, I’m pretty sure that the reordering of logical expressions is rather unexpected (as long as a developer doesn’t have a strong database background). I would claim there is already so much code out there that relies on a short-circuit evaluation of logical expressions – even more than for predicates, the syntax of which is not that explicit.

michaelhkay commented 2 years ago

Coming back to this thread after a long absence.

Firstly, I think there's a need both for constructs that allow reordering for optimisation purposes, and for constructs that force short-circuiting so that dynamic errors can be prevented.

Currently the only way of writing (A and B) in a way that forces A to be checked before B is evaluated is to use a conditional: if (A) then B else false().

Neither [A and B] nor [A][B] currently enforces the order. If existing implementations take advantage of this for optimization purposes then I don't think we should require them to change - the effect could be quite significant on performance of existing code.

So I think my preferred option would to add new operators with enforced short-circuiting: possible syntax "and-if" and "or-if". The semantics would be defined in terms of conditional expressions, which require that only one branch is evaluated (or at any rate, if a branch is pre-emptively evaluated, this must not have any visible effects like throwing an error).

Note: Saxon treats the predicates [A][B] and [A and B] as interchangeable (assuming the predicates are non-positional), and the optimiser does reorder the terms in some circumstances, as permitted by the spec. For example, if one of the terms is eligible for indexing then it will be moved to the left of terms that require a serial search.

dnovatchev commented 2 years ago

So I think my preferred option would to add new operators with enforced short-circuiting: possible syntax "and-if" and "or-if". The semantics would be defined in terms of conditional expressions, which require that only one branch is evaluated (or at any rate, if a branch is pre-emptively evaluated, this must not have any visible effects like throwing an error).

This is good and it specifies shortcutting based on the value of the 1st argument of and.

For functions like fn:fold-right() it is necessary to have shortcutting on the second (right) argument, so we will also need:

A and-if-only B

This evaluates B first and if B is false, it skips the evaluation of A and returns false()

michaelhkay commented 2 years ago

I don't understand. Why can't they just write B and-if A if they have this rather unusual requirement?

ndw commented 2 years ago

I think and-if is confusing. I see what is intended, but the thing that comes after "if" would usually be the conditional in ordinary prose. So it reads like it depends on B which isn't the intent (unless I've misunderstood entirely).

I asked my resident linguist and she proposed and-then which I think is better.

dnovatchev commented 2 years ago

I wrote:

For functions like fn:fold-right() it is necessary to have shortcutting on the second (right) argument, so we will also need:

A and-if-only B

Sorry, this would be unnecessary if we move forward with lazy evaluation (on demand) of the arguments on a function call -- this will be the subject of a new, separate proposal, to come shortly. With Lazy Arguments evaluation:

fold-right( (false(), (1 to 1000000000000000000000000000000000000) ! true() ),   true(),   op('and') )

this is evaluated as:

false() and fold-right($notEvaluatedNotAvailableYetTailArgument, true(), op('and') )

And due to the shortcutting of and , this immediately evaluates to false(), thus the 2nd argument to this and: (fold-right($notEvaluatedNotAvailableYetTailArgument, true(), op('and')` is completely ignored (skipped, not evaluated at all)

I have seen some people wonder: "why on Earth fn:fold-right() exists at all", and they would be right were it not for shortcutting and lazy argument evaluation, as it is in Haskell, from which foldr was borrowed, renamed and its signature (order of arguments) adjusted.

michaelhkay commented 2 years ago

I thought about and-then (and and-also)

I still prefer and-if.

ndw commented 2 years ago

I really think and-if is going to be confusing for a lot of users.

An ordinary reading of "if I am tired and if it's past midnight then I should go to bed" means that you should go to bed if you're tired AND it's past midnight. That doesn't convey the short-circuit semantics to me at all. Of course, that's equally true of the other options you've shown so...bleh.

I think it would be better to have something that doesn't imply any particular semantics than something that implies the wrong semantics. Maybe short-and and short-or?

ChristianGruen commented 2 years ago

I agree with Norm that and if doesn't really indicate what is going to happen. What about then-and/then-or?

If we hadn't || for concatenations, we could have used && and ||.

dnovatchev commented 2 years ago

I thought about and-then (and and-also)

  • If I am tired and if it's past midnight then I should go to bed.
  • If I am tired and then it's past midnight then I should go to bed.
  • If I am tired and also it's past midnight then I should go to bed.

I still prefer and-if.

Considering all above alternatives, I feel that and also is closest to denoting sequential execution: we first check condition-1 and then also we check condition-2.

Maybe it would be even more unambiguous to use:

if condition1 and only then if condition2

Without the ifs:

condition1 and only then condition2

ChristianGruen commented 2 years ago

Or (one of my favorites), if(A) then B without else, defaulting to an empty sequence ;) Obviously, this would only work if the result can be xs:boolean?.

dnovatchev commented 2 years ago

condition1 and only then condition2

On further thought:

condition1 and additionally condition2

michaelhkay commented 1 year ago

For another example of a Saxon user falling over this, see https://saxonica.plan.io/issues/5721.

I think that defining the rules for predicates is probably more important than providing a short-cut "and-also" operator. Note that we don't need to define or constrain order of evaluation, we only have to say that no error is reported in respect of the second predicate unless the first predicate returns true. Optimizers that want to use indexes can still do so, provided they adjust the error handling accordingly.

dnovatchev commented 1 year ago

I think that defining the rules for predicates is probably more important than providing a short-cut "and-also" operator. Note that we don't need to define or constrain order of evaluation, we only have to say that no error is reported in respect of the second predicate unless the first predicate returns true. Optimizers that want to use indexes can still do so, provided they adjust the error handling accordingly.

I think Haskell got it right and lazy by default is the way to go. image

And it is good for the carbon footprint.

Not giving the programmer full control can have bad consequences. It is like not knowing at any moment who does what with your money.

I am beginning to understand why many wise, prominent and deeply understanding people are warning of the dangers of AI.

No need to watch _"Rick and Morty"_ in order to arrive at this conclusion.

Why would anyone want to use Saxon after seeing that it needs 100 seconds in order to evaluate this expression:

image

when it can be evaluated infinitely faster (0 seconds) in Haskell and also using BaseX (though they spent many many days working on their optimization techniques, which would have come for free if we had lazy evaluation) ?

image

And here is a video of doing the same in Haskell:

https://github.com/dnovatchev/FXSL-XSLT2/raw/master/Comp-Foldr%20Shortcutting%20-%20With%20Haskell%20-%20Replit%20-%20Google%20Chrome%202022-10-10%2012-45-44.m4v

liamquin commented 1 year ago

Can we not bash individual products? There is no work of humankind that cannot be improved.

Going back to the original post, i agree with Mike that saying you don't get errors from predicates to the right of ones that return false seems not too much of a restriction. I know there are implementations that evaluate XPath from right to left today, at least in an XQuery context, as it was discussed in an XQuery face-to-face meeting, but i don't know whether they raise errors in this case or suppress them.

michaelhkay commented 1 year ago

Saxon does of course use lazy evaluation very extensively; but I fail to see what that has got to do with this issue [the reason this query runs slowly is that we haven't done much work on speeding up dynamic function evaluation; our bread-and-butter is running DocBook and DITA stylesheets, and that's where we put our performance efforts].

Your example query doesn't seem to be relevant in the slightest to the issue I have raised here, which is whether we should change the rules in "Errors and Optimization" to give less freedom to implementations to rearrange the terms in a filter expression.

There's no doubt that re-arranging the terms of a filter expression can make a very big difference to performance with or without lazy evaluation; and there's no doubt also that doing so in a way that changes the error semantics (as permitted by the current specification) can cause a significant usability hassle. So could we debate that design trade-off please, and not go off at a tangent?

ChristianGruen commented 1 year ago

I would be glad if we can avoid adding new operators.

@dnovatchev I'm wondering why you added a very specific comparison of Saxon and BaseX in a subsequent edit of your comment, and I cannot see how this relates to the topic of this issue.

dnovatchev commented 1 year ago

I would be glad if we can avoid adding new operators.

@dnovatchev I'm wondering why you added a very specific comparison of Saxon and BaseX in a subsequent edit of your comment, and I cannot see how this relates to the topic of this issue.

@ChristianGruen I think this must be clear: These are counter-examples for the statement by @michaelhkay:

I think that defining the rules for predicates is probably more important than providing a short-cut "and-also" operator.

These examples give us specific evidence how big a cost is actually incurred without shortcutting.

There's no doubt that re-arranging the terms of a filter expression can make a very big difference to performance with or without lazy evaluation; and there's no doubt also that doing so in a way that changes the error semantics (as permitted by the current specification) can cause a significant usability hassle. So could we debate that design trade-off please, and not go off at a tangent?

@michaelhkay Trying to set the priorities right is not "going off a tangent". When there is no need at all to evaluate an expression, it is not even meaningful (or it is at least premature) to think about rearranging its subexpressions, and doing so in advance often is loss of time and resources. To quote Donald Knuth: "Premature optimization is the root of all evil".

Can we not bash individual products? There is no work of humankind that cannot be improved.

@liamquin, I never attempted at "bashing individual products" and apologize if the comment was perceived as such.

On the contrary, this is an actual example how even the best product can incur significant performance loss if shortcutting is not supported.

And the idea is not at all to work "to improve" the product, as with lazy evaluation this improvement comes for free.

ChristianGruen commented 1 year ago

It may be important to stress that there is no single shortcutting or lazy evaluation approach that can globally be applied. Instead, it's a wide variety of optimization and runtime strategies that can make code faster.

michaelhkay commented 1 year ago

I would like to propose rewriting the paragraph in "Errors and Optimization" which currently reads as follows:

Conditional expressions must not raise a dynamic error in respect of subexpressions occurring in a branch that is not selected, and must not return the value delivered by a branch unless that branch is selected. Thus, the following example must not raise a dynamic error if the document abc.xml does not exist: if (doc-available('abc.xml')) then doc('abc.xml') else () Of course, the condition must be evaluated in order to determine which branch is selected, and the query must not be rewritten in a way that would bypass evaluating the condition.

To say instead:

In certain expressions (detailed below) one subexpression A may act as a guard for another subexpression B. This occurs in a situation where, if A takes a particular value, there is no need to evaluate B in order to determine the result of the parent expression. For example in a conditional expression if (X) then Y else Z, the condition X acts as a guard for both Y and Z: if X is true, there is no need to evaluate Z, while if X is false, there is no need to evaluate Y. We say that the condition X=true guards Z, while the condition X=false guards Y. A processor MUST ensure that no dynamic error in a guarded subexpression propagates in a way that causes the parent expression to fail.

Note: a processor is free to evaluate both branches of a conditional (partially or completely) before deciding which of the two values to use -- a strategy that might be advantageous if, for example, the two branches contain common subexpressions, or if this approach enables greater parallelism. But if the processor does this, then it must ensure that no dynamic error occurs that would otherwise not occur.

Note: the specification has nothing to say about side effects. Many implementations in practice allow external functions to have side effects, but the semantics of calls to such functions are implementation-defined. Any rules preventing guarded sub-expressions from having side effects are therefore a matter for the implementation.

The following expressions have guarded subexpressions where this rule applies:

dnovatchev commented 1 year ago

This table from Wikipedia provides data about the _short-circuit operators_ existing in 48 common programming languages (out of this 48 PLs, only one - ZTT comes with no short-circuit operators, and on the other side, 19 have no eager operators)

image

dnovatchev commented 1 year ago

It may be important to stress that there is no single shortcutting or lazy evaluation approach that can globally be applied. Instead, it's a wide variety of optimization and runtime strategies that can make code faster.

@ChristianGruen Absolutely agree that it might not be the best idea to blindly apply a global lazy (or global eager) evaluation strategy.

What is definitely better is to provide the user with constructs (hints) for lazy or eager evaluation of the arguments of a function - as part of the function definition.

michaelhkay commented 1 year ago

Most of the languages in the Wikipedia list are procedural languages where evaluating the operands can have side-effects. That makes it much more important to have rules controlling the order of evaluation.

dnovatchev commented 1 year ago

@michaelhkay ,

Most of the languages in the Wikipedia list are procedural languages where evaluating the operands can have side-effects. That makes it much more important to have rules controlling the order of evaluation.

Well, most functional programming languages are in this table: Lisp, ML, Haskell, OCaml, Erlang, Lua, Scheme, Kotlin

Plus a number of languages that support both FP and procedural programming: C#, Python, Java (Scala?), Javascript ..., etc.

So it seems, as a rule most FP languages do have short-circuiting operators

ChristianGruen commented 1 year ago

The comment in https://github.com/qt4cg/qtspecs/issues/71#issuecomment-1292802834 describes exactly what I would expect implementations to do (and I think it’s a solution for both this issue and #78). It means that:

michaelhkay commented 1 year ago

I have submitted a PR at https://github.com/qt4cg/qtspecs/pull/230 that attempts to resolve this.

dnovatchev commented 1 year ago
  • The right-hand operand of an "and" expression is guarded when the left-hand operand returns false.
  • The right-hand operand of an "or" expression is guarded when the left-hand operand returns true.

Yes, and we obviously need to add to this list:

michaelhkay commented 1 year ago

Interesting observation. I'm rather reluctant to special-case the rules for evaluating arguments of a dynamic function call based on what function you are calling. (How do you know what function you are calling? You only have the properties of the function item to go with. This would imply extending the data model to add some new properties for function items.) I'd prefer to add a warning/note here pointing out that the specification does not guarantee anything here (though an implementation might do so). I suggest handling this as a separate issue.

ChristianGruen commented 1 year ago

I would as well expect fn:op("and") and fn:op("or") to behave identically as and/or.

Do we really need to extend the data model, or wouldn’t it be sufficient to mention that code resulting from fn:op needs to follow the same rules as the equivalent static expressions?

dnovatchev commented 1 year ago

@michaelhkay

How do you know what function you are calling?

Isn't it a fact that an implementation knows (and actually has to know) when any standard function is being called, simply because the implementation owns (the implementation/itself) its code?

michaelhkay commented 1 year ago

You only know the properties of the function item, as described in the data model. One of those properties is the implementation of the function, but you're not going to start decompiling byte code to find out what it does...

dnovatchev commented 1 year ago

You only know the properties of the function item, as described in the data model. One of those properties is the implementation of the function, but you're not going to start decompiling byte code to find out what it does...

Yes, but you just call the function, passing to it functions that each produces its corresponding arguments, and the function itself knows that its first argument with a given value is a guard.

dnovatchev commented 1 year ago

@michaelhkay ,

The proposal in its current form as per https://github.com/qt4cg/qtspecs/pull/230 does not cover (at least in an obvious way) the case where in this expression:

(E1, E2)[1]

E2 is guarded by the existence of E1

Perhaps this rule can be generalized to say that in

(E1, E2, E3, ... , EN)[1]

All expressions E2, E3, ... , EN are guarded by the existence of E1

ChristianGruen commented 1 year ago

Filter expressions cannot be treated this way. Think of examples that are not that obvious:

error()[0]
$x[2 -1]
$x[position() < 2]
let $i := 1 return $x[$i]

And think of the definition of the filter expression:

The result of the filter expression consists of the items returned by the base expression, filtered by applying the predicate to each item in turn.

It's not the expressions that will matched against the predicates, but the resulting items. An expression may yield zero or more items. Next, if an expression yields an error, it will never be matched against a predicate, no matter if it may be positional or not.

One more good reason to include otherwise, by the way.

michaelhkay commented 1 year ago

(E1, E2)[1]

I don't think we should try to legislate for that one. It would raise too many questions, for example you could extend similar arguments for (E1, E2)[last()] - why should we assume that evaluating a sequence lazily from left-to-right is always possible, but evaluating from right-to-left isn't? With a rule like that, we would be making too many assumptions about the internals of an implementation.

The real problem, I think, is substitutability. The substitutability principle says that wherever we write a constant such as 1, we should be able to write an expression that evaluates to 1. But we can't say that E2 is guarded if the predicate evaluates to the value 1, because in the general case, the predicate has to be evaluated separately for each item in the input sequence (recall that [1] means [position()=1]). We can only do the optimization of evaluating it once for the entire sequence where it's an expression that has no dependencies on the context item, and we have no machinery in the specification for distinguishing such expressions.

dnovatchev commented 1 year ago

Let's have one more attempt at this:

head($it1, $it2, $it3, ... , $itK)

In this expression every item $itN is guarded by an existing item $itM, where M < N

We see that the semantics of some of the standard XPath functions establishes guards for some of their arguments.

As we are at work in enumerating and specifying all such guards, let's do the job properly.

Let us all investigate the standard XPath functions and all different type of expressions to discover any such guards, then properly enumerate them in the Specification.

ChristianGruen commented 1 year ago

head($it1, $it2, $it3, ... , $itK)

Same here as for filter expressions: The argument of this function is a single expression that yields a sequence of items. In some cases, results can be returned iteratively, and evaluation can be skipped after the first item. In other cases, the expression needs to be fully evaluated before we know what will be the first item (e.g. when data is sorted).

This is different for the existing list of guarded expressions, which really consist of multiple subexpressions.

michaelhkay commented 1 year ago

I really don't want to change the rules for function calls so that the way arguments are evaluated depends on which function you are calling.

dnovatchev commented 1 year ago

I really don't want to change the rules for function calls so that the way arguments are evaluated depends on which function you are calling.

Nobody is suggesting this.

Instead, we may provide a way for every function to specify its (if it has any) guards.