qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/
Other
28 stars 15 forks source link

[XSLT] Use of multiple predicates: order of evaluation #71

Closed michaelhkay closed 1 year ago

michaelhkay commented 3 years ago

I notice I added an example pattern to the draft XSLT4 spec match=".[. castable as xs:date][xs:date(.) le current-date()]" which is incorrect because processors are allowed to change the order of predicates, so you can't use the first predicate as a guard to stop the second predicate throwing an error. I've seen users fall over this (Saxon does sometimes reorder predicates). My instinct is to ban reordering of predicates; if you want to allow it, you can use the "and" operator. An alternative would be an "and" operator (say "and-also") with explicit ordering semantics, as in XPath 1.0.

dnovatchev commented 1 year ago

head($it1, $it2, $it3, ... , $itK)

Same here as for filter expressions: The argument of this function is a single expression that yields a sequence of items. In some cases, results can be returned iteratively, and evaluation can be skipped after the first item. In other cases, the expression needs to be fully evaluated before we know what will be the first item (e.g. when data is sorted).

This is different for the existing list of guarded expressions, which really consist of multiple subexpressions.

Why should this be a problem? If the evaluation hasn't taken place yet, then the guard provides useful information, otherwise not.

This doesn't mean that guards are useless.

ChristianGruen commented 1 year ago

Why should this be a problem?

Look at the following code: Which values would you expect to be assigned to $it1 and $it2?

head(sort( (2, error(), 1 ))`

If the evaluation hasn't taken place yet, then the guard provides useful information

Guards should not be dependent on specifics of the implementation.

I’ll be glad to explain more details in the chat.

dnovatchev commented 1 year ago

Look at the following code: Which values would you expect to be assigned to $it1 and $it2?

head(sort( (2, error(), 1 ))`

@ChristianGruen It was an error from my side -- please ignore this. fn:head has just a single argument and it needs it. Thus at least for now let us not discuss it.

I am talking about functions that have at least one argument and their authors know that in certain cases the function may produce its results without needing some of its arguments. Then, it will be good if the author of the function could specify this fact. Like:

let $and :=function($a, lazy $b) { $a and $b}

dnovatchev commented 1 year ago

You only know the properties of the function item, as described in the data model. One of those properties is the implementation of the function, but you're not going to start decompiling byte code to find out what it does...

@michaelhkay,

Yes, but the XDM gives us the arity.

Thus, if:

  f($x, $y) = { g($x) ($y)}

where by definition, g($x) is exactly f($x, ?) ,

and if for a given $x,

function-arity( f($x, ?)) eq 0

Then the developer can specify a hint about this, and even the arity of a dynamically-produced function can be evaluated dynamically by the XPath processor,

Thus, in the case of 2-argument function, as above, the XPath processor can determine that the returned function f($x, ?) is the constant function (has arity 0), and thus can decide to short-circuit (skip the evaluation of the remaining argument(s))

Thus, we do not need to add any new properties to the XDM for functions.

Based on this, I will be writing a proposal for giving the developer the ability to specify short-circuiting hints.

michaelhkay commented 1 year ago

A postscript: I've been reviewing some of the Saxon optimizations to see whether they conform to the new rules. An interesting case is one that arose from the queries/stylesheets generated by Gunther Rademacher' Rex parser, where we get a lot of expressions of the form ($x = 5 or $x = 10 or $x = 12 or $x = 15 ...). We start by turning this into a general comparison ($x = (5, 10, 12, 15)) and this then gets turned into map{5:(), 10:(), 12:(), 15:()}=>contains($x). With some caveats about type checking, of course. With only literals involved, the question of guarding against errors doesn't arise. But it's a good illustration of why we should allow processors to change the order of evaluation provided it doesn't cause unwanted errors.

ChristianGruen commented 1 year ago

But it's a good illustration of why we should allow processors to change the order of evaluation provided it doesn't cause unwanted errors.

I agree, and I had ignored we also have such cases before you addressed it. Based on statistics of the accessed database, we are choosing a predicate and rewrite it to an index requests if we are confident that the rewritten expression is equivalent. For example, an expression like…

collection('DB')//element[text() = 'VALUE'][@id = 'ID']

…may be rewritten to one of the following expressions:

db:attribute('db', 'ID')/self::attribute(id)/parent::element[text() = 'VALUE']
db:text('db', 'VALUE')/parent::element[@id = 'ID']
michaelhkay commented 1 year ago

We have added rules about "guarded sub-expressions" and I think this resolves the issue.

dnovatchev commented 1 year ago

Based on the discussions on a related issue: #281, I propose that we add one more bullet (an additional case of a guarded expression):

michaelhkay commented 1 year ago

Does this proposal apply to static function calls, or dynamic function calls, or both?

In both cases I think the rule should be unnecessary, because the rules for evaluating static function calls and the rules for evaluating dynamic function calls both check the arity and raise an error before they attempt to evaluate the arguments.

We do have a problem, I think, that the "Errors and Optimization" rules are still too liberal. It should be clearer that when we prescribe a detailed sequence of evaluation, as we do for function calls, the outcome should be the same as if that sequence is followed literally.

dnovatchev commented 1 year ago

Does this proposal apply to static function calls, or dynamic function calls, or both?

Both .

In both cases I think the rule should be unnecessary, because the rules for evaluating static function calls and the rules for evaluating dynamic function calls both check the arity and raise an error before they attempt to evaluate the arguments.

The rules for coercion (4.4.4 Function coercion rules) dictate that:

"If F has lower arity than the expected type, then F is wrapped in a new function that declares and ignores the additional argument;"

So in our case the resulting function "has lower arity than the expected type" thus following the function coercion rules no error is raised, the resulting function is "wrapped in a new function that declares and ignores the additional argument".

Following this rule, the resulting function-arity being less than K means that the resulting function is "wrapped in a new function that declares and ignores the additional (N -K) arguments". It is these additional arguments that are ignored, for which the arity of the resulting function is therefore a guard.

dnovatchev commented 1 year ago

We need to consider adding to the list a few more obvious cases of expressions that have guarded subexpressions:

ChristianGruen commented 1 year ago

If the first operand of an arithmetic expression is 0, the second operand does influence the result in numerous cases. Examples:

michaelhkay commented 1 year ago

This is far too constraining on implementations, and achieves very little benefit for applications.

Remember that conditional jumps break the CPU pipeline and slow things down enormously. Doing (eval(x), eval(y), MUL) is much more efficient than doing (eval(x), IF zero GOTO end, eval(y), MUL, end:)

dnovatchev commented 1 year ago

If the first operand of an arithmetic expression is 0, the second operand does influence the result in numerous cases. Examples:

  • 0 * xs:double('NaN') returns NaN
  • 0 div xs:double('-INF') returns -0
  • 0 div 0 raises an error

Fair enough.

I am updating and removing these from the list. Still there are 4 new guard-types,

dnovatchev commented 1 year ago

This is far too constraining on implementations, and achieves very little benefit for applications.

Remember that conditional jumps break the CPU pipeline and slow things down enormously. Doing (eval(x), eval(y), MUL) is much more efficient than doing (eval(x), IF zero GOTO end, eval(y), MUL, end:)

@michaelhkay Duly noted.

I have updated the comment and removed * and div from the proposed new guard-conditions.

michaelhkay commented 1 year ago

E1 ! E2 is already on the list ("In a path expression of the form E1/E2 or E1//E2, and in a mapping expression of the form E1!E2, the right-hand operand E2 is guarded by the existence of the relevant context item in the result of evaluating E1.")

pow(x, y) - I can't see why you want to impose a rule here for the case y=0. An implementation is likely by default to evaluate its arguments from left to right and you're asking for pow() to be treated as a special case with a different order of evaluation. There's no significant user benefit from imposing such complexity on the implementation. And in any case, pow((), 0) is () so it's not true that the value doesn't depend on x.

dnovatchev commented 1 year ago

E1 ! E2 is already on the list ("In a path expression of the form E1/E2 or E1//E2, and in a mapping expression of the form E1!E2, the right-hand operand E2 is guarded by the existence of the relevant context item in the result of evaluating E1.")

Yes, now I see it.

Maybe it would be more easier/obvious to see in the text if the these three cases are listed as separate guard conditions?

pow(x, y) - I can't see why you want to impose a rule here for the case y=0. An implementation is likely by default to evaluate its arguments from left to right and you're asking for pow() to be treated as a special case with a different order of evaluation. There's no significant user benefit from imposing such complexity on the implementation. And in any case, pow((), 0) is () so it's not true that the value doesn't depend on x.

Well, it is not a big deal, but I think pow((), 0) should raise an error. Dobn't see any reason for the described behavior!

michaelhkay commented 1 year ago

All the math functions (like many others...) accept an empty sequence for the first argument and return an empty sequence in that case. I wouldn't argue that it was the right design choice, but we obviously can't change it.

liamquin commented 1 year ago

On Wed, 2023-01-11 at 10:51 -0800, Michael Kay wrote:

All the math functions (like many others...) accept an empty sequence for the first argument and return an empty sequence in that case. I wouldn't argue that it was the right design choice, but we obviously can't change it.

This came from a time when XQuery 1 and XPath 2 had pessimistic static typing, so if something could possibly be an empty sequence it was an error. This meant writing @.***) was a static (compile-time) error since there might not be a width attribute.

So allowing the empty sequence was a result from feedback from users of an early implementation.

I was never sure it was a good choice, and it was made worse by the fact that the implementation in question implemented strong typing but did not support user-defined schemas, so it couldn't consult a schema to find whether @width was required and/or had a fallback value.

-- Liam Quin, https://www.delightfulcomputing.com/ Available for XML/Document/Information Architecture/XSLT/ XSL/XQuery/Web/Text Processing/A11Y training, work & consulting. Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org

dnovatchev commented 1 year ago

All the math functions (like many others...) accept an empty sequence for the first argument and return an empty sequence in that case. I wouldn't argue that it was the right design choice, but we obviously can't change it.

Let us ask the community if there is even a single case of someone relying on this. If none are reported, which seems the most likely outcome, then we will know that changing the semantics of the functions will not affect anyone and is the right thing to do.

I believe that this current "feature" makes some bugs very difficult to catch, thus it is not just a "curiosity" but is really harmful.

ChristianGruen commented 1 year ago

Let us ask the community if there is even a single case of someone relying on this.

We should try to stay focused. Initially, this issue was about the order of predicates.

dnovatchev commented 1 year ago

Let us ask the community if there is even a single case of someone relying on this.

We should try to stay focused. Initially, this issue was about the order of predicates.

Yes, I will open a separate issue/bug

michaelhkay commented 1 year ago

There are thousands of users running XSLT code that hasn't been touched for years, no-one knows who wrote it, let alone what edge cases in the spec it relies on. There is no-one in those organisations who regards themselves as a member of "the community" or who would read any notice we put out to that community, let alone respond to it. People expect to be able to switch in a new Saxon release without it breaking their code. The bar for introducing incompatibilities has to be very high; there have to be very strong benefits to justify the high costs.

michaelhkay commented 1 year ago

And of course people are using this feature. Perhaps not with the pow() function, but certainly with common functions like round(). People write things like if (round(@size) = 3)... all the time, expecting it to evaluate to false if @size does not exist. Breaking this code is unthinkable.

dnovatchev commented 1 year ago

There are thousands of users running XSLT code that hasn't been touched for years, no-one knows who wrote it, let alone what edge cases in the spec it relies on. There is no-one in those organisations who regards themselves as a member of "the community" or who would read any notice we put out to that community, let alone respond to it. People expect to be able to switch in a new Saxon release without it breaking their code. The bar for introducing incompatibilities has to be very high; there have to be very strong benefits to justify the high costs.

Completely understood.

There were very strong beliefs at the time that the Earth was flat. Heretics that dared say it was round were really brave, daring people. There seemed to be no too-strong and immediate benefits of accepting the truth. And it took the Church 350 years to admit that Galileo was right claiming the Earth revolves around the Sun.

Given these facts, why should we be in a hurry to try correcting such minor things as bugs in functions' specifications?

ChristianGruen commented 1 year ago

Given these facts, why should we be in a hurry to try correcting such minor things as bugs in functions' specifications?

Just joking: Given this analogy, maybe we should strive for something more universal than tweaking secondary rules in secondary utility functions…

benibela commented 1 year ago

Just joking: Given this analogy, maybe we should strive for something more universal than tweaking secondary rules in secondary utility functions…

Or a radical cut: Delete the XPath/XQuery/XSLT specs, and port all existing code to Javascript

michaelhkay commented 1 year ago

This thread is going nowhere. The original issue has been resolved: predicates cannot be reordered in a way that changes behaviour in either error or non-error cases. If there are other issues that remain, they should be explained in a new thread.