XQFO Code in the Rules sections

ChristianGruen commented 7 months ago

In #978, we are discussing which language is best for presenting code in the Rules sections of the XQFO specification. Currently, XPath is used for compact equivalences, for example…

(: array:size :)
count(array:members($array))

(: fn:remove :)
$input[not(position() = $positions)]

...while XQuery is used for more complex expressions, including function declarations, or when the XPath representation would be syntactically more complex. Examples:

(: fn:deep-equal :)
declare function equal-strings(
  $string1   as xs:string,
  $string2   as xs:string, 
  $collation as xs:string,
  $options   as map(*)
) as xs:boolean {
  let $n1 := if ($options?whitespace = "normalize")
             then normalize-unicode(?, $options?normalization-form) 
             else identity#1
  let $n2 := if ($options?normalize-space)
             then normalize-space#1 
             else identity#1               
  return compare($n1($n2($string1)), $n1($n2($string2)), $collation) eq 0    
};

(: fn:index-where :)
for $item at $pos in $input
where $predicate($item, $pos)
return $pos

(: …flatten, fold-left, while-do, others :)

Finally, we have many cases in which XPath/XQuery code is omitted, either because the presented feature is basic enough, because the equivalent code would get too complicated, or (e.g., for fn:doc) because XPath/XQuery provides no means to express the feature.

We should strive for consistency and decide which language(s) the majority of us believes is the best choice…

XPath & XQuery (what we currently have)
XPath only
XPath, XQuery and XSLT (whatever seems most appropriate)
Other pseudocode
Don’t use pseudocode at all if it is too complex to be represented with moderately simple XPath code

ChristianGruen commented 7 months ago

Thanks to @dnovatchev's discussion, I had the honor of opening issue #1000.

michaelhkay commented 7 months ago

In 3.1 we went for dual exposition in XQuery and XSLT. That was probably a political compromise rather than anything else. In principle it's a bad idea to have two competing normative specifications.

Using XPath everywhere would be nice, but the key problem is that recursive functions in XPath are very cumbersome. Perhaps the best solution to that would be to find a better way of writing recursive functions in XPath. Maybe the answer to that is to add a subset of the XQuery Prolog to the XPath language (perhaps just function declarations and namespace declarations). XPath expressions, as used in XSLT, would still be expressions, with no Prolog, but the XPath specification would define a larger construct (called perhaps an XPath module) containing a Prolog and an Expression.

I don't think there is any fundamental problem with using XQuery as a specification language for the function library. We can use any language we like so long as it has the right data model and is well specified. But for tutorial purposes, it's nice if readers don't have to learn another language.

There's another problem we should be aware of, which is the danger of circular definitions. We should probably be taking more care to classify which constructs are basic and which are derived. Ideally we would define a "core" language containing the basics, and all other constructs would be specified using only this core.

michaelhkay commented 6 months ago

I think it might be useful, if only for our own editorial sanity, if we classified functions, and perhaps syntactic constructs, as intrinsic or extrinsic. Extrinsic functions and constructs should be specified in terms of a "lexical" equivalence that relies only on intrinsic constructs. For example, node-name() is intrinsic; local-name(), namespace-uri(), and name() are extrinsic.

To avoid too much discontinuity from current practice, extrinsic functions should be defined as follows in order of preference:

(a) where possible, an XPath expression that can be used as the body of the function: for example local-name($node) is local-name-from-QName(node-name($node)).

(b) if that's not possible or convenient, use an XQuery expression.

(c) if the equivalent is recursive or requires supporting functions, express the equivalence using XQuery function declarations. Unless we can come up with a more readable way of expressing recursive functions in XPath, we should avoid them.
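
As an illustration of option (a), here is a minimal sketch of an extrinsic function written only in terms of other functions, runnable on an XQuery 3.1 processor. The helper name local:local-name and the sample document are hypothetical. The string() wrapper is an assumption added here so that unnamed nodes yield "" as fn:local-name does; the equivalence quoted above does not include it.

```xquery
xquery version "3.1";

(: Hypothetical helper: local-name expressed via node-name.
   string() turns the empty sequence (unnamed nodes) into "". :)
declare function local:local-name($node as node()?) as xs:string {
  string(local-name-from-QName(node-name($node)))
};

(: Assumed sample data: a prefixed element, an unprefixed child, a text node :)
let $doc := <p:a xmlns:p="urn:example"><b/>text</p:a>
for $n in ($doc, $doc/b, $doc/text())
return local:local-name($n) eq fn:local-name($n)
```

Each of the three comparisons should yield true. The text node case is the one that makes the string() wrapper necessary.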

We should ideally have some way of checking that we haven't introduced any circularity into the specification.
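
One possible shape for such a check, sketched in XQuery 3.1 under stated assumptions: the dependency table below is hypothetical and would in practice have to be extracted from the specification source, and the function name local:cyclic is invented for illustration.

```xquery
xquery version "3.1";

(: Hypothetical dependency table: each key is a function, each value lists
   the functions that its equivalence relies on. :)
declare variable $local:deps := map {
  "local-name": ("local-name-from-QName", "node-name"),
  "name":       ("node-name"),
  "node-name":  ()
};

(: True if following the dependencies of $name leads back to a function
   that is already on the path taken so far. :)
declare function local:cyclic(
  $name as xs:string,
  $path as xs:string*
) as xs:boolean {
  $name = $path
  or (some $d in $local:deps($name) satisfies local:cyclic($d, ($path, $name)))
};

(: Report every function whose definition reaches a cycle :)
for $f in map:keys($local:deps)
where local:cyclic($f, ())
return $f
```

For the table above the query returns the empty sequence. Adding "local-name" to the entry for "node-name" would make both of those functions appear in the result.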

dnovatchev commented 6 months ago

I think it might be useful, if only for our own editorial sanity, if we classified functions, and perhaps syntactic constructs, as intrinsic or extrinsic. Extrinsic functions and constructs should be specified in terms of a "lexical" equivalence that relies only on intrinsic constructs. For example, node-name() is intrinsic; local-name(), namespace-uri(), and name() are extrinsic.

To avoid too much discontinuity from current practice, extrinsic functions should be defined as follows in order of preference:

(a) where possible, an XPath expression that can be used as the body of the function: for example local-name($node) is local-name-from-QName(node-name($node)).

👍

(b) if that's not possible or convenient, use an XQuery expression.

(c) if the equivalent is recursive or requires supporting functions, express the equivalence using XQuery function declarations. Unless we can come up with a more readable way of expressing recursive functions in XPath, we should avoid them.

Convenience is a rather subjective term.

We could have an introductory section which explains how XPath code is used in the Rules and which provides a simple example of recursion in XPath.

We can precede a recursive function definition with a suitable comment, such as:

(: Recursive function $factorial defined with the help of $factorial-inner :)
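
A minimal sketch of what such an introductory example might look like, in XPath 3.1 syntax. Only the names $factorial and $factorial-inner come from the comment above; the function bodies are an assumption made for illustration.

```xquery
(: Recursive function $factorial defined with the help of $factorial-inner.
   $factorial-inner receives itself as $self so that it can recurse. :)
let $factorial-inner := function($n as xs:integer, $self as function(*)) as xs:integer {
      if ($n le 1) then 1 else $n * $self($n - 1, $self)
    },
    $factorial := function($n as xs:integer) as xs:integer {
      $factorial-inner($n, $factorial-inner)
    }
return $factorial(5)
```

The expression evaluates to 120.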

We shouldn't suppose that all implementors know XQuery well and any such assumption may really discriminate and be unfair to some potential implementors.

Some implementors may not even have an XQuery processor available, not to mention an XQuery 4.0 processor.

Not to mention any regular XSLT developer (user) who has little confidence in their XQuery knowledge and would thus feel unfairly and unnecessarily taxed, and not confident, when reading the XQuery code.

They may also remain with the impression that the creators of the Spec were unfairly biased towards XQuery...

ChristianGruen commented 6 months ago

We shouldn't suppose that all implementors know XQuery well and any such assumption may really discriminate and be unfair to some potential implementors.

Until someone proves me wrong, I feel safe to say that everyone who’s smart enough to write an XPath 4.0 implementation will certainly be smart enough to understand some basic XQuery or XSLT code – let alone function declarations that basically have the same syntax as the function signatures in the XQFO spec.

No matter which language or pseudo-code we use, we should avoid syntactical constructs that don’t focus on the actual task, such as self-referencing function arguments. Just because something can be done doesn’t mean it should be done.

dnovatchev commented 6 months ago

Until someone proves me wrong, I feel safe to say that everyone who’s smart enough to write an XPath 4.0 implementation will certainly be smart enough to understand some basic XQuery or XSLT code – let alone function declarations that basically have the same syntax as the function signatures in the XQFO spec.

No matter which language or pseudo-code we use, we should avoid syntactical constructs that don’t focus on the actual task, such as self-referencing function arguments. Just because something can be done doesn’t mean it should be done.

Said the author of an XQuery processor ...

ChristianGruen commented 6 months ago

Said the author of an XQuery processor ...

…who included XSLT in the reply, and added a second comment on conciseness, which is more important than the language used.

The discussion is getting absurd. This is the current XQuery code for fn:flatten in the spec:

declare function flatten(
  $input as item()*
) as item()* {
  for $item in $input
  return if ($item instance of array(*)) then flatten(array:values($item)) else $item
};

An equivalent XPath version would be:

let $flatten-inner := function(
  $input as item()*,
  $self  as function(*)
) as item()* {
  for $item in $input
  return if ($item instance of array(*)) then $self(array:values($item), $self) else $item  
}
let $flatten := function(
  $input as item()*
) as item()* {
  $flatten-inner($input, $flatten-inner)
}
return ...

Only a minority of users would be able to judge that the second version is XPath at all.

ChristianGruen commented 6 months ago

I’ve added another suggestion, as a fallback if we shouldn’t manage to agree on one of the other alternatives:

Don’t use pseudocode at all if it is too complex to be represented with moderately simple XPath code

dnovatchev commented 6 months ago

I’ve added another suggestion, as a fallback if we shouldn’t manage to agree on one of the other alternatives:

Don’t use pseudocode at all if it is too complex to be represented with moderately simple XPath code

Quite a few issues with this:

1. Such a rule is obviously subjective: what is "too complex" for one person may be "normal" for another.
2. This will in certain cases prohibit the provision of executable code, that is, an oracle for the results of the function. Having an oracle is immensely better than not having one. It simply eliminates the possibility of one implementor understanding and interpreting the rules differently from other implementors, which would result in different implementations of the function producing different results for some calls with the same arguments. Even if we have a thousand tests, having an oracle is much better, as it covers the infinite number of possible tests.
3. When the provided code is considered "too complex", its readability can always be enhanced by providing comments within the code, or by adding explanation in the Rules text. The code still remains the same: the only authoritative and precise tool for judging the correctness of any implementation. The same cannot be said of verbal-only rules, which can be interpreted differently by different readers, and whose correctness, completeness and lack of ambiguity or contradiction cannot be proven at all.

I think point 3. above is especially important. It tells us that we can always add more explanations without sacrificing the oracle code.

To summarize:

If we need such guidelines at all, they should be preceded by this one at the start of the list:

0. Having a test oracle is always better than not having one. Adhere to the remaining guidelines in this list only if they do not preclude the possibility to have a test oracle.

michaelhkay commented 6 months ago

I think my experience is that if the "oracle" requires more than about 10 lines of code, there's a strong danger of it being incorrect or imprecise: for example, making unwarranted assumptions about overflow behaviour. In addition, the primary purpose of the specification is to communicate clearly and precisely, and there comes a point where you can do that better in English than in complex code. So yes, it absolutely has to be subjective: deciding whether you're communicating clearly is a judgement call.

Generally my experience over my career has been that formalism, if taken to extremes, is counter-productive. The XQuery formal semantics was a good example of that. You end up with the situation where only a small elite tries to read the specification, which means it doesn't get widespread review, which means it has bugs. Oracles are useful, but only if they are sufficiently simple that the bugs stare out at you. They also need to be tested.

dnovatchev commented 6 months ago

I edited my previous comment and added this:

To summarize:

If we need such guidelines at all, they should be preceded by this one at the start of the list:

0. Having a test oracle is always better than not having one. Adhere to the remaining guidelines in this list only if they do not preclude the possibility to have a test oracle.

dnovatchev commented 6 months ago

I think my experience is that if the "oracle" requires more than about 10 lines of code, there's a strong danger of it being incorrect or imprecise: for example, making unwarranted assumptions about overflow behaviour. In addition, the primary purpose of the specification is to communicate clearly and precisely, and there comes a point where you can do that better in English than in complex code. So yes, it absolutely has to be subjective: deciding whether you're communicating clearly is a judgement call.

Generally my experience over my career has been that formalism, if taken to extremes, is counter-productive. The XQuery formal semantics was a good example of that. You end up with the situation where only a small elite tries to read the specification, which means it doesn't get widespread review, which means it has bugs. Oracles are useful, but only if they are sufficiently simple that the bugs stare out at you. They also need to be tested.

Yes, so let us try to define "sufficiently simple"?

Some people say that the code should be contained in a single screen with no need for scrolling.

Others advise that the number of lines should not exceed 30 or 50.

There are also objective code complexity metrics and some are already used in popular IDEs - for example Visual Studio issues an error message if the cyclomatic complexity exceeds 50.

In all cases, having such precise and objective criteria is better and more truly indicative than having someone tell you that they cannot understand the code.

Pupils at school need encouragement to learn, not acknowledgements that they cannot.

dnovatchev commented 6 months ago

Generally my experience over my career has been that formalism, if taken to extremes, is counter-productive. The XQuery formal semantics was a good example of that. You end up with the situation where only a small elite tries to read the specification,

We are not talking about that kind of formalism, but rather about a few lines of XPath that everybody can readily execute, even in their existing XPath processors.

So it is not "only a small elite": everyone can execute and test it, with as many tests (an unlimited number) as they can independently come up with.

which means it doesn't get widespread review, which means it has bugs. Oracles are useful, but only if they are sufficiently simple that the bugs stare out at you. They also need to be tested.

Very good point. In case the results of some proposed tests are not what the Oracle produces, then we have immediately discovered a discrepancy - either in the proposed tests, or in the Oracle, or in our understanding of what the function does - and thus we can immediately rectify the problem, as opposed to the case where we do not have an Oracle and the problem remains unnoticed indefinitely.
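
To make the idea concrete, here is a hedged sketch of how an oracle could be compared against an implementation in XQuery 3.1. It uses the fn:remove equivalence quoted at the start of this issue, restricted to a single position so that it also runs on a 3.1 processor. The name local:remove-oracle and the test cases are assumptions made for illustration.

```xquery
xquery version "3.1";

(: Oracle: the equivalence given for fn:remove, restricted to one position :)
declare function local:remove-oracle(
  $input    as item()*,
  $position as xs:integer
) as item()* {
  $input[not(position() = $position)]
};

(: Assumed test cases, each an array of the form [input-as-array, position] :)
let $cases := (
  [ [], 1 ],
  [ ["a", "b", "c"], 2 ],
  [ ["a", "b", "c"], 0 ],
  [ ["a", "b", "c"], 4 ]
)
for $case in $cases
let $input := $case(1)?*
let $pos   := $case(2)
(: Keep only the cases where the implementation and the oracle disagree :)
where not(deep-equal(fn:remove($input, $pos), local:remove-oracle($input, $pos)))
return $case
```

A conforming processor should return the empty sequence. Any case that is returned points to a discrepancy in the implementation, in the oracle, or in the test case itself.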

ChristianGruen commented 3 months ago

I don’t think we’ll make any further progress here (unless someone wants to prepare a PR that unifies all existing code snippets), so I propose to close the issue.

ndw commented 3 months ago

The CG agreed to close this issue without further action at meeting 079.

qt4cg / qtspecs

XQFO Code in the Rules sections #1000