w3c / qt3tests

Tests for XPath and XQuery

IsHighSurrogates and IsLowSurrogates #61

Closed faassen closed 4 months ago

faassen commented 4 months ago

In fn/matches.re.xml, there are tests re00461 and re00981, which refer to IsHighSurrogates, and test re00369, which refers to IsLowSurrogates. I'm trying to understand whether these should be accepted (as they are).

In the XML Schema 1.0 specification, in F.1.1:

"The blocks mentioned above exclude the HighSurrogates, LowSurrogates and HighPrivateUseSurrogates blocks. These blocks identify "surrogate" characters, which do not occur at the level of the "character abstraction" that XML instance documents operate on."

This seems clear: these should not be accepted.

In the XML Schema 1.1 specification things are more confusing.

In G.4.2.2 Category escapes, in the category definition at the bottom:

Note: The properties mentioned above exclude the Cs property. The Cs property identifies "surrogate" characters, which do not occur at the level of the "character abstraction" that XML instance documents operate on.

But this is for IsCategory. It appears that in XML Schema 1.1, IsHighSurrogates, IsLowSurrogates and High Private Use Surrogates are accepted, unlike in XML Schema 1.0.

Perhaps we should add a dependency type on xsd-version 1.1 for the tests that refer to these?

michaelhkay commented 4 months ago

Yes, all very difficult.

XSD 1.0 is fairly clear that these blocks are excluded: it also says "All [·minimally conforming·] processors [·must·] support the blocks defined in the version of [[Unicode Database]] that is current at the time this specification became a W3C Recommendation. However, implementors are encouraged to support the blocks defined in any future version of the Unicode Standard." which is a pretty strong statement that obsolete blocks like Greek should still be recognized.

XSD 1.1 gives processors much more leeway in terms of how they handle changes in Unicode over time; it also has some rather vague guidance about how unrecognized block names should be handled (either as blocks that match no characters, or as warnings, or as errors).

And then, just to add extra uncertainty, XPath 3.1 says "Implementers, even in cases where XSD 1.1 is not supported, are advised to consult the XSD 1.1 regular expression specification for guidance on how to handle cases where the XSD 1.0 specification is unclear or inconsistent."

There are cases where we can handle permitted variations in behaviour using either (a) dependencies (e.g. "unicode-version") or (b) alternative test results. But sometimes I think it's just best to admit defeat and delete tests for which there is no clear specification of what a conformant processor should do.

I'm surprised by the discovery of the Surrogates blocks in the tests. There are only two possible interpretations, I think: either (a) reject the regex as invalid, or (b) treat the block as containing no characters. It's impossible to say definitively which of those is appropriate.
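To make the two interpretations concrete, here is a minimal Python sketch. It is an illustration only, not how any particular processor works: the function name `apply_surrogate_policy`, the `InvalidRegex` exception, and the policy names `"reject"` and `"empty"` are all hypothetical. It also assumes the block escape appears outside a bracket expression and is not preceded by an escaped backslash.

```python
import re

# Block names taken from the XSD 1.0 note quoted in this thread.
SURROGATE_BLOCKS = {"HighSurrogates", "LowSurrogates", "HighPrivateUseSurrogates"}

# Matches XSD-style block escapes such as \p{IsHighSurrogates} or \P{IsGreek}.
BLOCK_ESCAPE = re.compile(r"\\([pP])\{Is([A-Za-z0-9-]+)\}")


class InvalidRegex(ValueError):
    """Hypothetical error for interpretation (a): the regex is rejected."""


def apply_surrogate_policy(pattern, policy):
    """Return the pattern after applying one of the two interpretations.

    policy == "reject": raise InvalidRegex if a surrogate block is named.
    policy == "empty":  treat the block as containing no characters.
    """
    if policy not in ("reject", "empty"):
        raise ValueError(f"unknown policy: {policy}")

    def replace(match):
        kind, block = match.group(1), match.group(2)
        if block not in SURROGATE_BLOCKS:
            return match.group(0)  # leave other block escapes untouched
        if policy == "reject":
            raise InvalidRegex(f"block Is{block} is not allowed")
        # \p{...} on an empty block matches no character;
        # \P{...} (the complement) matches any character.
        return r"[^\s\S]" if kind == "p" else r"[\s\S]"

    return BLOCK_ESCAPE.sub(replace, pattern)
```

Under `"empty"`, a pattern like `a\p{IsHighSurrogates}` is still a valid regex but can never match, whereas under `"reject"` the same pattern is a static error. A test that expects one outcome will fail on a processor that chose the other, which is why no single expected result can be given.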

michaelhkay commented 4 months ago

I've dropped the tests for the Surrogates blocks, and modified the tests that use obsolete block names.