Closed faassen closed 4 months ago
Yes, all very difficult.
XSD 1.0 is fairly clear that these blocks are excluded: it also says "All [·minimally conforming·] processors [·must·] support the blocks defined in the version of [[Unicode Database]] that is current at the time this specification became a W3C Recommendation. However, implementors are encouraged to support the blocks defined in any future version of the Unicode Standard." which is a pretty strong statement that obsolete blocks like Greek should still be recognized.
XSD 1.1 gives processors much more leeway in how they handle changes in Unicode over time; it also has some rather vague guidance about how unrecognized block names should be handled (either as blocks that match no characters, or as warnings, or as errors).
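To make the three options concrete, here is a rough Python sketch of how a processor might resolve a block name under each one. The `KNOWN_BLOCKS` table, the `resolve_block` function and the policy names are all invented for illustration; none of them come from any spec or from an actual processor.

```python
import warnings

# Hypothetical, deliberately tiny block table. A real processor would load
# the ranges from the Unicode Character Database (Blocks.txt).
KNOWN_BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Cyrillic": (0x0400, 0x04FF),
}


def resolve_block(name, policy="empty"):
    """Return the (first, last) code point range for a block name, or None.

    None stands for "a block that matches no characters".
    policy is "empty", "warn" or "error" (names invented for this sketch).
    """
    if name in KNOWN_BLOCKS:
        return KNOWN_BLOCKS[name]
    if policy == "error":
        # Option 3: the regex is rejected outright.
        raise ValueError(f"unknown Unicode block: {name}")
    if policy == "warn":
        # Option 2: report the problem, then fall through to the empty block.
        warnings.warn(f"unknown Unicode block: {name}")
    elif policy != "empty":
        raise ValueError(f"unknown policy: {policy}")
    # Option 1 (and the fallback for "warn"): the block matches nothing.
    return None
```

A test that expects a match failure passes under "empty" and "warn" but not under "error", which is why a single expected result cannot cover all three behaviours.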
And then, just to add extra uncertainty, XPath 3.1 says "Implementers, even in cases where XSD 1.1 is not supported, are advised to consult the XSD 1.1 regular expression specification for guidance on how to handle cases where the XSD 1.0 specification is unclear or inconsistent."
There are cases where we can handle permitted variations in behaviour using either (a) dependencies (e.g. "unicode-version") or (b) alternative test results. But sometimes I think it's just best to admit defeat and delete tests for which there is no clear specification of what a conformant processor should do.
I'm surprised by the discovery of the Surrogates blocks in the tests. There are only two possible interpretations, I think: either (a) reject the regex as invalid, or (b) treat the block as containing no characters. It's impossible to say definitively which of those is appropriate.
I've dropped the tests for the Surrogates blocks, and modified the tests that use obsolete block names.
In fn/matches.re.xml, there are tests re00461 and re00981, which refer to IsHighSurrogates, and test re00369, which refers to IsLowSurrogates. I'm trying to understand whether these should be accepted (as they are).
In the XML Schema 1.0 specification, in F.1.1:
"The blocks mentioned above exclude the HighSurrogates, LowSurrogates and HighPrivateUseSurrogates blocks. These blocks identify "surrogate" characters, which do not occur at the level of the "character abstraction" that XML instance documents operate on."
This seems clear: these should not be accepted.
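As a sanity check on the "character abstraction" remark, the following Python sketch intersects the three surrogate blocks with the XML 1.0 Char production. The block ranges are the ones from the Unicode block list; the helper name `xml_chars_in_block` is made up for this example.

```python
# XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML_CHAR_RANGES = [
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]

# Unicode block ranges for the three surrogate blocks.
SURROGATE_BLOCKS = {
    "HighSurrogates": (0xD800, 0xDB7F),
    "HighPrivateUseSurrogates": (0xDB80, 0xDBFF),
    "LowSurrogates": (0xDC00, 0xDFFF),
}


def xml_chars_in_block(first, last):
    """Count code points in [first, last] that the XML Char production allows."""
    total = 0
    for lo, hi in XML_CHAR_RANGES:
        overlap = min(last, hi) - max(first, lo) + 1
        if overlap > 0:
            total += overlap
    return total


for name, (first, last) in SURROGATE_BLOCKS.items():
    # Each block lies entirely in the gap between #xD7FF and #xE000.
    print(name, xml_chars_in_block(first, last))
```

Every count comes out as 0, so even a processor that accepts these block names can never match an XML character with them. That does not settle whether the names should be accepted, only that accepting them is equivalent to an empty block.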
In the XML Schema 1.1 specification things are more confusing.
In G.4.2.2 Category escapes, in the category definition at the bottom:
but this is for IsCategory. It appears that in XML Schema 1.1, IsHighSurrogates, IsLowSurrogates and High Private Use Surrogates are accepted, unlike in XML Schema 1.0.
Perhaps we should add a dependency on xsd-version 1.1 for the tests that refer to these?