Closed michaelhkay closed 6 months ago
In the case of the PrivateUse block, the change is a bit more complex because the block no longer exists in its original form; it has been split into several separate blocks (all the Unicode blocks are now contiguous sequences of codepoints).
It's worth citing the XSD 1.1 spec here:
[[Unicode Database]] has been revised since XSD 1.0 was published, and is subject to future revision. In particular, the grouping of code points into blocks has changed, and may change again. All [·minimally conforming·] processors must support the blocks defined in the version of [[Unicode Database]] cited in the normative references ([Normative (§K.1)]) or in some later version of the Unicode database. Implementors are encouraged to support the blocks defined in earlier and/or later versions of the Unicode Standard. When the implementation supports multiple versions of the Unicode database, and they differ in salient respects (e.g. different characters are assigned to a given block in different versions of the database), then it is [·implementation-defined·] which set of block definitions is used for any given assessment episode.
In particular, the version of [[Unicode Database]] referenced in XSD 1.0 (namely, Unicode 3.1) contained a number of blocks which have been renamed in later versions of the database. Since the older block names may appear in regular expressions within XSD 1.0 schemas, implementors are encouraged to support the superseded block names in XSD 1.1 processors for compatibility, either by default or [·at user option·]. At the time this document was prepared, block names from Unicode 3.1 known to have been superseded in this way included:
#x0370 - #x03FF: Greek
#x20D0 - #x20FF: CombiningMarksforSymbols
#xE000 - #xF8FF: PrivateUse
#xF0000 - #xFFFFD: PrivateUse
#x100000 - #x10FFFD: PrivateUse
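For illustration only, here is a rough Python sketch of how a processor might keep the superseded names working for compatibility. The table and the function name `in_legacy_block` are hypothetical (nothing here is taken from Saxon or any other implementation); the ranges are copied from the list quoted above.

```python
# Ranges copied from the XSD 1.1 list of superseded Unicode 3.1 block names.
# LEGACY_BLOCKS and in_legacy_block are hypothetical names for this sketch.
LEGACY_BLOCKS = {
    "Greek": [(0x0370, 0x03FF)],
    "CombiningMarksforSymbols": [(0x20D0, 0x20FF)],
    # PrivateUse covered three separate ranges, so it is not contiguous.
    "PrivateUse": [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)],
}


def in_legacy_block(block_name: str, char: str) -> bool:
    """Return True if char lies in the Unicode 3.1 block called block_name."""
    ranges = LEGACY_BLOCKS.get(block_name)
    if ranges is None:
        raise KeyError(f"not a superseded block name: {block_name}")
    code_point = ord(char)
    return any(low <= code_point <= high for low, high in ranges)
```

The PrivateUse entry shows why that name cannot be handled as a simple alias: it needs all three ranges, whereas each current block is a single contiguous range.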
As far as the XSLT test suite is concerned, I'm going to drop use of the old names. However, Saxon will probably continue to support them.
Tests now updated to avoid obsolete block names.
A number of regular expression tests are failing because Unicode has changed since the tests were written. Specifically:
The tests reference the groups IsGreek, IsPrivateUse, and IsCombiningMarksforSymbols.
The group IsGreek was renamed IsGreekAndCoptic; IsCombiningMarksforSymbols has been renamed IsCombiningDiacriticalMarksforSymbols; IsPrivateUse is now IsPrivateUseArea.
The specification (F+O 3.1) says "A regular expression that uses a Unicode block name that is not defined in the version(s) of Unicode supported by the processor (for example \p{IsBadBlockName}) is deemed to be invalid [[err:FORX0002]]."
Implementations might choose to support obsolete block names for backwards compatibility, but as far as the tests are concerned, I think we should stick to block names as used in recent versions of Unicode. The new names appear to be valid at least as far back as Unicode 5.0.0, dated 2006.
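If it helps anyone migrating their own tests, the renaming can be sketched as a small Python helper. This is only a sketch under assumptions: the function name `modernize_block_names` is made up for this example, the mapping uses the spellings given in this issue, and PrivateUse is mapped only to IsPrivateUseArea, which covers just the #xE000 - #xF8FF part of the old block.

```python
import re

# Old name -> new name, using the spellings from this issue.
# Assumption: PrivateUse maps to PrivateUseArea only, which does not cover
# the two supplementary-plane ranges of the old block.
RENAMED = {
    "Greek": "GreekAndCoptic",
    "CombiningMarksforSymbols": "CombiningDiacriticalMarksforSymbols",
    "PrivateUse": "PrivateUseArea",
}

# Match every backslash escape, so that an escaped backslash followed by
# "p{IsGreek}" is consumed as "\\" and the literal text after it is untouched.
_ESCAPE = re.compile(r"\\(?:([pP])\{Is([A-Za-z0-9-]+)\}|.)", re.DOTALL)


def modernize_block_names(pattern: str) -> str:
    """Rewrite \\p{IsOld} and \\P{IsOld} in pattern to the current block names."""

    def fix(match: re.Match) -> str:
        kind, name = match.group(1), match.group(2)
        if kind is None or name not in RENAMED:
            return match.group(0)  # some other escape, or a name we keep
        return "\\" + kind + "{Is" + RENAMED[name] + "}"

    return _ESCAPE.sub(fix, pattern)
```

For example, `modernize_block_names(r"\p{IsGreek}+")` gives `\p{IsGreekAndCoptic}+`, while a pattern such as `\p{IsBasicLatin}` passes through unchanged.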