Regular expression affected by Unicode changes

michaelhkay commented 6 months ago

A number of regular expression tests are failing because Unicode has changed since the tests were written. Specifically:

  regex-syntax-0256
  regex-syntax-0288
  regex-syntax-0338
  regex-syntax-0370
  regex-syntax-0394
  regex-syntax-0429
  regex-syntax-0480
  regex-syntax-0738

These tests reference the block groups IsGreek, IsPrivateUse, and IsCombiningMarksforSymbols.

The group IsGreek was renamed IsGreekAndCoptic; IsCombiningMarksforSymbols has been renamed IsCombiningDiacriticalMarksforSymbols; IsPrivateUse is now IsPrivateUseArea.

The specification (F+O 3.1) says "A regular expression that uses a Unicode block name that is not defined in the version(s) of Unicode supported by the processor (for example \p{IsBadBlockName}) is deemed to be invalid [[err:FORX0002]]."
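To make the rule concrete, here is a rough Python sketch of how a processor might reject unknown block names before compiling a regex. The `SUPPORTED_BLOCKS` set, the `RegexBlockError` class, and `check_block_names` are hypothetical names invented for illustration; a real processor would take its block list from the Unicode version(s) it supports, and might match names more loosely than the exact comparison used here.

```python
import re

# Hypothetical, deliberately tiny set of supported block names, spelled as
# in this thread. A real processor would derive the full list from the
# Unicode data for the version(s) it supports.
SUPPORTED_BLOCKS = {
    "BasicLatin",
    "GreekAndCoptic",
    "CombiningDiacriticalMarksforSymbols",
    "PrivateUseArea",
}

# Matches any backslash escape. Group 1 is set only for \p{...} / \P{...}.
# Consuming every escape in turn means an escaped backslash followed by
# "p{...}" is not mistaken for a property escape.
_ESCAPE = re.compile(r"\\(?:[pP]\{([^}]*)\}|.)", re.DOTALL)


class RegexBlockError(ValueError):
    """Hypothetical error type carrying the FORX0002 code."""

    code = "FORX0002"


def check_block_names(regex, supported=SUPPORTED_BLOCKS):
    """Raise RegexBlockError if regex uses an unsupported \\p{Is...} block."""
    for match in _ESCAPE.finditer(regex):
        prop = match.group(1)
        if prop is None or not prop.startswith("Is"):
            continue  # some other escape, or a category such as \p{Lu}
        block = prop[2:]
        if block not in supported:
            raise RegexBlockError(
                f"FORX0002: unknown Unicode block name {block!r}"
            )
```

With this sketch, `check_block_names(r"\p{IsGreekAndCoptic}+")` passes silently, while `check_block_names(r"\p{IsBadBlockName}")` raises.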

Implementations might choose to support obsolete block names for backwards compatibility, but as far as the tests are concerned, I think we should stick to block names as used in recent versions of Unicode. The new names appear to be valid at least as far back as Unicode 5.0.0, dated 2006.
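For anyone with their own tests or stylesheets that still use the old names, the renames listed above could be applied mechanically. This is only a sketch: `RENAMED_BLOCKS` and `modernize_block_names` are made-up names, and the mapping contains just the three renames mentioned in this issue. Note that mapping IsPrivateUse to IsPrivateUseArea is not an exact equivalent, because the old block also covered the private-use ranges in planes 15 and 16.

```python
import re

# The three renames discussed in this issue (old name -> new name),
# spelled as they appear in the thread.
RENAMED_BLOCKS = {
    "Greek": "GreekAndCoptic",
    "CombiningMarksforSymbols": "CombiningDiacriticalMarksforSymbols",
    "PrivateUse": "PrivateUseArea",
}

# Matches any backslash escape; groups 1 and 2 are set only for
# \p{Is...} or \P{Is...}. Other escapes are consumed and left unchanged.
_ESCAPE = re.compile(r"\\(?:([pP])\{Is([^}]*)\}|.)", re.DOTALL)


def modernize_block_names(regex):
    """Return regex with obsolete block names replaced by current ones."""

    def replace(match):
        kind, name = match.group(1), match.group(2)
        if kind is None or name not in RENAMED_BLOCKS:
            return match.group(0)  # leave everything else untouched
        return f"\\{kind}{{Is{RENAMED_BLOCKS[name]}}}"

    return _ESCAPE.sub(replace, regex)
```

For example, `modernize_block_names(r"[\p{IsGreek}\P{IsPrivateUse}]")` gives `[\p{IsGreekAndCoptic}\P{IsPrivateUseArea}]`, and names that are already current pass through unchanged.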

michaelhkay commented 6 months ago

In the case of the PrivateUse block, the change is a bit more complex because the block no longer exists in its original form; it has been split into several separate blocks (all the Unicode blocks are now contiguous sequences of codepoints).

michaelhkay commented 6 months ago

It's worth citing the XSD 1.1 spec here:

[[Unicode Database]] has been revised since XSD 1.0 was published, and is subject to future revision. In particular, the grouping of code points into blocks has changed, and may change again. All [·minimally conforming·] processors must support the blocks defined in the version of [[Unicode Database]] cited in the normative references ([Normative (§K.1)]) or in some later version of the Unicode database. Implementors are encouraged to support the blocks defined in earlier and/or later versions of the Unicode Standard. When the implementation supports multiple versions of the Unicode database, and they differ in salient respects (e.g. different characters are assigned to a given block in different versions of the database), then it is [·implementation-defined·] which set of block definitions is used for any given assessment episode.

In particular, the version of [[Unicode Database]] referenced in XSD 1.0 (namely, Unicode 3.1) contained a number of blocks which have been renamed in later versions of the database. Since the older block names may appear in regular expressions within XSD 1.0 schemas, implementors are encouraged to support the superseded block names in XSD 1.1 processors for compatibility, either by default or [·at user option·]. At the time this document was prepared, block names from Unicode 3.1 known to have been superseded in this way included:

#x0370 - #x03FF: Greek

#x20D0 - #x20FF: CombiningMarksforSymbols

#xE000 - #xF8FF: PrivateUse

#xF0000 - #xFFFFD: PrivateUse

#x100000 - #x10FFFD: PrivateUse

As far as the XSLT test suite is concerned, I'm going to drop use of the old names. However, Saxon will probably continue to support them.
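For a processor that does want to keep accepting the superseded names, one possible approach is to expand them to the explicit codepoint ranges quoted above. The following Python sketch is an assumption about how that could look, not a description of what Saxon does; `SUPERSEDED_BLOCK_RANGES` and `block_to_char_class` are hypothetical names.

```python
import re

# Codepoint ranges for the superseded Unicode 3.1 block names, taken from
# the XSD 1.1 text quoted above. PrivateUse covers three separate ranges.
SUPERSEDED_BLOCK_RANGES = {
    "Greek": [(0x0370, 0x03FF)],
    "CombiningMarksforSymbols": [(0x20D0, 0x20FF)],
    "PrivateUse": [
        (0xE000, 0xF8FF),
        (0xF0000, 0xFFFFD),
        (0x100000, 0x10FFFD),
    ],
}


def block_to_char_class(name):
    """Return a character class (Python re syntax) for an old block name."""
    ranges = SUPERSEDED_BLOCK_RANGES[name]
    # \UXXXXXXXX escapes are understood by Python's re module in str patterns.
    body = "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in ranges)
    return f"[{body}]"


# Usage: test whether a character falls inside the old PrivateUse block.
private_use = re.compile(block_to_char_class("PrivateUse"))
in_bmp_range = private_use.fullmatch("\uE000") is not None
in_plane_16 = private_use.fullmatch("\U0010FFFD") is not None
```

Expanding to ranges sidesteps the question of which Unicode version the underlying regex engine knows about, at the cost of hard-coding the old block boundaries.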

michaelhkay commented 6 months ago

Tests now updated to avoid obsolete block names.

w3c / xslt30-test

Regular expression affected by Unicode changes #76
