Closed michaelhkay closed 2 months ago
It's not possible to refer to Unicode character classes without considering the version of Unicode in use. (For my Invisible XML processor, I wound up having to implement that part myself, as Java doesn't implement any single version of Unicode. It implements some version (I forget which, and it probably varies by Java version), plus the extra bits from the next version that they thought were important 🙄 )
I think we should add a "Unicode version" feature so that we can assert results correctly with respect to the version of Unicode.
It is possible to annotate a test with
<dependency type="unicode-version" value="3.1.1"/>
but the feature isn't widely used (less than a dozen tests, I think). In fact, it appears that if a test has a dependency on a specific Unicode version, then in Saxon we're simply not running it! That's because we don't have a very clear handle on which version of Unicode we actually support, which is (a) because it may be different for different functionality (e.g. regexes vs collations vs upper/lower case), and (b) because it may depend on which version of Java you're running under. It's also a bit useless that the dependency names a single specific Unicode version; it should be a range.
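To make the range idea concrete, here is a rough Python sketch of how a test driver might decide whether to run a test. The minimum/maximum range form is hypothetical (the existing dependency only carries a single `value`), as are the names `parse_version` and `in_range`; `unicodedata.unidata_version` is Python's own report of its Unicode database version and merely stands in for whatever the processor under test would report.

```python
import unicodedata


def parse_version(text):
    """Turn '6.0.0' into (6, 0, 0) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def in_range(actual, minimum=None, maximum=None):
    """Hypothetical check: is `actual` inside the inclusive [minimum, maximum] range?

    Either bound may be omitted to leave that side of the range open.
    """
    version = parse_version(actual)
    if minimum is not None and version < parse_version(minimum):
        return False
    if maximum is not None and version > parse_version(maximum):
        return False
    return True


# Would a test that requires Unicode 7.0.0 or later run on this Python's database?
print(unicodedata.unidata_version, in_range(unicodedata.unidata_version, minimum="7.0.0"))
```

Comparing parsed tuples rather than raw strings matters here, because as strings "10.0.0" would sort before "7.0.0".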
I wonder who’ll benefit from such specific test cases. Can’t we simply drop them, or simplify them in a way that they work for different Unicode versions?
Yes, I think that in most cases if a test result varies between Unicode versions then it's probably best to drop it. By definition, they're not going to make changes in Unicode that have a big impact on users.
In this horrible test, I'm getting two characters that don't match the regex
[\w]
when the test expects that they should.
Recall that \w is supposed to match all characters except those in groups P, Z, and C.
In the current Unicode database these are classified as punctuation characters (group P):
2308;LEFT CEILING;Ps;0;ON;;;;;Y;;;;;
2309;RIGHT CEILING;Pe;0;ON;;;;;Y;;;;;
They appear to have changed between Unicode 6.0.0 and 7.0.0 - the database for 6.0.0 has
2308;LEFT CEILING;Sm;0;ON;;;;;Y;;;;;
2309;RIGHT CEILING;Sm;0;ON;;;;;Y;;;;;
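The effect of that reclassification on \w can be checked mechanically. What follows is a minimal Python sketch, assuming the rule recalled above (\w matches everything outside general categories P, Z, and C). The helper names `general_category` and `matches_word_class` are made up for illustration, and the records are just the UnicodeData.txt lines quoted in this comment, not anything read from a live database.

```python
# UnicodeData.txt records quoted above, keyed by the database they came from.
RECORDS = {
    "6.0.0": [
        "2308;LEFT CEILING;Sm;0;ON;;;;;Y;;;;;",
        "2309;RIGHT CEILING;Sm;0;ON;;;;;Y;;;;;",
    ],
    "current": [
        "2308;LEFT CEILING;Ps;0;ON;;;;;Y;;;;;",
        "2309;RIGHT CEILING;Pe;0;ON;;;;;Y;;;;;",
    ],
}


def general_category(record):
    """Return the general category, the third semicolon-separated field."""
    return record.split(";")[2]


def matches_word_class(category):
    """Hypothetical helper: the word class excludes major categories P, Z and C."""
    return category[0] not in "PZC"


for version, records in RECORDS.items():
    for record in records:
        code_point = record.split(";")[0]
        category = general_category(record)
        print(version, "U+" + code_point, category, matches_word_class(category))
```

With the 6.0.0 records both characters fall in category Sm and so count as word characters; with the current records they fall in Ps and Pe and do not, which is the mismatch the test trips over.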