Test fn-matches.re/re00984

qt4cg / qt4tests

QT4 tests

https://qt4cg.org/

Test fn-matches.re/re00984 #117

Closed michaelhkay closed 2 months ago

michaelhkay commented 7 months ago

In this horrible test, I'm getting two characters that don't match the regex [\w] when the test expects that they should:

8968 - U+2308 - left ceiling
8969 - U+2309 - right ceiling

Recall that \w is supposed to match all characters except those in groups P, Z, and C.

In the current Unicode database these are classified as punctuation characters (group P):

2308;LEFT CEILING;Ps;0;ON;;;;;Y;;;;;
2309;RIGHT CEILING;Pe;0;ON;;;;;Y;;;;;

They appear to have changed between Unicode 6.0.0 and 7.0.0 - the database for 6.0.0 has

2308;LEFT CEILING;Sm;0;ON;;;;;Y;;;;;
2309;RIGHT CEILING;Sm;0;ON;;;;;Y;;;;;
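
For anyone wanting to check which classification their own runtime uses, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `is_xsd_word_char` is hypothetical; it simply applies the rule quoted above (everything except groups P, Z, and C) and is not taken from any XPath processor. The output depends on the Unicode database bundled with the interpreter, which `unicodedata.unidata_version` reports.

```python
import unicodedata


def is_xsd_word_char(ch: str) -> bool:
    """Hypothetical helper: True if ch lies outside general-category groups P, Z, and C."""
    return unicodedata.category(ch)[0] not in "PZC"


def report(code_point: int) -> str:
    """Describe one code point using the interpreter's bundled Unicode database."""
    ch = chr(code_point)
    return "U+{:04X} {} category={} word_char={}".format(
        code_point,
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        is_xsd_word_char(ch),
    )


if __name__ == "__main__":
    print("Unicode database version:", unicodedata.unidata_version)
    for cp in (0x2308, 0x2309):  # LEFT CEILING, RIGHT CEILING
        print(report(cp))
```

Python's own `re` module defines `\w` differently (alphanumerics plus underscore), so the sketch checks the general category directly instead of using a regex.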

ndw commented 7 months ago

It's not possible to refer to Unicode character classes without considering the version of Unicode in use. (For my Invisible XML processor, I wound up having to implement that part myself as Java doesn't implement any single version of Unicode. It implements some version (I forget which and it probably varies by Java version), plus the extra bits from the next version that they thought were important 🙄 )

I think we should add a "Unicode version" feature so that we can assert results correctly with respect to the version of Unicode.

michaelhkay commented 7 months ago

It is possible to annotate a test with

<dependency type="unicode-version" value="3.1.1"/>

but the feature isn't widely used (fewer than a dozen tests, I think). In fact, it appears that if a test has a dependency on a specific Unicode version, then in Saxon we're simply not running it! That's because we don't have a very clear handle on which version of Unicode we actually support, which is (a) because it may be different for different functionality (e.g. regexes vs collations vs upper/lower case), and (b) because it may depend on which version of Java you're running under. A dependency on a single specific Unicode version is also a bit useless; it should be a range.
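
To illustrate the range idea, here is a minimal sketch of how a test driver might decide whether a test applies. Everything in it is hypothetical: the function names and the `minimum`/`maximum` parameters are assumptions made for illustration, and the existing `unicode-version` dependency takes only a single value.

```python
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse '7.0' or '7.0.0' into a 3-part integer tuple, padding with zeros."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def in_range(supported: str,
             minimum: Optional[str] = None,
             maximum: Optional[str] = None) -> bool:
    """Hypothetical check: is the supported Unicode version within [minimum, maximum]?"""
    v = parse_version(supported)
    if minimum is not None and v < parse_version(minimum):
        return False
    if maximum is not None and v > parse_version(maximum):
        return False
    return True
```

Comparing integer tuples rather than raw strings matters here, because as strings "10.0.0" sorts before "9.0.0".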

ChristianGruen commented 7 months ago

I wonder who’ll benefit from such specific test cases. Can’t we simply drop them, or simplify them in a way that they work for different Unicode versions?

michaelhkay commented 7 months ago

Yes, I think that in most cases if a test result varies between Unicode versions then it's probably best to drop it. By definition, the Unicode maintainers are not going to make changes that have a big impact on users.

ChristianGruen commented 2 months ago

https://github.com/qt4cg/qt4tests/commit/1586f70cda14f2eaaef65baf4415ae97f9bbba7a#diff-37953b90b6f4491541e281e5bca5d4a4c531df00afa3051faada9a9c1509d8a0R8096