w3c / qt3tests

Tests for XPath and XQuery

matches.re.xml re00984 considers 8968 and 8969 as word characters but they are punctuation #62

Closed: faassen closed this 4 months ago

faassen commented 4 months ago

In test re00984, the characters &#8968; and &#8969; (that is, ⌈ and ⌉, U+2308 and U+2309) are treated as word characters, but they are in category P, punctuation. According to the XML Schema 1.1 spec they should therefore not match \w, which is defined as:

[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of "punctuation", "separator" and "other" characters)

See for instance here:

https://en.wiktionary.org/wiki/Appendix:Unicode/Miscellaneous_Technical

Or directly from the UCD's UnicodeData.txt:

2308;LEFT CEILING;Ps;0;ON;;;;;Y;;;;;
2309;RIGHT CEILING;Pe;0;ON;;;;;Y;;;;;
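
For anyone reading these records for the first time: the fields are semicolon-separated, the first is the code point in hex, and the third is the General_Category. A minimal Python sketch that pulls those fields out of the two lines quoted above (the helper name `parse_ucd_line` is made up for illustration):

```python
# The two UnicodeData.txt records quoted in this issue.
UCD_LINES = """\
2308;LEFT CEILING;Ps;0;ON;;;;;Y;;;;;
2309;RIGHT CEILING;Pe;0;ON;;;;;Y;;;;;
"""


def parse_ucd_line(line):
    """Return (code point, name, general category) from one record."""
    fields = line.split(";")
    return int(fields[0], 16), fields[1], fields[2]


records = [parse_ucd_line(line) for line in UCD_LINES.splitlines()]

for code_point, name, category in records:
    # Any category starting with "P" belongs to \p{P}.
    is_punct = category.startswith("P")
    print(f"U+{code_point:04X} (decimal {code_point}) {name}: "
          f"{category}, punctuation={is_punct}")

# prints:
# U+2308 (decimal 8968) LEFT CEILING: Ps, punctuation=True
# U+2309 (decimal 8969) RIGHT CEILING: Pe, punctuation=True
```

The decimal values 8968 and 8969 are the ones used in the test's character references.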

My proposal is to remove these two characters from the test.
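
The \w definition quoted above can also be checked mechanically. This is only a rough sketch, assuming Python's `unicodedata` module as the source of General_Category values; the function name `is_xsd_word_char` is hypothetical, and the result depends on the Unicode version bundled with the Python build in use:

```python
import unicodedata


def is_xsd_word_char(ch: str) -> bool:
    """Approximate XSD \\w: every character except those in \\p{P}, \\p{Z}, \\p{C}."""
    # The first letter of the General_Category is the major class (P, Z, C, L, ...).
    major_class = unicodedata.category(ch)[0]
    return major_class not in ("P", "Z", "C")


for ch in ("\u2308", "\u2309", "a"):
    print(f"U+{ord(ch):04X} category={unicodedata.category(ch)} "
          f"word={is_xsd_word_char(ch)}")
```

With a Unicode database that matches the UnicodeData.txt lines above, both ceiling characters should come out as non-word characters, which is what the proposed test change assumes.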

faassen commented 4 months ago

Ah, I see you fixed it already just now!