Closed eggrobin closed 3 months ago
test failure: https://github.com/unicode-org/unicodetools/actions/runs/10322995005/job/28579439631?pr=908
ParseErrorCount=6
TestFailureCount=0
Error: Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 63.737 s <<< FAILURE! - in org.unicode.text.UCD.TestTestUnicodeInvariants
Error: org.unicode.text.UCD.TestTestUnicodeInvariants.testUnicodeInvariants Time elapsed: 63.719 s <<< FAILURE!
org.opentest4j.AssertionFailedError: TestUnicodeInvariants.testInvariants(default) failed ==> expected: <0> but was: <6>
at org.unicode.text.UCD.TestTestUnicodeInvariants.testUnicodeInvariants(TestTestUnicodeInvariants.java:39)
Is there an ICU ticket? Would you be willing to work on it?
@aheninger FYI
The ICU User Guide does not go into details about \N
: https://unicode-org.github.io/icu/userguide/strings/regexp.html
Is there an ICU ticket?
Not yet. Could you file one?
Would you be willing to work on it?
Yes, if you file the ticket please assign it to me.
Actually, at a closer look, while there is some documentation for ICU regex, there is no documentation for \N
for UnicodeSet:
Trying this in an old C++ UnicodeSet demo gives a weird result: https://icu4c-demos.unicode.org/icu-bin/ubrowse?go=FFFF&us=%5B%5CN%7BDIGIT+FOUR%7D%5D&gosetk.x=12&gosetk.y=23
I think this is not an ICU bug because ICU UnicodeSet does not support \N.
In the unicodetools, it must be handled by Mark's SymbolTable implementation. And that probably has no way of distinguishing between "set" and "character" results -- looking at https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/UnicodeSet.XSymbolTable.html
Actually, clicking a different button on the old demo seems to recognize \N: https://icu4c-demos.unicode.org/icu-bin/ubrowse?go=FFF0&us=%5B%5CN%7BDIGIT+FOUR%7D%5D&gosetn.x=17&gosetn.y=17
Could you please try this in vanilla ICU4J UnicodeSet, without the Unicode Tools machinery?
@markusicu wrote:
I think this is not an ICU bug because ICU UnicodeSet does not support \N.
Support for \N
was added on 2002-08-28, in https://github.com/unicode-org/icu/commit/681c0468a4139488ec92e7d027c6aeaab819ae01, for ICU-1767.
Next thing to check in ICU behavior is whether we allow a subtraction like CHARACTER_CLASS '-' LITERAL. (Using UTS18 syntax rule names here.) If we do, then turning \N into a LITERAL would be harmless. If we don't, then we could potentially break existing patterns.
--> ICU-22851 “UnicodeSet should treat \N like a character not like a set”
The ICU User Guide does not go into details about
\N
: https://unicode-org.github.io/icu/userguide/strings/regexp.html
For ICU4C regex, \N{whatever} matches exactly one code point.
The whatever string, delimited by the braces, is dumped into
UChar32 theChar = u_charFromName(U_UNICODE_CHAR_NAME, name, fStatus);
to get the character.
See https://www.unicode.org/reports/tr35/#unicodeset-syntax,
\N{whatever}
is aquoted
, which is achar
, which is anelement
, so one should be able to use it in arange
.ICU treats it a a synonym for
\p{Name=whatever}
, which is aunicodeSet
, so that the thing that should be a range is a set difference (equal to its LHS in practice).That also means that
\N{nonexistent name}
is silently empty.We should fix it in ICU, but we urgently need to fix it here for UCD development, because we have already been bitten by it, with no-one noticing; the following only tests the first letter of the Garay alphabet: https://github.com/unicode-org/unicodetools/blob/cccbe935932337bc7f87b8a7b151d6e2fcf78a17/unicodetools/src/main/resources/org/unicode/text/UCD/AdditionComparisons.txt#L27-L34