Work around a UnicodeSet bug

eggrobin commented 3 months ago

See https://www.unicode.org/reports/tr35/#unicodeset-syntax, \N{whatever} is a quoted, which is a char, which is an element, so one should be able to use it in a range.

ICU treats it a a synonym for \p{Name=whatever}, which is a unicodeSet, so that the thing that should be a range is a set difference (equal to its LHS in practice).

That also means that \N{nonexistent name} is silently empty.

We should fix it in ICU, but we urgently need to fix it here for UCD development, because we have already been bitten by it, with no-one noticing; the following only tests the first letter of the Garay alphabet: https://github.com/unicode-org/unicodetools/blob/cccbe935932337bc7f87b8a7b151d6e2fcf78a17/unicodetools/src/main/resources/org/unicode/text/UCD/AdditionComparisons.txt#L27-L34

markusicu commented 3 months ago

test failure: https://github.com/unicode-org/unicodetools/actions/runs/10322995005/job/28579439631?pr=908

ParseErrorCount=6
TestFailureCount=0
Error:  Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 63.737 s <<< FAILURE! - in org.unicode.text.UCD.TestTestUnicodeInvariants
Error:  org.unicode.text.UCD.TestTestUnicodeInvariants.testUnicodeInvariants  Time elapsed: 63.719 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: TestUnicodeInvariants.testInvariants(default) failed ==> expected: <0> but was: <6>
    at org.unicode.text.UCD.TestTestUnicodeInvariants.testUnicodeInvariants(TestTestUnicodeInvariants.java:39)

markusicu commented 3 months ago

Is there an ICU ticket? Would you be willing to work on it?

markusicu commented 3 months ago

@aheninger FYI

markusicu commented 3 months ago

The ICU User Guide does not go into details about \N: https://unicode-org.github.io/icu/userguide/strings/regexp.html

eggrobin commented 3 months ago

Is there an ICU ticket?

Not yet. Could you file one?

Would you be willing to work on it?

Yes, if you file the ticket please assign it to me.

markusicu commented 3 months ago

Actually, at a closer look, while there is some documentation for ICU regex, there is no documentation for \N for UnicodeSet:

Trying this in an old C++ UnicodeSet demo gives a weird result: https://icu4c-demos.unicode.org/icu-bin/ubrowse?go=FFFF&us=%5B%5CN%7BDIGIT+FOUR%7D%5D&gosetk.x=12&gosetk.y=23

I think this is not an ICU bug because ICU UnicodeSet does not support \N.

In the unicodetools, it must be handled by Mark's SymbolTable implementation. And that probably has no way of distinguishing between "set" and "character" results -- looking at https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/UnicodeSet.XSymbolTable.html

markusicu commented 3 months ago

Actually, clicking a different button on the old demo seems to recognize \N: https://icu4c-demos.unicode.org/icu-bin/ubrowse?go=FFF0&us=%5B%5CN%7BDIGIT+FOUR%7D%5D&gosetn.x=17&gosetn.y=17

Could you please try this in vanilla ICU4J UnicodeSet, without the Unicode Tools machinery?

eggrobin commented 3 months ago

@markusicu wrote:

I think this is not an ICU bug because ICU UnicodeSet does not support \N.

https://github.com/unicode-org/icu/blob/b5b3e16afac61f9aa9b775aaf497f8cc88ce9481/icu4j/main/core/src/main/java/com/ibm/icu/text/UnicodeSet.java#L3741-L3751

https://github.com/unicode-org/icu/blob/b5b3e16afac61f9aa9b775aaf497f8cc88ce9481/icu4j/main/core/src/main/java/com/ibm/icu/text/UnicodeSet.java#L3791

https://github.com/unicode-org/icu/blob/b5b3e16afac61f9aa9b775aaf497f8cc88ce9481/icu4j/main/core/src/main/java/com/ibm/icu/text/UnicodeSet.java#L3802-L3812

https://github.com/unicode-org/icu/blob/b5b3e16afac61f9aa9b775aaf497f8cc88ce9481/icu4j/main/core/src/main/java/com/ibm/icu/text/UnicodeSet.java#L3835-L3850

eggrobin commented 3 months ago

Support for \N was added on 2002-08-28, in https://github.com/unicode-org/icu/commit/681c0468a4139488ec92e7d027c6aeaab819ae01, for ICU-1767.

markusicu commented 3 months ago

Next thing to check in ICU behavior is whether we allow a subtraction like CHARACTER_CLASS '-' LITERAL. (Using UTS18 syntax rule names here.) If we do, then turning \N into a LITERAL would be harmless. If we don't, then we could potentially break existing patterns.

--> ICU-22851 “UnicodeSet should treat \N like a character not like a set”

aheninger commented 3 months ago

The ICU User Guide does not go into details about \N: https://unicode-org.github.io/icu/userguide/strings/regexp.html

For ICU4C regex, \N{whatever} matches exactly one code point. The whatever string, delimited by the braces, is dumped into UChar32 theChar = u_charFromName(U_UNICODE_CHAR_NAME, name, fStatus); to get the character.

unicode-org / unicodetools

Work around a UnicodeSet bug #908