13F8 is not mapped to 13F0 on JDK 9

sco0ter commented 7 years ago

Original report by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).

On JDK 9, CaseFoldingTests fails at code point 13F8, as it apparently is not mapped to 13F0 (as required by CaseFolding-8.0.0.txt).

I assume that the explicit support for Unicode 8 in Java 9 causes this problem, because these characters were probably simply not contained in Java 8.

sco0ter commented 7 years ago

Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).

Looking at the implementation, I think it is clear why the test breaks. As found on various web sites, case folding is not possible by simply chaining upper/lower operations. This works well for Latin, but not for some other languages. With JDK 9, the Cherokee language was added, which has exactly this problem. So the question is, what algorithm do we actually want for such a case?
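To make the chaining problem concrete, here is a minimal sketch of the upper-then-lower round trip applied per code point. The class and method names are made up for illustration and are not taken from the precis code base; only Character.toUpperCase(int) and Character.toLowerCase(int) are real JDK calls.

```java
final class UpperLowerRoundTrip {

    /** Approximates simple case folding as toLowerCase(toUpperCase(cp)). */
    static int approximateFold(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    public static void main(String[] args) {
        // Latin capital A, Greek final sigma, Cherokee small letter YE.
        int[] samples = {0x0041, 0x03C2, 0x13F8};
        for (int cp : samples) {
            System.out.printf("U+%04X -> U+%04X%n", cp, approximateFold(cp));
        }
    }
}
```

The round trip handles U+0041 and U+03C2 as CaseFolding.txt expects, but for U+13F8 it ends on the lowercase letter again, whereas CaseFolding-8.0.0.txt maps 13F8 to the uppercase 13F0.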

sco0ter commented 7 years ago

Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).

@sco0ter How to proceed?

sco0ter commented 7 years ago

Original comment by Christian Schudt (Bitbucket: sco0ter, GitHub: sco0ter).

I've spent some hours on this issue already, without a good solution.

The test seems to pass for the Cherokee code points if only toUpperCase() is applied to them. This is really weird, because for all other code points, case folding works with toUpper().toLower() (as the test shows).

Actually it feels like a bug in Java 9 then.

I've tried many things with the Character class to detect Cherokee code points, but with no success.

We could also simply create a static mapping (by parsing the CaseFolding.txt) and use that mapping for case folding.

At the moment I don't really care about Cherokee codepoints, so I'll leave this issue open.

sco0ter commented 7 years ago

Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).

Well, if it is a bug in Java 9 I could report it and we just wait for a fix. On the other hand, I am not a PRECIS guru, so can we be sure that it IS a bug?

sco0ter commented 7 years ago

Original comment by Christian Schudt (Bitbucket: sco0ter, GitHub: sco0ter).

I am also not sure how "case folding" really works.

When I implemented it, I've found this (among others): https://docs.atlassian.com/jira/7.1.6/com/atlassian/jira/util/CaseFolding.html (trick with toLower(toUpper())).

and it indeed did the trick (at least for Java 8).

I am not sure if it's a bug, but it's at least weird that these few new code points behave differently now.

sco0ter commented 7 years ago

Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).

I looked at how String.toLowerCase is implemented in Java 9, and it is pretty clear that the solution is error-prone: it contains hard-coded special cases for particular code points. I assume that they forgot to add more special cases for the new Unicode 8 code points. I will check it a bit deeper over the next few days, but it might take a while, as I need to prepare for EclipseCon first.

sco0ter commented 7 years ago

Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).

Since, in my understanding, it is clearly a bug in JDK 9, I just reported it to Oracle. Let's see what happens next.

sco0ter commented 6 years ago

Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).

Meanwhile, Oracle confirmed that it is a bug. While I am confident that they will fix it some day, I do not think it will be any time soon. So I developed a workaround: the test case simply skips the Cherokee Unicode block. Works pretty well.

sco0ter commented 6 years ago

Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).

Meanwhile, I got an answer from Oracle. There is actually no bug in the JRE, but it is definitely a bug in Precis!

Quote from the Unicode 8.0 specification (page 156)

...the addition of lowercase Cherokee letters as of Version 8.0 of the
Unicode Standard, together with the stability guarantees for case folding, require that
Cherokee letters be case folded to their uppercase counterparts. As a result, a case folded
string is not necessarily lowercase.

As a result, we have three choices to be able to compile on JDK 9 (or more precisely, on any JDK supporting Unicode 8.0):

1. Adopt my existing workaround, i.e. simply skip Cherokee symbols in the test case. This has the side effect that, at least for Cherokee symbols, the PrecisProfile.caseFold method returns an incorrect result.
2. Keep the test as it is, but skip Cherokee symbols in the implementation of PrecisProfile.caseFold. This has the same negative effect, and the implementation will slow down a bit, as we have to check every single character for whether it is Cherokee or not.
3. Implement a fully correct lookup in the mapping file, as required by the above quote from the spec. This certainly gives a perfect result, but will either eat a lot of CPU cycles (live lookup in the file) or eat up a lot of RAM (preload and cache the map in memory).

So the question is: What would you like me to do? For the sake of correctness and performance, I'd go with option 3 and preload the file into a map at class loading. But it is your project, so you have to decide what my PR shall do.
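As a rough idea of what option 3 could look like, the sketch below parses lines in the CaseFolding.txt format (code; status; mapping; # name) into an in-memory map and folds a string with it. The class CaseFoldingTable and its methods are hypothetical names for illustration, not the code from the later commit. It assumes that full case folding means keeping the entries with status C and F.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

final class CaseFoldingTable {

    // Code point -> folded replacement (may be more than one character).
    private final Map<Integer, String> mappings = new HashMap<>();

    /** Reads CaseFolding.txt-style lines, keeping common (C) and full (F) mappings. */
    static CaseFoldingTable parse(Reader source) {
        CaseFoldingTable table = new CaseFoldingTable();
        new BufferedReader(source).lines().forEach(table::addLine);
        return table;
    }

    private void addLine(String line) {
        int hash = line.indexOf('#');
        if (hash >= 0) {
            line = line.substring(0, hash); // strip trailing comment
        }
        String[] fields = line.split(";");
        if (fields.length < 3) {
            return; // blank line or comment-only line
        }
        String status = fields[1].trim();
        if (!status.equals("C") && !status.equals("F")) {
            return; // skip simple (S) and Turkic (T) entries
        }
        int codePoint = Integer.parseInt(fields[0].trim(), 16);
        StringBuilder folded = new StringBuilder();
        for (String hex : fields[2].trim().split("\\s+")) {
            folded.appendCodePoint(Integer.parseInt(hex, 16));
        }
        mappings.put(codePoint, folded.toString());
    }

    /** Folds each code point via the table; unmapped code points pass through. */
    String fold(String input) {
        StringBuilder result = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String mapped = mappings.get(cp);
            if (mapped != null) {
                result.append(mapped);
            } else {
                result.appendCodePoint(cp);
            }
        });
        return result.toString();
    }
}
```

Because the table is built from the data file rather than from Character methods, the result does not depend on which Unicode version the running JDK supports. The price is the load time and memory mentioned above.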

sco0ter commented 6 years ago

Original comment by Christian Schudt (Bitbucket: sco0ter, GitHub: sco0ter).

Markus, thanks for caring and checking this issue with Oracle. Really weird though.

I'd also go with option 3.

What about option 4? Iterate over each character and check whether there's a Cherokee char in the String. If not, do toUpperCase().toLowerCase() on the String. If yes, manually assemble the case-folded string (treating Cherokee chars with toUpperCase() and every other char with toUpperCase().toLowerCase()). The downside is that we have to create a String for each Character and also consider surrogate pairs, which can probably get tricky.
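A minimal sketch of how option 4 might look when done per code point, which avoids both the per-character String and the manual surrogate handling. The class name is hypothetical. The detection via Character.UnicodeScript is an assumption of this sketch, since the thread does not say whether it was among the things already tried. Note that it uses the simple per-code-point mappings of Character, not the full String mappings.

```java
final class CherokeeAwareFold {

    /** Assumed detection step: checks the Unicode script of the code point. */
    static boolean isCherokee(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.CHEROKEE;
    }

    /** Cherokee folds to uppercase; everything else goes upper, then lower. */
    static String fold(String input) {
        StringBuilder result = new StringBuilder(input.length());
        // codePoints() yields whole code points, so surrogate pairs stay intact.
        input.codePoints().forEach(cp -> {
            int upper = Character.toUpperCase(cp);
            result.appendCodePoint(isCherokee(cp) ? upper : Character.toLowerCase(upper));
        });
        return result.toString();
    }
}
```

On a runtime without Unicode 8 data the Cherokee small letters are unassigned, so isCherokee returns false for them and they pass through unchanged.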

sco0ter commented 6 years ago

Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).

Option 4 would work for now with Cherokee, but would possibly break again in the future once Unicode 9.0.0 adopts the next strange set of rules for another ancient language... So I will try to implement option 3 now. It shouldn't be too complex.

sco0ter commented 6 years ago

Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).

Implemented option 3. Works pretty well. Even more, it now provides FULL case folding according to Unicode 8.0. Class loading is a bit slow now (100ms), but that should not be a problem in the real world. See https://bitbucket.org/mkarg/precis/commits/0f00d2ed2ea872ebc85d311fdf2fbdaa2ddb74c6.

I will send a pull request once I have finished my current work on the module-info.class tests in Babbler (intentionally keeping the fix unmerged until I am really sure that no other JDK 9 related precis fixes are needed).

sco0ter commented 5 years ago

Original comment by Christian Schudt (Bitbucket: sco0ter, GitHub: sco0ter).

Case folding was used in RFC 7564, but it is no longer used in RFC 8264.

Instead of case folding, toLowerCase() is now used.

See 850d123a

Since then it compiles with JDK 9, too.

I am closing this issue.
