Closed sco0ter closed 5 years ago
Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).
Looking at the implementation, I think it is clear why the test breaks. As various websites point out, case folding cannot be done by simply chaining upper/lower operations. This works well for Latin, but not for some other scripts. With JDK 9, Cherokee was added, which has exactly this problem. So the question is, what algorithm do we actually want for such a case?
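To make the problem concrete, here is a minimal sketch of the chained approach. The class and method names are hypothetical and this is not the actual PrecisProfile code. On a JDK with Unicode 8.0 data the chain should hand back U+13F8 unchanged, whereas CaseFolding-8.0.0.txt asks for U+13F0.

```java
import java.util.Locale;

public class NaiveCaseFoldDemo {

    // The chained upper/lower approach discussed in this thread (hypothetical helper).
    static String naiveFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Latin behaves as expected.
        System.out.println(naiveFold("Hello"));

        // CHEROKEE SMALL LETTER YE: print the code point rather than the glyph,
        // so the result is readable without a Cherokee font.
        String cherokeeSmallYe = "\u13F8";
        System.out.printf("U+%04X%n", naiveFold(cherokeeSmallYe).codePointAt(0));
    }
}
```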
Original comment by Christian Schudt (Bitbucket: sco0ter, GitHub: sco0ter).
I've spent some hours on this issue already, without a good solution.
The test seems to pass for the Cherokee code points if only toUpperCase() is applied to them. This is really weird, because for every other code point, case folding works with toUpper().toLower() (as the test shows).
Actually it feels like a bug in Java 9 then.
I've tried many things with the Character class to detect Cherokee code points, but with no success.
We could also simply create a static mapping (by parsing the CaseFolding.txt) and use that mapping for case folding.
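A static mapping along these lines could be built roughly as follows. This is only a sketch: the class name is made up, the parser takes the lines as a list instead of reading the file, and it assumes the `<code>; <status>; <mapping>; # <name>` layout of CaseFolding.txt, keeping statuses C and F (full case folding) and ignoring S and T.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CaseFoldingTableParser {

    /**
     * Parses lines in the CaseFolding.txt layout into a code point -> folded string map.
     * Keeps statuses C (common) and F (full); skips S (simple) and T (Turkic).
     */
    static Map<Integer, String> parse(List<String> lines) {
        Map<Integer, String> table = new HashMap<>();
        for (String line : lines) {
            // Drop the trailing "# NAME" comment and surrounding whitespace.
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue; // blank line or pure comment
            }
            String[] fields = data.split(";");
            if (fields.length < 3) {
                continue; // not a mapping line
            }
            String status = fields[1].trim();
            if (!status.equals("C") && !status.equals("F")) {
                continue;
            }
            int source = Integer.parseInt(fields[0].trim(), 16);
            // The mapping may consist of several space-separated code points.
            StringBuilder target = new StringBuilder();
            for (String hex : fields[2].trim().split("\\s+")) {
                target.appendCodePoint(Integer.parseInt(hex, 16));
            }
            table.put(source, target.toString());
        }
        return table;
    }
}
```

Fed with the line `13F8; C; 13F0; # CHEROKEE SMALL LETTER YE`, the resulting map sends U+13F8 to U+13F0, independent of what the JDK's own case conversion does.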
At the moment I don't really care about Cherokee codepoints, so I'll leave this issue open.
Original comment by Christian Schudt (Bitbucket: sco0ter, GitHub: sco0ter).
I am also not sure how "case folding" really works.
When I implemented it, I found this (among others): https://docs.atlassian.com/jira/7.1.6/com/atlassian/jira/util/CaseFolding.html (trick with toLower(toUpper())), and it indeed did the trick (at least for Java 8).
I am not sure if it's a bug, but it's at least weird that these few new code points behave differently now.
Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).
I looked at how String.toLowerCase is implemented in Java 9, and it is pretty clear that the solution is error-prone: it contains hard-coded special cases for particular code points. I assume that they forgot to add more special cases for the new Unicode 8 code points. I will look into it more deeply in the next few days, but it might take a while, as I need to prepare for EclipseCon first.
Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).
Meanwhile, Oracle confirmed that it is a bug. While I am confident that they will fix it some day, I do not think it will be any time soon. So I developed a workaround: the test case simply skips the Cherokee Unicode block. Works pretty well.
Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).
Meanwhile, I got an answer from Oracle. There is actually no bug in the JRE, but there definitely is a bug in Precis!
...the addition of lowercase Cherokee letters as of Version 8.0 of the
Unicode Standard, together with the stability guarantees for case folding, require that
Cherokee letters be case folded to their uppercase counterparts. As a result, a case folded
string is not necessarily lowercase.
As a result, we have three choices to be able to compile on JDK 9 (or more precisely, on any JDK supporting Unicode 8.0).
PrecisProfile.caseFold method is returning an incorrect result.
PrecisProfile.caseFold. Same negative effect, and the implementation will slow down a bit, as we have to check every single character for whether it is Cherokee or not.
So the question is: What would you like me to do? For the sake of correctness and performance, I'd go with option 3 and preload the file into a map at class loading. But it is your project, so you have to decide what my PR shall do.
Original comment by Christian Schudt (Bitbucket: sco0ter, GitHub: sco0ter).
Markus, thanks for caring and checking this issue with Oracle. Really weird though.
I'd also go with option 3.
What about option 4?
Iterate over each character and check whether there is a Cherokee character in the String. If not, apply toUpperCase().toLowerCase() to the String.
If yes, manually assemble the case-folded string (treating Cherokee characters with toUpperCase() and every other character with toUpperCase().toLowerCase()).
The downside is that we have to create a String for each character and also handle surrogate pairs, which can probably get tricky.
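For illustration, option 4 might look roughly like the sketch below. It iterates over code points rather than chars, which takes care of surrogate pairs, and it assumes that Character.UnicodeScript is a workable way to detect Cherokee. That detection is an assumption of this sketch, not something verified in this thread, and its result depends on the Unicode version of the running JDK. The class name is hypothetical.

```java
import java.util.Locale;

public class CherokeeAwareFold {

    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // codePoints() yields whole code points, so surrogate pairs stay intact.
        input.codePoints().forEach(cp -> {
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.CHEROKEE) {
                // Cherokee folds to uppercase.
                out.appendCodePoint(Character.toUpperCase(cp));
            } else {
                // Everything else: the upper-then-lower chain, one code point at a time.
                String single = new String(Character.toChars(cp));
                out.append(single.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X%n", fold("\u13F8").codePointAt(0));
        System.out.println(fold("Hello"));
    }
}
```

This does create one short String per non-Cherokee code point, which is exactly the cost mentioned above.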
Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).
Option 4 would work for now with Cherokee, but could break again in the future once Unicode 9.0.0 adopts the next strange set of rules for another ancient language... So I will try to implement option 3 now. Shouldn't be too complex.
Original comment by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).
Implemented option 3. Works pretty well. Even better, it now provides FULL case folding according to Unicode 8.0. Class loading is a bit slow now (100 ms), but that should not be a problem in the real world. See https://bitbucket.org/mkarg/precis/commits/0f00d2ed2ea872ebc85d311fdf2fbdaa2ddb74c6.
I will send a pull request once I have finished my current work on the module-info.class tests in Babbler (intentionally keeping the fix unmerged until I am really sure that no other JDK 9 related precis fixes are needed).
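The lookup side of a table-driven fold can be sketched as follows. This is not the code from the linked commit; the class name is made up, and the tiny hand-written table in main only stands in for the map that would be loaded from CaseFolding.txt.

```java
import java.util.HashMap;
import java.util.Map;

public class TableCaseFold {

    /** Replaces every code point that has a table entry; all others pass through unchanged. */
    static String fold(String input, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String mapped = table.get(cp);
            if (mapped != null) {
                out.append(mapped); // may be longer than one code point (full folding)
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the preloaded table (illustrative entries only).
        Map<Integer, String> table = new HashMap<>();
        table.put(0x0041, "a");       // LATIN CAPITAL LETTER A
        table.put(0x00DF, "ss");      // LATIN SMALL LETTER SHARP S
        table.put(0x13F8, "\u13F0");  // CHEROKEE SMALL LETTER YE
        fold("A\u00DF\u13F8", table).codePoints()
                .forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Because the result no longer depends on toUpperCase()/toLowerCase(), it stays the same regardless of which Unicode version the JDK ships.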
Original report by Markus KARG (Bitbucket: mkarg, GitHub: mkarg).
On JDK 9, CaseFoldingTests fails at code point 13F8, as it apparently is not mapped to 13F0 (as required by CaseFolding-8.0.0.txt).
I assume that the explicit support for Unicode 8 in Java 9 causes this problem, because those characters presumably were not present in Java 8.