Closed djoooooe closed 1 year ago
Nice, thank you! Will review and merge.
It looks ok to me but the equivalent code in CRuby for left_adjust_char_head does not have the extra code you added:
static UChar*
left_adjust_char_head(const UChar* start, const UChar* s, const UChar* end, OnigEncoding enc ARG_UNUSED)
{
const UChar *p;
if (s <= start) return (UChar* )s;
p = s;
while (!utf8_islead(*p) && p > start) p--;
return (UChar* )p;
}
Does this mean CRuby is not doing this right? Or else why do we need this code?
I'd say that CRuby is not doing it right:
As far as i understand the JCodings API, left_adjust_char_head
should move the cursor left by an entire codepoint.
CESU-8 encodes codepoints greater than 0xffff
as a UTF-16 surrogate pair, where the individual surrogate values are encoded in UTF-8, so they look like two individual UTF-8 codepoints; When left_adjust_char_head
only moves to the previous UTF-8 codepoint head, it will move into the "middle" of the surrogate pair instead of the beginning of the full codepoint.
The new unit test I included in this pull request demonstrates this: There is already a unit test that verifies that the codepoint length of the surrogate pair "\u00ed\u00a0\u0080\u00ed\u00b0\u0080"
is 1
. The new test also verifies that left_adjust_char_head
moves past the entire sequence. Without the patch, left_adjust_char_head
only moves to byte index 3
.
Another comment I got was:
there os USE_INVALID_CODE_SCHEME as truee wrt: #ifdef USE_INVALID_CODE_SCHEME if (p > 0xfd) { return ((p == 0xfe) ? INVALID_CODE_FE : INVALID_CODE_FF); }
I'm sorry, i don't know how this interacts with the implementation of left_adjust_char_head
. Could you elaborate?
I'd say that CRuby is not doing it right
... it will move into the "middle" of the surrogate pair instead of the beginning of the full codepoint.
Ok that makes sense. I think we should bring this up with CRuby folks (https://bugs.ruby-lang.org) and probably eventually (after they accept it as a bug) move the test into spec/ruby so it can be upstreamed to the other implementations.
I don't see any reason not to merge this now. 👍
Oh, for bonus points you could make a PR for CRuby that fixes it in the same way!
@djoooooe Yeah I just got that from someone as a comment. I think it was mentioned only because it was in the same method you had modified but in reading this myself I realize the logic is actually there and it is not even an issue in matching up to C codebase. It just uses true constant to look more closely like the C. So no changes for that.
Oh, for bonus points you could make a PR for CRuby that fixes it in the same way!
@djoooooe whoops we should have did that on Monday :) Thanks for your contribution.
@enebo You're welcome :) Thanks for the quick responses!
We can spin a release of this any time. If you need it in JRuby, that might take a little longer.
CESU8Encoding.leftAdjustCharHead
currently does not properly handle 6-byte sequences. This pull request aims to fix that.