Fix CESU8Encoding.leftAdjustCharHead

jruby / jcodings

Java-based codings helper classes for Joni and JRuby

MIT License

20 stars 29 forks source link

Fix CESU8Encoding.leftAdjustCharHead #61

Closed djoooooe closed 1 year ago

djoooooe commented 1 year ago

CESU8Encoding.leftAdjustCharHead currently does not properly handle 6-byte sequences. This pull request aims to fix that.

headius commented 1 year ago

Nice, thank you! Will review and merge.

headius commented 1 year ago

It looks ok to me but the equivalent code in CRuby for left_adjust_char_head does not have the extra code you added:

static UChar*
left_adjust_char_head(const UChar* start, const UChar* s, const UChar* end, OnigEncoding enc ARG_UNUSED)
{
  const UChar *p;

  if (s <= start) return (UChar* )s;
  p = s;

  while (!utf8_islead(*p) && p > start) p--;
  return (UChar* )p;
}

Does this mean CRuby is not doing this right? Or else why do we need this code?

djoooooe commented 1 year ago

I'd say that CRuby is not doing it right: As far as i understand the JCodings API, left_adjust_char_head should move the cursor left by an entire codepoint. CESU-8 encodes codepoints greater than 0xffff as a UTF-16 surrogate pair, where the individual surrogate values are encoded in UTF-8, so they look like two individual UTF-8 codepoints; When left_adjust_char_head only moves to the previous UTF-8 codepoint head, it will move into the "middle" of the surrogate pair instead of the beginning of the full codepoint.

djoooooe commented 1 year ago

The new unit test I included in this pull request demonstrates this: There is already a unit test that verifies that the codepoint length of the surrogate pair "\u00ed\u00a0\u0080\u00ed\u00b0\u0080" is 1. The new test also verifies that left_adjust_char_head moves past the entire sequence. Without the patch, left_adjust_char_head only moves to byte index 3.

djoooooe commented 1 year ago

Another comment I got was:

there os USE_INVALID_CODE_SCHEME as truee wrt: #ifdef USE_INVALID_CODE_SCHEME if (p > 0xfd) { return ((p == 0xfe) ? INVALID_CODE_FE : INVALID_CODE_FF); }

I'm sorry, i don't know how this interacts with the implementation of left_adjust_char_head. Could you elaborate?

headius commented 1 year ago

I'd say that CRuby is not doing it right

... it will move into the "middle" of the surrogate pair instead of the beginning of the full codepoint.

Ok that makes sense. I think we should bring this up with CRuby folks (https://bugs.ruby-lang.org) and probably eventually (after they accept it as a bug) move the test into spec/ruby so it can be upstreamed to the other implementations.

I don't see any reason not to merge this now. 👍

headius commented 1 year ago

Oh, for bonus points you could make a PR for CRuby that fixes it in the same way!

enebo commented 1 year ago

@djoooooe Yeah I just got that from someone as a comment. I think it was mentioned only because it was in the same method you had modified but in reading this myself I realize the logic is actually there and it is not even an issue in matching up to C codebase. It just uses true constant to look more closely like the C. So no changes for that.

djoooooe commented 1 year ago

Oh, for bonus points you could make a PR for CRuby that fixes it in the same way!

Done: https://github.com/ruby/ruby/pull/7510

enebo commented 1 year ago

@djoooooe whoops we should have did that on Monday :) Thanks for your contribution.

djoooooe commented 1 year ago

@enebo You're welcome :) Thanks for the quick responses!

headius commented 1 year ago

We can spin a release of this any time. If you need it in JRuby, that might take a little longer.