google / j2cl

Java to Closure JavaScript transpiler
Apache License 2.0
1.23k stars 144 forks source link

Incorrect `fromCodePoint` emulation #171

Closed jsannemo closed 1 year ago

jsannemo commented 1 year ago

Describe the bug In my Java code, I call Character.isWhitespace, which is implemented as:

    public static boolean isWhitespace(int codePoint) {
        return CharacterData.of(codePoint).isWhitespace(codePoint);
    }

Getting the character from a codepoint is done by String.fromCodePoint() in the generated JavaScript. This, however, is slightly incompatible: fromCodePoint throws for invalid codepoints, while Java simply returns false.

To Reproduce

Call the function with e.g. -1.

Bazel version 5.2.0

Expected behavior The function returns false.

niloc132 commented 1 year ago

That doesn't appear to be the current implementation of Character.isWhitespace: https://github.com/google/j2cl/blob/73cbc716da3435bbc44bbf3065081584f2e11a9d/jre/java/java/lang/Character.java#L309-L330

That has been the implementation since the code was migrated from github.com/gwtproject/gwt, where it was the same: https://github.com/gwtproject/gwt/blob/18c5d993469735d107735b1680d7e667fcc858b8/user/super/com/google/gwt/emul/java/lang/Character.java#L309-L330

Additionally, I can't find a CharacterData class (outside of the elemental2 binding for https://developer.mozilla.org/en-US/docs/Web/API/CharacterData, which has neither an of() nor isWhitespace() method) - can you shed some more light on your code sample, where you are getting these from?


The code above, from the j2cl repository, when uses with Character.isWhitespace(-1), does indeed seem to compile out in a way that makes sense (..minus the a||(a=...); wrapper bit, which doesnt serve an obvious purpose to me), but fails:

var a, b, c;
c=String.fromCodePoint(-1);
a||(a=RegExp("[\\u1680\\u180E\\u2000-\\u2006\\u2008-\\u200A\\u2028\\u2029\\u205F\\u3000\\uFEFF]|[\\t-\\r ]|[\\x1C-\\x1F]"));
b=a.test(c)

As you pointed out, the emulated String.java file has a fromCodePoint method which fails when passed -1 by compiling to JS's String.fromCodePoint(-1) directly.

https://github.com/google/j2cl/blob/73cbc716da3435bbc44bbf3065081584f2e11a9d/jre/java/java/lang/String.java#L221-L223

GWT's implementation is a little different, and fixes this issue by truncating the int to a char.

https://github.com/gwtproject/gwt/blob/18c5d993469735d107735b1680d7e667fcc858b8/user/super/com/google/gwt/emul/java/lang/String.java#L223-L232

Copying only that one method into J2CL seems to fix the bug:

var a,b,c;
c=String.fromCharCode(65535);
a||(a=RegExp("[\\u1680\\u180E\\u2000-\\u2006\\u2008-\\u200A\\u2028\\u2029\\u205F\\u3000\\uFEFF]|[\\t-\\r ]|[\\x1C-\\x1F]"));
b=a.test(c);
> false

It looks like partially reverting or modifying https://github.com/google/j2cl/commit/4f3f95214a4fc89abfb2f71c1f40136c8901b233 will resolve this.

jsannemo commented 1 year ago

Ah, I meant that openjdk is implemented like that (to contrast with j2cl).

gkdn commented 1 year ago

Yes looks like I quite recently added regression on this behavior :)

We can guard isWhiteSpace with a isValidCodePoint check. If this is important for you, pls feel free to send a patch.

niloc132 commented 1 year ago
jshell> IntStream.range(Character.MAX_VALUE, Character.MAX_CODE_POINT).filter(c -> Character.isWhitespace(c)).findFirst()
$1 ==> OptionalInt.empty

It looks like there aren't any valid whitespace characters above Character.MAX_VALUE, so my gut reaction was to implement this as

  public static boolean isWhitespace(int codePoint) {
    if (codePoint > MAX_VALUE || codePoint < 0) {
      // no whitespace exists above Character.MAX_VALUE
      return false;
    }
    return isWhitespace(String.fromCodePoint(codePoint));
  }

and this seems to behave as you would expect.

With that said, you expressed a preference for guarding with a call to isValidCodePoint, so that's what my PR does. https://github.com/google/j2cl/pull/172