konsoletyper / teavm

Compiles Java bytecode to JavaScript, WebAssembly and C
https://teavm.org
Apache License 2.0
2.62k stars 262 forks source link

Initial implementation for native to_____Case #766

Closed Ihromant closed 10 months ago

Ihromant commented 1 year ago

@konsoletyper tried to implement native toUpperCase and toLowerCase methods for String. There are few reasons to do this.

  1. Performance of native call will probably be faster than emulated method.
  2. Code JS size will be reduced.
  3. Possible locale issues are automatically fixed. Also for locale variable which is skipped now https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/toLocaleLowerCase can be used.

For some reason generated methods are not working. I checked that String -> JSString conversion is made just by return str;, so I assumed that variable in method is also native String. But for some reason it is not working.

konsoletyper commented 1 year ago

This approach has its disadvantage: potential inconsistency with Character.getType

konsoletyper commented 1 year ago

You code is not working because in TeaVM strings aren't JS strings, but fair emulated Java object. They have characters field internally which contain array of character codes. You need to perform conversion, see how $rt_str and $rt_ustr used in other generators.

konsoletyper commented 1 year ago

Another source of inconsistency: Character.toLowerCase, Character.isLowerCase

Ihromant commented 1 year ago

Character.toLowerCase can be implemented using same approach wrapToStr(chr).toLowerCase().charAt(0). Character.isLowerCase can be done with wrapToStr(chr).toLowerCase() === wrapToStr(chr). I understand that JDK can have differences with specific browser in its chartables. But currently localed lower/uppercase is not implemented at all. Also tolower/uppercase without parameters according to documentation is toLowerCase(Locale.default()). Current implementation generally is toLowerCase(Locale.ROOT). From other side, both JDK and browsers eventually should follow RFC/tables of Unicode consortium. So, assuming that there is a mismatch in JDK or browser, it will be fixed eventually and implementations will be the same. The question here is whether this approach is acceptable and whether to continue working on this PR?

konsoletyper commented 1 year ago

What about Character.getType?

I understand that JDK can have differences with specific browser in its chartables

I don't say about inconsistencies between browser and JVM. I'm telling about inconsistencies within TeaVM.

Also tolower/uppercase without parameters according to documentation is toLowerCase(Locale.default()). Current implementation generally is toLowerCase(Locale.ROOT).

Can you show me the real-world cases when it's really important?

Ihromant commented 1 year ago

Can you show me the real-world cases when it's really important?

It was just one point (not the most important). My point is that there are inconsistencies already. The question is which are more important. My goal is to have performant String methods and codesize.

What about Character.getType?

I can't come with solution. I found that in JS world in this cases they are using regexps, obviously this is not our case. From other side, I took a look at methods that are using Character.getType() in JDK.

TCharacter.isSpaceChar(int)  (org.teavm.classlib.java.lang) // known and rarely changed
TCharacter.isUpperCase(int)  (org.teavm.classlib.java.lang) // solvable
TCharacter.isTitleCase(int)  (org.teavm.classlib.java.lang)  // here could be issue, but rare https://stackoverflow.com/questions/31770995/difference-between-uppercase-and-titlecase
TCharacter.isLetter(int)  (org.teavm.classlib.java.lang) // probably is the combination of other checks, but inconsistencies include only rare cases
TCharacter.isJavaIdentifierPart(int)  (org.teavm.classlib.java.lang) // don't know what it is
TCharacter.isDefined(int)  (org.teavm.classlib.java.lang) // suppose that solvable
TCharacter.isUnicodeIdentifierStart(int)  (org.teavm.classlib.java.lang) // don't know what it is for now
TCharacter.isJavaIdentifierStart(int)  (org.teavm.classlib.java.lang) // don't know what it is for now
TCharacter.isUnicodeIdentifierPart(int)  (org.teavm.classlib.java.lang) // don't know what it is for now
TCharacter.isIdentifierIgnorable(int)  (org.teavm.classlib.java.lang) // don't know what it is for now
TCharacter.isLowerCase(int)  (org.teavm.classlib.java.lang) // solvable
TCharacter.isAlphabetic(int)  (org.teavm.classlib.java.lang) // solvable
TCharacter.isDigit(int)  (org.teavm.classlib.java.lang) // solvable
TCharacter.isLetterOrDigit(int)  (org.teavm.classlib.java.lang) // solvable

So, generally what is unknown is the issue with Java-Unicode modifiers and this TitleCase case (which influences isLetter).

Ihromant commented 1 year ago

Now normal test run is successful, but it fails in optimized code for some reason. It seems that it's not properly converts String literals.

konsoletyper commented 1 year ago

Did you try to debug the issue?

Ihromant commented 1 year ago

How to debug JS which is launched from unvisible Chrome? Here are all included parts that cause the fail:

function jl_String_toUpperCaseNative(var$1) {
        return $rt_str($rt_ustr(var$1).toUpperCase());
    }
...
var$7 = jl_String_toUpperCaseNative($rt_s(8));
...
var$8 = var$7.$hashCode0(); // here is fail

It was wrapped and unwrapped, but error is

java.lang.RuntimeException: null: (JavaScript) TypeError: var$7.$hashCode0 is not a function

rt_s(8) is constant pool constant corresponding to "UTF-8". In non-minified js there is

"$hashCode0", $rt_wrapFunction0(jl_String_hashCode)

which is not present in minified JS. But I don't know the reason why. Test which tests toLowerCase and toUpperCase passes, other tests are failing, for instance: createdFromByteArray (which I'm investigating)

konsoletyper commented 1 year ago

My point is that there are inconsistencies already

Please, don't confise inconsistent and invalid behaviour. TeaVM's class library does not behave exactly according to spec, however, it's internally consistent. What you propose it to make some parts of TeaVM classlib to behave according to spec, and some (which are connected to previous parts) not according to spec. That was my point. The problem here is that it's impossible to achieve consistency, since JS is missing some functionality.

My goal is to have performant String methods and codesize.

As for performance: you can try to fix this PR and measure performance of generated code with and without this patch? Can you investigate JS benchmarking tools like JMH for Java?

konsoletyper commented 1 year ago

How to debug JS which is launched from unvisible Chrome?

You do know where to find these JS files, right? So what's the problem to open them in the browser? BTW, TeaVM already cares about it and puts html files as well. You just open HTML file in your browser and then open debugging tools.

konsoletyper commented 1 year ago

I noticed that you are using $rt_str and $rt_ustr as direct text. That's wrong, please find out any of their usages in existing code. There was a special function to write calls to runtime.

java.lang.RuntimeException: null: (JavaScript) TypeError: var$7.$hashCode0 is not a function

Sure. Dependency analyzer knows nothing about internals of your native function. You should hint it by implementing DependencyPlugin and binding it to method with @PluggableDependency. Just tell it that the resulting value is java.lang.String (i.e. call getResult().propagate)

Ihromant commented 1 year ago

@konsoletyper many thanks for help. Tests now are working. Added tests for uppercase and lowercase. It seems that most popular case of native upperCase and LowerCase is consistent with JVM Locales.

Ihromant commented 1 year ago

Just tried simplest possible benchmark.

List<String> strings = IntStream.range(0, 1000000).mapToObj(i -> UUID.randomUUID().toString()).collect(Collectors.toList());
        int l = 0;
        long time = System.currentTimeMillis();
        for (String s : strings) {
            l += s.toUpperCase().hashCode();
        }
        System.out.println((System.currentTimeMillis() - time) + " " + l);

Generated two files classes_native.js and classes_old.js. Then run it it chromium and firefox by replacing js filename. Results: |Browser|old|new| |Chromium|2900-3300|2000-2400| |Firefox|2700-2900|2100-2300| Seems that on my computer firefox is faster and produces more stable results. Still, it seems that even this small change produces 25-30% performance improvement.

konsoletyper commented 1 year ago
  1. Can you please add warn-up loop before long time =?
  2. Can you please experiment with different length strings (short strings, long strings, mixed strings)?
  3. Can you please compare generated JS file size?
konsoletyper commented 1 year ago

It seems that most popular case of native upperCase and LowerCase is consistent with JVM Locales.

Good, but they aren't consistent with other locale-specific methods of TeaVM

Ihromant commented 1 year ago
private static final int SAMPLE_SIZE = 100_000;
    public static void main(String[] args) {
        String[] strings = new String[SAMPLE_SIZE];
        generateSample(strings, () -> UUID.randomUUID().toString());
        int l = 0;
        for (String s : strings) {
            l -= s.toUpperCase().hashCode();
        }
        System.out.println("Warmup: " + l);
        generateSample(strings, () -> UUID.randomUUID().toString());
        long time = System.currentTimeMillis();
        for (String s : strings) {
            l += s.toUpperCase().hashCode();
        }
        System.out.println("UUID strings: " + (System.currentTimeMillis() - time) + ", sink: " + l);
        generateSample(strings, () -> UUID.randomUUID().toString().substring(0, ThreadLocalRandom.current().nextInt(1, 5)));
        time = System.currentTimeMillis();
        for (String s : strings) {
            l += s.toUpperCase().hashCode();
        }
        System.out.println("Short strings: " + (System.currentTimeMillis() - time) + ", sink: " + l);
        generateSample(strings, Client::generateLongString);
        time = System.currentTimeMillis();
        for (String s : strings) {
            l += s.toUpperCase().length();
        }
        System.out.println("Long strings: " + (System.currentTimeMillis() - time) + ", sink: " + l);
        generateSample(strings, Client::generateMixed);
        time = System.currentTimeMillis();
        for (String s : strings) {
            l += s.toUpperCase().length();
        }
        System.out.println("Mixed strings: " + (System.currentTimeMillis() - time) + ", sink: " + l);
    }

    private static void generateSample(String[] array, Supplier<String> generator) {
        for (int i = 0; i < array.length; i++) {
            array[i] = generator.get();
        }
    }

    private static String generateLongString() {
        StringBuilder builder = new StringBuilder();
        int size = ThreadLocalRandom.current().nextInt(90, 100);
        for (int i = 0; i < size; i++) {
            builder.append(UUID.randomUUID());
        }
        return builder.toString();
    }

    private static String generateMixed() {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            builder.append(UUID.randomUUID());
        }
        return builder.substring(0, ThreadLocalRandom.current().nextInt(1, 3500));
    }

Old code size: 59,8 KiB New code size: 55.0 KiB |Browser|old uuid|new uuid|old short|new short| |Chromium|283,286,290|196,141,206|25,26,26|83,48,68| |Firefox|283,272,276|224,236,222|59,61,59|94,90,110|

|Browser|old long|new long|old mixed|new mixed| |Chromium|10453,10796,10621|3926,3916,3981|5410,5622,5554|2048,2056,2166| |Firefox|20633,20107,21322|5155,5171,5090|11154,10286,11401|2630,2578,2644|

Seems that old method is faster on short strings (1-5 characters), but new measured values seem to be polluted by some side-effects. Still, on UUID-size (36 chars) new method is already faster and produced measurements are reliable. On long strings new approach is significantly faster (up to 3-4 times).

konsoletyper commented 1 year ago

I just pushed optimization for toLowerCase/toUpperCase, and after it these methods works 2.5 times faster. So performance is no more a reason to apply this PR, at least until there's a way to avoid copying.

konsoletyper commented 1 year ago

Please, wait, don't close this. I have some ideas, may be we'll return back to this PR few weeks later

Ihromant commented 10 months ago

@konsoletyper I suppose that after adaptation this PR can be reviewed and merged.