Closed Ihromant closed 10 months ago
This approach has its disadvantage: potential inconsistency with Character.getType
You code is not working because in TeaVM strings aren't JS strings, but fair emulated Java object. They have characters field internally which contain array of character codes. You need to perform conversion, see how $rt_str
and $rt_ustr
used in other generators.
Another source of inconsistency: Character.toLowerCase, Character.isLowerCase
Character.toLowerCase can be implemented using same approach wrapToStr(chr).toLowerCase().charAt(0)
. Character.isLowerCase can be done with wrapToStr(chr).toLowerCase() === wrapToStr(chr)
.
I understand that JDK can have differences with specific browser in its chartables. But currently localed lower/uppercase is not implemented at all. Also tolower/uppercase without parameters according to documentation is toLowerCase(Locale.default()). Current implementation generally is toLowerCase(Locale.ROOT).
From other side, both JDK and browsers eventually should follow RFC/tables of Unicode consortium. So, assuming that there is a mismatch in JDK or browser, it will be fixed eventually and implementations will be the same.
The question here is whether this approach is acceptable and whether to continue working on this PR?
What about Character.getType?
I understand that JDK can have differences with specific browser in its chartables
I don't say about inconsistencies between browser and JVM. I'm telling about inconsistencies within TeaVM.
Also tolower/uppercase without parameters according to documentation is toLowerCase(Locale.default()). Current implementation generally is toLowerCase(Locale.ROOT).
Can you show me the real-world cases when it's really important?
Can you show me the real-world cases when it's really important?
It was just one point (not the most important). My point is that there are inconsistencies already. The question is which are more important. My goal is to have performant String methods and codesize.
What about Character.getType?
I can't come with solution. I found that in JS world in this cases they are using regexps, obviously this is not our case. From other side, I took a look at methods that are using Character.getType() in JDK.
TCharacter.isSpaceChar(int) (org.teavm.classlib.java.lang) // known and rarely changed
TCharacter.isUpperCase(int) (org.teavm.classlib.java.lang) // solvable
TCharacter.isTitleCase(int) (org.teavm.classlib.java.lang) // here could be issue, but rare https://stackoverflow.com/questions/31770995/difference-between-uppercase-and-titlecase
TCharacter.isLetter(int) (org.teavm.classlib.java.lang) // probably is the combination of other checks, but inconsistencies include only rare cases
TCharacter.isJavaIdentifierPart(int) (org.teavm.classlib.java.lang) // don't know what it is
TCharacter.isDefined(int) (org.teavm.classlib.java.lang) // suppose that solvable
TCharacter.isUnicodeIdentifierStart(int) (org.teavm.classlib.java.lang) // don't know what it is for now
TCharacter.isJavaIdentifierStart(int) (org.teavm.classlib.java.lang) // don't know what it is for now
TCharacter.isUnicodeIdentifierPart(int) (org.teavm.classlib.java.lang) // don't know what it is for now
TCharacter.isIdentifierIgnorable(int) (org.teavm.classlib.java.lang) // don't know what it is for now
TCharacter.isLowerCase(int) (org.teavm.classlib.java.lang) // solvable
TCharacter.isAlphabetic(int) (org.teavm.classlib.java.lang) // solvable
TCharacter.isDigit(int) (org.teavm.classlib.java.lang) // solvable
TCharacter.isLetterOrDigit(int) (org.teavm.classlib.java.lang) // solvable
So, generally what is unknown is the issue with Java-Unicode modifiers and this TitleCase case (which influences isLetter).
Now normal test run is successful, but it fails in optimized code for some reason. It seems that it's not properly converts String literals.
Did you try to debug the issue?
How to debug JS which is launched from unvisible Chrome? Here are all included parts that cause the fail:
function jl_String_toUpperCaseNative(var$1) {
return $rt_str($rt_ustr(var$1).toUpperCase());
}
...
var$7 = jl_String_toUpperCaseNative($rt_s(8));
...
var$8 = var$7.$hashCode0(); // here is fail
It was wrapped and unwrapped, but error is
java.lang.RuntimeException: null: (JavaScript) TypeError: var$7.$hashCode0 is not a function
rt_s(8) is constant pool constant corresponding to "UTF-8". In non-minified js there is
"$hashCode0", $rt_wrapFunction0(jl_String_hashCode)
which is not present in minified JS. But I don't know the reason why. Test which tests toLowerCase and toUpperCase passes, other tests are failing, for instance: createdFromByteArray (which I'm investigating)
My point is that there are inconsistencies already
Please, don't confise inconsistent and invalid behaviour. TeaVM's class library does not behave exactly according to spec, however, it's internally consistent. What you propose it to make some parts of TeaVM classlib to behave according to spec, and some (which are connected to previous parts) not according to spec. That was my point. The problem here is that it's impossible to achieve consistency, since JS is missing some functionality.
My goal is to have performant String methods and codesize.
As for performance: you can try to fix this PR and measure performance of generated code with and without this patch? Can you investigate JS benchmarking tools like JMH for Java?
How to debug JS which is launched from unvisible Chrome?
You do know where to find these JS files, right? So what's the problem to open them in the browser? BTW, TeaVM already cares about it and puts html files as well. You just open HTML file in your browser and then open debugging tools.
I noticed that you are using $rt_str
and $rt_ustr
as direct text. That's wrong, please find out any of their usages in existing code. There was a special function to write calls to runtime.
java.lang.RuntimeException: null: (JavaScript) TypeError: var$7.$hashCode0 is not a function
Sure. Dependency analyzer knows nothing about internals of your native function. You should hint it by implementing DependencyPlugin
and binding it to method with @PluggableDependency
. Just tell it that the resulting value is java.lang.String
(i.e. call getResult().propagate
)
@konsoletyper many thanks for help. Tests now are working. Added tests for uppercase and lowercase. It seems that most popular case of native upperCase and LowerCase is consistent with JVM Locales.
Just tried simplest possible benchmark.
List<String> strings = IntStream.range(0, 1000000).mapToObj(i -> UUID.randomUUID().toString()).collect(Collectors.toList());
int l = 0;
long time = System.currentTimeMillis();
for (String s : strings) {
l += s.toUpperCase().hashCode();
}
System.out.println((System.currentTimeMillis() - time) + " " + l);
Generated two files classes_native.js and classes_old.js. Then run it it chromium and firefox by replacing js filename.
Results:
|Browser
|old
|new
|
|Chromium
|2900-3300
|2000-2400
|
|Firefox
|2700-2900
|2100-2300
|
Seems that on my computer firefox is faster and produces more stable results. Still, it seems that even this small change produces 25-30% performance improvement.
long time =
?It seems that most popular case of native upperCase and LowerCase is consistent with JVM Locales.
Good, but they aren't consistent with other locale-specific methods of TeaVM
private static final int SAMPLE_SIZE = 100_000;
public static void main(String[] args) {
String[] strings = new String[SAMPLE_SIZE];
generateSample(strings, () -> UUID.randomUUID().toString());
int l = 0;
for (String s : strings) {
l -= s.toUpperCase().hashCode();
}
System.out.println("Warmup: " + l);
generateSample(strings, () -> UUID.randomUUID().toString());
long time = System.currentTimeMillis();
for (String s : strings) {
l += s.toUpperCase().hashCode();
}
System.out.println("UUID strings: " + (System.currentTimeMillis() - time) + ", sink: " + l);
generateSample(strings, () -> UUID.randomUUID().toString().substring(0, ThreadLocalRandom.current().nextInt(1, 5)));
time = System.currentTimeMillis();
for (String s : strings) {
l += s.toUpperCase().hashCode();
}
System.out.println("Short strings: " + (System.currentTimeMillis() - time) + ", sink: " + l);
generateSample(strings, Client::generateLongString);
time = System.currentTimeMillis();
for (String s : strings) {
l += s.toUpperCase().length();
}
System.out.println("Long strings: " + (System.currentTimeMillis() - time) + ", sink: " + l);
generateSample(strings, Client::generateMixed);
time = System.currentTimeMillis();
for (String s : strings) {
l += s.toUpperCase().length();
}
System.out.println("Mixed strings: " + (System.currentTimeMillis() - time) + ", sink: " + l);
}
private static void generateSample(String[] array, Supplier<String> generator) {
for (int i = 0; i < array.length; i++) {
array[i] = generator.get();
}
}
private static String generateLongString() {
StringBuilder builder = new StringBuilder();
int size = ThreadLocalRandom.current().nextInt(90, 100);
for (int i = 0; i < size; i++) {
builder.append(UUID.randomUUID());
}
return builder.toString();
}
private static String generateMixed() {
StringBuilder builder = new StringBuilder();
for (int i = 0; i < 100; i++) {
builder.append(UUID.randomUUID());
}
return builder.substring(0, ThreadLocalRandom.current().nextInt(1, 3500));
}
Old code size: 59,8 KiB
New code size: 55.0 KiB
|Browser
|old uuid
|new uuid
|old short
|new short
|
|Chromium
|283,286,290
|196,141,206
|25,26,26
|83,48,68
|
|Firefox
|283,272,276
|224,236,222
|59,61,59
|94,90,110
|
|Browser
|old long
|new long
|old mixed
|new mixed
|
|Chromium
|10453,10796,10621
|3926,3916,3981
|5410,5622,5554
|2048,2056,2166
|
|Firefox
|20633,20107,21322
|5155,5171,5090
|11154,10286,11401
|2630,2578,2644
|
Seems that old method is faster on short strings (1-5 characters), but new measured values seem to be polluted by some side-effects. Still, on UUID-size (36 chars) new method is already faster and produced measurements are reliable. On long strings new approach is significantly faster (up to 3-4 times).
I just pushed optimization for toLowerCase/toUpperCase, and after it these methods works 2.5 times faster. So performance is no more a reason to apply this PR, at least until there's a way to avoid copying.
Please, wait, don't close this. I have some ideas, may be we'll return back to this PR few weeks later
@konsoletyper I suppose that after adaptation this PR can be reviewed and merged.
@konsoletyper tried to implement native toUpperCase and toLowerCase methods for String. There are few reasons to do this.
For some reason generated methods are not working. I checked that String -> JSString conversion is made just by
return str;
, so I assumed that variable in method is also native String. But for some reason it is not working.