Open jrudolph opened 5 years ago
Awesome stuff, @jrudolph! These are great improvements!! The old "tricks" really don't pull their weight, especially on the newer JVMs.
One thing that might be worthwhile exploring to speed up string parsing even more is unrolling the `parseOneFast` loop to handle 4 characters in a single loop iteration. I could imagine that we get better instruction-level parallelism this way.
The beginning of this blog post discusses this effect for JSON parsing: https://chadaustin.me/2017/05/writing-a-really-really-fast-json-parser/
In the end it's all about reducing the average parsing cycles per character, so any parallelization that we can get will help.
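As a rough illustration of the 4-characters-per-iteration idea, a hypothetical Java sketch (not spray-json's actual `parseOneFast`; the `simple` predicate and all names here are made up for illustration):

```java
// Hypothetical sketch: scan ahead over "simple" chars (printable ASCII,
// no quote, no backslash) four at a time so the four loads and compares
// can overlap in the CPU pipeline. Not spray-json's actual code.
class UnrolledScan {
    static boolean simple(byte b) { // negative bytes (non-ASCII) are not simple
        return b >= 0x20 && b != '"' && b != '\\';
    }

    static int scanSimpleChars(byte[] input, int start) {
        int i = start;
        while (i + 4 <= input.length) {
            byte b0 = input[i], b1 = input[i + 1], b2 = input[i + 2], b3 = input[i + 3];
            if (!simple(b0)) return i;
            if (!simple(b1)) return i + 1;
            if (!simple(b2)) return i + 2;
            if (!simple(b3)) return i + 3;
            i += 4; // all four were simple: advance in one step
        }
        while (i < input.length && simple(input[i])) i++; // remainder
        return i; // index of the first char needing the slow path (or end of input)
    }
}
```

Whether this actually beats the straight loop depends on what C2 or Graal already does with it, which is exactly the discussion below.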
Another thing could be something similar to this line of borer's JSON parser: https://github.com/sirthias/borer/blob/master/core/src/main/scala/io/bullet/borer/json/JsonParser.scala#L417
Decoupling the character copying from the rest of the loop had an incredible effect on performance in borer's case. I suspect it's because the CPU can then freely move those instructions between everything else and make use of otherwise unused silicon capacity, so a lot of it essentially gets moved off the critical path and costs "nothing" anymore...
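The effect can be sketched like this (illustrative Java, not borer's actual code): the copy happens unconditionally before the branch, so the store doesn't depend on the branch outcome and the CPU can schedule it freely.

```java
// Illustrative sketch of decoupling the copy from the branching:
// write the char into the target buffer first, decide afterwards whether
// the loop terminates. The store is off the branch's critical path.
class DecoupledCopy {
    static int copyChars(byte[] input, char[] buf, int start) {
        int i = start;
        while (i < input.length) {
            byte b = input[i];
            buf[i - start] = (char) (b & 0xFF); // unconditional copy
            if (b == '"' || b == '\\' || b < 0x20) break; // termination check only
            i++;
        }
        return i - start; // number of plain chars copied
    }
}
```

Note the termination check also catches non-ASCII bytes (negative as Java `byte`), which would go to a slow path in a real parser.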
@sirthias I have found that C2 can do unrolling for some simple loops; in my case, only GraalVM required manual unrolling.
Yeah, the JVM might be able to do it itself. But this is hard to control, and small changes might break the JVM's ability to do it. So manual unrolling for this very hot code path appears to be worth the effort.
Here are throughput scores of benchmarks which parse a string of 128 ASCII chars on the same Intel Core i7 CPU but on different versions of JDK and GraalVM:
It is much harder to keep them high on different hardware.
> One thing that might be worthwhile exploring to speed up string parsing even more is unrolling the `parseOneFast` loop to handle 4 characters in a single loop iteration. I could imagine that we get better instruction-level parallelism this way.
Tried that in different variations but no positive results so far.
> Decoupling the character copying from the rest of the loop had an incredible effect on performance in borer's case.
Tried that as well in different versions but I also didn't find any clear wins. What helped indeed was moving the copying of the single character out of the branches.
I also experimented with using table lookups which also didn't help :)
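For reference, the table-lookup variant mentioned here (which didn't help in this case) would replace the chain of comparisons with a precomputed classification array, roughly like this sketch:

```java
// Sketch of char classification via table lookup instead of compares.
// SIMPLE[c] is true for printable ASCII needing no special handling.
class CharTable {
    static final boolean[] SIMPLE = new boolean[256];
    static {
        for (int c = 0x20; c < 0x80; c++) SIMPLE[c] = true; // printable ASCII
        SIMPLE['"'] = false;  // string terminator
        SIMPLE['\\'] = false; // escape introducer
    }
    // usage in the hot loop: if (!SIMPLE[input[i] & 0xFF]) { /* slow path */ }
}
```

The lookup trades a few predictable compares for a memory access, which on modern CPUs often isn't a win for such a small predicate.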
Regarding string parsing benchmarks in particular, I found that the biggest factor that will bias the result is the size of the internal buffer for collecting the characters. If every parser run reallocates that buffer, you can only get max performance in the benchmark if the buffer is already sized just big enough for the resulting string. You can probably improve a lot in these benchmarks when you bring your own buffer.
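A bring-your-own-buffer setup with geometric growth could look roughly like this (hypothetical sketch, not the API of any of the benchmarked libraries):

```java
// Sketch: the caller supplies (and keeps) the char buffer across parser runs,
// so a run only reallocates when a string outgrows the previous maximum.
class GrowableBuffer {
    static char[] ensureCapacity(char[] buf, int needed) {
        if (buf.length >= needed) return buf; // common case: no allocation
        int size = buf.length;
        while (size < needed) size <<= 1; // grow geometrically to amortize cost
        char[] bigger = new char[size];
        System.arraycopy(buf, 0, bigger, 0, buf.length);
        return bigger;
    }
}
```

With this scheme, a benchmark that reuses the returned buffer quickly converges to zero allocations per run, which is exactly the bias described above.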
There's also the problem that the JIT often doesn't end up in the same state; I've often seen ±20% between different runs, with 2 or 3 different steady states across runs.
Here are the benchmark results from jsoniter-scala when parsing 1000 bytes with a 1024 char buffer:
```
# VM version: JDK 1.8.0_212, OpenJDK 64-Bit Server VM, 25.212-b04
# VM invoker: /usr/lib/jvm/adoptopenjdk-8-jdk-hotspot/jre/bin/java
# VM options: -server -Xms2g -Xmx2g -XX:NewSize=1g -XX:MaxNewSize=1g -XX:InitialCodeCacheSize=512m -XX:ReservedCodeCacheSize=512m -XX:+UseParallelGC -XX:-UseBiasedLocking -XX:+AlwaysPreTouch

Benchmark                                      (size)       Score       Error  Units
StringOfAsciiCharsReading.avSystemGenCodec       1000  420377.975 ±  3314.706  ops/s
StringOfAsciiCharsReading.borerJson              1000  374747.750 ±  5469.807  ops/s
StringOfAsciiCharsReading.circe                  1000  296227.975 ±  1608.166  ops/s
StringOfAsciiCharsReading.dslJsonScala           1000  747348.589 ± 14188.010  ops/s
StringOfAsciiCharsReading.jacksonScala           1000  668431.496 ± 10057.838  ops/s
StringOfAsciiCharsReading.jsoniterJava           1000  663563.884 ±  4021.098  ops/s
StringOfAsciiCharsReading.jsoniterScala          1000  799233.199 ±  3866.261  ops/s
StringOfAsciiCharsReading.playJson               1000  697148.819 ± 13384.538  ops/s
StringOfAsciiCharsReading.scalikeJackson         1000  383223.384 ±  1653.179  ops/s
StringOfAsciiCharsReading.sprayJson              1000  516299.306 ± 12754.457  ops/s
StringOfAsciiCharsReading.uPickle                1000  396939.986 ±  8720.495  ops/s
StringOfEscapedCharsReading.avSystemGenCodec     1000   40351.801 ±  2746.862  ops/s
StringOfEscapedCharsReading.borerJson            1000   56583.767 ±  3943.939  ops/s
StringOfEscapedCharsReading.circe                1000   33025.217 ±  1741.437  ops/s
StringOfEscapedCharsReading.dslJsonScala         1000   71626.399 ±  2489.666  ops/s
StringOfEscapedCharsReading.jacksonScala         1000   60839.264 ±   687.483  ops/s
StringOfEscapedCharsReading.jsoniterJava         1000   99722.853 ± 14966.104  ops/s
StringOfEscapedCharsReading.jsoniterScala        1000  185767.529 ±  4066.498  ops/s
StringOfEscapedCharsReading.playJson             1000   59031.123 ±  6318.695  ops/s
StringOfEscapedCharsReading.scalikeJackson       1000   49066.981 ±   810.482  ops/s
StringOfEscapedCharsReading.sprayJson            1000   81830.164 ±  3713.948  ops/s
StringOfEscapedCharsReading.uPickle              1000   34547.016 ±   730.916  ops/s
StringOfNonAsciiCharsReading.avSystemGenCodec    1000  126835.981 ±  5900.099  ops/s
StringOfNonAsciiCharsReading.borerJson           1000   81477.660 ±  2066.403  ops/s
StringOfNonAsciiCharsReading.circe               1000  103151.358 ±  3512.299  ops/s
StringOfNonAsciiCharsReading.dslJsonScala        1000  189423.431 ±  1927.495  ops/s
StringOfNonAsciiCharsReading.jacksonScala        1000  101410.377 ± 19933.428  ops/s
StringOfNonAsciiCharsReading.jsoniterJava        1000  267360.452 ±  5517.855  ops/s
StringOfNonAsciiCharsReading.jsoniterScala       1000  193001.124 ±  2849.416  ops/s
StringOfNonAsciiCharsReading.playJson            1000  116526.373 ±  4094.327  ops/s
StringOfNonAsciiCharsReading.scalikeJackson      1000  121490.223 ±  6358.179  ops/s
StringOfNonAsciiCharsReading.sprayJson           1000  215865.902 ± 13995.871  ops/s
StringOfNonAsciiCharsReading.uPickle             1000  103967.342 ±  3001.080  ops/s
```
These are now very decent results for this PR, given that there are no tricks involved.
> It is much harder to keep them high on different hardware.
I think I'll keep spray-json reasonably simple, so I'll rather go with a somewhat simple solution even if that costs a bit of performance. That also has the benefit that it hasn't been optimized for any particular architecture and should still show reasonable (if maybe not top) performance.
> You can probably improve a lot in these benchmarks when you bring your own buffer.
Caching the buffer in a thread local between parsing runs is another 10-20% win indeed.
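A thread-local cache of the parse buffer, as described, could be as simple as this illustrative sketch (the 1024 initial size matches the benchmark setup above; all names are made up):

```java
// Sketch: cache the char buffer per thread so repeated parser runs on the
// same thread reuse (and keep) the largest buffer seen so far.
class BufferCache {
    static final ThreadLocal<char[]> CHAR_BUFFER =
        ThreadLocal.withInitial(() -> new char[1024]);

    static char[] charBuffer(int minSize) {
        char[] buf = CHAR_BUFFER.get();
        if (buf.length < minSize) {
            buf = new char[Integer.highestOneBit(minSize - 1) << 1]; // next power of two
            CHAR_BUFFER.set(buf); // remember the grown buffer for the next run
        }
        return buf;
    }
}
```

The `ThreadLocal.get()` itself has a small cost, which may explain why the same trick can show up as a slight loss in some setups, as noted below.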
Great stuff! Looking forward to seeing this released!
> Caching the buffer in a thread local between parsing runs is another 10-20% win indeed.
Interesting. In my tests with borer, adding a ThreadLocal cache of the char buffer somehow resulted in a slight performance decrease, at least on Graal. This perf work really is a complete trial-and-error process... unfortunately.
> This perf work really is a complete trial and error process... unfortunately.
It is; I'm also still looking for a more predictable workflow. I didn't test extensively on Graal, but from what I've seen, GraalCE was slower than HotSpot and JDK 11 was a bit faster than JDK 8.
String parsing dominates parsing times so far for real-world input (that isn't particularly number-heavy). So, here's a first attempt to improve performance.
The speed-ups mostly come from
Before:
After:
I.e. about 2x-3x faster for byte input. String parsing has been improved by ~3x, which leads to a 2x improvement for the test case (88k of JSON from the GitHub API).
As `ParserInput` needs to be changed, the change won't be simply binary compatible. I'll probably deprecate the current parser infrastructure and duplicate it with simpler entry points that don't leak so many implementation details for common cases.