Open jrudolph opened 5 years ago
Awesome stuff, @jrudolph! These are great improvements!! The old "tricks" really don't pull their weight, especially on the newer JVMs.
One thing that might be worthwhile exploring to speed up string parsing even more is unrolling the `parseOneFast` loop to handle 4 characters in a single loop iteration. I could imagine that we get better instruction-level parallelism this way.
The beginning of this blog post discusses this effect for JSON parsing: https://chadaustin.me/2017/05/writing-a-really-really-fast-json-parser/
In the end it's all about reducing the average parsing cycles per character, so any parallelization that we can get will help.
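As a rough illustration of the 4-characters-per-iteration idea, a hypothetical Java sketch (not spray-json's actual `parseOneFast`; the `simple` predicate and all names here are made up for illustration):

```java
// Hypothetical sketch: scan ahead over "simple" chars (printable ASCII,
// no quote, no backslash) four at a time so the four loads and compares
// can overlap in the CPU pipeline. Not spray-json's actual code.
class UnrolledScan {
    static boolean simple(byte b) { // negative bytes (non-ASCII) are not simple
        return b >= 0x20 && b != '"' && b != '\\';
    }

    static int scanSimpleChars(byte[] input, int start) {
        int i = start;
        while (i + 4 <= input.length) {
            byte b0 = input[i], b1 = input[i + 1], b2 = input[i + 2], b3 = input[i + 3];
            if (!simple(b0)) return i;
            if (!simple(b1)) return i + 1;
            if (!simple(b2)) return i + 2;
            if (!simple(b3)) return i + 3;
            i += 4; // all four were simple: advance in one step
        }
        while (i < input.length && simple(input[i])) i++; // remainder
        return i; // index of the first char needing the slow path (or end of input)
    }
}
```

Whether this actually beats the straight loop depends on what C2 or Graal already does with it, which is exactly the discussion below.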
Another thing could be something similar to this line of borer's JSON parser: https://github.com/sirthias/borer/blob/master/core/src/main/scala/io/bullet/borer/json/JsonParser.scala#L417
Decoupling the character copying from the rest of the loop had an incredible effect on performance in borer's case. I suspect it's because the CPU can then freely move those instructions between everything else and make use of otherwise unused silicon capacity, so a lot of it essentially gets moved off the critical path and costs "nothing" anymore...
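The effect can be sketched like this (illustrative Java, not borer's actual code): the copy happens unconditionally before the branch, so the store doesn't depend on the branch outcome and the CPU can schedule it freely.

```java
// Illustrative sketch of decoupling the copy from the branching:
// write the char into the target buffer first, decide afterwards whether
// the loop terminates. The store is off the branch's critical path.
class DecoupledCopy {
    static int copyChars(byte[] input, char[] buf, int start) {
        int i = start;
        while (i < input.length) {
            byte b = input[i];
            buf[i - start] = (char) (b & 0xFF); // unconditional copy
            if (b == '"' || b == '\\' || b < 0x20) break; // termination check only
            i++;
        }
        return i - start; // number of plain chars copied
    }
}
```

Note the termination check also catches non-ASCII bytes (negative as Java `byte`), which would go to a slow path in a real parser.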
@sirthias I have found that C2 can do unrolling for some simple loops; in my case, only GraalVM required manual unrolling.
Yeah, the JVM might be able to do it itself. But this is hard to control, and small changes might break the JVM's ability to do it. So manual unrolling for this very hot code path appears to be worth the effort.
Here are throughput scores of benchmarks which parse a string of 128 ASCII chars on the same Intel Core i7 CPU but on different versions of JDK and GraalVM:
It is much harder to keep them high on different hardware.
> One thing that might be worthwhile exploring to speed up string parsing even more is unrolling the `parseOneFast` loop to handle 4 characters in a single loop iteration. I could imagine that we get better instruction-level parallelism this way.
Tried that in different variations but no positive results so far.
> Decoupling the character copying from the rest of the loop had an incredible effect on performance in borer's case.
Tried that as well in different versions but I also didn't find any clear wins. What helped indeed was moving the copying of the single character out of the branches.
I also experimented with using table lookups which also didn't help :)
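For reference, the table-lookup variant mentioned here (which didn't help in this case) would replace the chain of comparisons with a precomputed classification array, roughly like this sketch:

```java
// Sketch of char classification via table lookup instead of compares.
// SIMPLE[c] is true for printable ASCII needing no special handling.
class CharTable {
    static final boolean[] SIMPLE = new boolean[256];
    static {
        for (int c = 0x20; c < 0x80; c++) SIMPLE[c] = true; // printable ASCII
        SIMPLE['"'] = false;  // string terminator
        SIMPLE['\\'] = false; // escape introducer
    }
    // usage in the hot loop: if (!SIMPLE[input[i] & 0xFF]) { /* slow path */ }
}
```

The lookup trades a few predictable compares for a memory access, which on modern CPUs often isn't a win for such a small predicate.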
Regarding string parsing benchmarks in particular, I found that the biggest factor that will bias the result is the size of the internal buffer for collecting the characters. If every parser run reallocates that buffer, you can only get max performance in the benchmark if the buffer is already sized just big enough for the resulting string. You can probably improve a lot in these benchmarks when you bring your own buffer.
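A bring-your-own-buffer setup with geometric growth could look roughly like this (hypothetical sketch, not the API of any of the benchmarked libraries):

```java
// Sketch: the caller supplies (and keeps) the char buffer across parser runs,
// so a run only reallocates when a string outgrows the previous maximum.
class GrowableBuffer {
    static char[] ensureCapacity(char[] buf, int needed) {
        if (buf.length >= needed) return buf; // common case: no allocation
        int size = buf.length;
        while (size < needed) size <<= 1; // grow geometrically to amortize cost
        char[] bigger = new char[size];
        System.arraycopy(buf, 0, bigger, 0, buf.length);
        return bigger;
    }
}
```

With this scheme, a benchmark that reuses the returned buffer quickly converges to zero allocations per run, which is exactly the bias described above.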
There's also the problem that the JIT often doesn't end up in the same state; I've often seen ±20% between different runs, with 2 or 3 different steady states across runs.
Here are the benchmark results from jsoniter-scala when parsing 1000 bytes with a 1024 char buffer:
```
# VM version: JDK 1.8.0_212, OpenJDK 64-Bit Server VM, 25.212-b04
# VM invoker: /usr/lib/jvm/adoptopenjdk-8-jdk-hotspot/jre/bin/java
# VM options: -server -Xms2g -Xmx2g -XX:NewSize=1g -XX:MaxNewSize=1g -XX:InitialCodeCacheSize=512m -XX:ReservedCodeCacheSize=512m -XX:+UseParallelGC -XX:-UseBiasedLocking -XX:+AlwaysPreTouch

Benchmark                                      (size)       Score       Error  Units
StringOfAsciiCharsReading.avSystemGenCodec       1000  420377.975 ±  3314.706  ops/s
StringOfAsciiCharsReading.borerJson              1000  374747.750 ±  5469.807  ops/s
StringOfAsciiCharsReading.circe                  1000  296227.975 ±  1608.166  ops/s
StringOfAsciiCharsReading.dslJsonScala           1000  747348.589 ± 14188.010  ops/s
StringOfAsciiCharsReading.jacksonScala           1000  668431.496 ± 10057.838  ops/s
StringOfAsciiCharsReading.jsoniterJava           1000  663563.884 ±  4021.098  ops/s
StringOfAsciiCharsReading.jsoniterScala          1000  799233.199 ±  3866.261  ops/s
StringOfAsciiCharsReading.playJson               1000  697148.819 ± 13384.538  ops/s
StringOfAsciiCharsReading.scalikeJackson         1000  383223.384 ±  1653.179  ops/s
StringOfAsciiCharsReading.sprayJson              1000  516299.306 ± 12754.457  ops/s
StringOfAsciiCharsReading.uPickle                1000  396939.986 ±  8720.495  ops/s
StringOfEscapedCharsReading.avSystemGenCodec     1000   40351.801 ±  2746.862  ops/s
StringOfEscapedCharsReading.borerJson            1000   56583.767 ±  3943.939  ops/s
StringOfEscapedCharsReading.circe                1000   33025.217 ±  1741.437  ops/s
StringOfEscapedCharsReading.dslJsonScala         1000   71626.399 ±  2489.666  ops/s
StringOfEscapedCharsReading.jacksonScala         1000   60839.264 ±   687.483  ops/s
StringOfEscapedCharsReading.jsoniterJava         1000   99722.853 ± 14966.104  ops/s
StringOfEscapedCharsReading.jsoniterScala        1000  185767.529 ±  4066.498  ops/s
StringOfEscapedCharsReading.playJson             1000   59031.123 ±  6318.695  ops/s
StringOfEscapedCharsReading.scalikeJackson       1000   49066.981 ±   810.482  ops/s
StringOfEscapedCharsReading.sprayJson            1000   81830.164 ±  3713.948  ops/s
StringOfEscapedCharsReading.uPickle              1000   34547.016 ±   730.916  ops/s
StringOfNonAsciiCharsReading.avSystemGenCodec    1000  126835.981 ±  5900.099  ops/s
StringOfNonAsciiCharsReading.borerJson           1000   81477.660 ±  2066.403  ops/s
StringOfNonAsciiCharsReading.circe               1000  103151.358 ±  3512.299  ops/s
StringOfNonAsciiCharsReading.dslJsonScala        1000  189423.431 ±  1927.495  ops/s
StringOfNonAsciiCharsReading.jacksonScala        1000  101410.377 ± 19933.428  ops/s
StringOfNonAsciiCharsReading.jsoniterJava        1000  267360.452 ±  5517.855  ops/s
StringOfNonAsciiCharsReading.jsoniterScala       1000  193001.124 ±  2849.416  ops/s
StringOfNonAsciiCharsReading.playJson            1000  116526.373 ±  4094.327  ops/s
StringOfNonAsciiCharsReading.scalikeJackson      1000  121490.223 ±  6358.179  ops/s
StringOfNonAsciiCharsReading.sprayJson           1000  215865.902 ± 13995.871  ops/s
StringOfNonAsciiCharsReading.uPickle             1000  103967.342 ±  3001.080  ops/s
```
These are now very decent results for this PR, given that there are no tricks involved.
> It is much harder to keep them high on different hardware.
I think I'll keep spray-json reasonably simple, so I'll rather go with a somewhat simple solution even if that costs a bit of performance. That also has the benefit that it hasn't been optimized for any particular architecture and should still show reasonable (if maybe not top) performance.
> You can probably improve a lot in these benchmarks when you bring your own buffer.
Caching the buffer in a thread local between parsing runs is another 10-20% win indeed.
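A thread-local cache of the parse buffer, as described, could be as simple as this illustrative sketch (the 1024 initial size matches the benchmark setup above; all names are made up):

```java
// Sketch: cache the char buffer per thread so repeated parser runs on the
// same thread reuse (and keep) the largest buffer seen so far.
class BufferCache {
    static final ThreadLocal<char[]> CHAR_BUFFER =
        ThreadLocal.withInitial(() -> new char[1024]);

    static char[] charBuffer(int minSize) {
        char[] buf = CHAR_BUFFER.get();
        if (buf.length < minSize) {
            buf = new char[Integer.highestOneBit(minSize - 1) << 1]; // next power of two
            CHAR_BUFFER.set(buf); // remember the grown buffer for the next run
        }
        return buf;
    }
}
```

The `ThreadLocal.get()` itself has a small cost, which may explain why the same trick can show up as a slight loss in some setups, as noted below.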
Great stuff! Looking forward to seeing this released!
> Caching the buffer in a thread local between parsing runs is another 10-20% win indeed.
Interesting. In my tests with borer, adding a ThreadLocal cache of the char buffer somehow resulted in a slight performance decrease, at least on Graal. This perf work really is a complete trial-and-error process... unfortunately.
> This perf work really is a complete trial and error process... unfortunately.
It is; I'm also still looking for a more predictable workflow. I didn't test extensively on Graal, but from what I've seen, GraalCE was slower than HotSpot and JDK 11 was a bit faster than JDK 8.
String parsing dominates parsing times so far for real-world input (that isn't particularly number-heavy). So, here's a first attempt to improve performance.
The speed-ups mostly come from
Before:
After:
I.e. about 2x-3x faster for byte input. String parsing has been improved by ~3x, which leads to a 2x improvement for the test case (88k of JSON from the GitHub API).
As `ParserInput` needs to be changed, the change won't be simply binary compatible. I'll probably deprecate the current parser infrastructure and duplicate it with simpler entry points that don't leak so many implementation details for common cases.