nim-lang / RFCs

A repository for your Nim proposals.

[performance] performance tricks to improve std/json, parsers, lookups #188

Open timotheecour opened 4 years ago

timotheecour commented 4 years ago

As mentioned by @araq, we could port/adapt/get inspired by https://github.com/lemire/simdjson to improve performance in several places, e.g. std/json or other parsers. Here's the talk: https://www.youtube.com/watch?v=wlvKAT7SZIQ&list=PLSXYOVE4fQ0-wJlM5CrS1e8XP2ZzvpYCe&index=20&t=0s ; highly recommended.

It has a number of distinct, useful ideas for improving performance that could be used (not just for JSON parsing) as reusable components:

links

Araq commented 4 years ago

The best thing would be if we managed to map `set[char]` and loops over them to SIMD instructions.

mratsim commented 4 years ago

If you look at Kostya's benchmarks, the fastest JSON parser seems to be the D one, and it seems to be 2x faster than simdjson: https://github.com/kostya/benchmarks#json

| Language | Time, s | Memory, MiB | Energy, J |
| --- | --- | --- | --- |
| GDC fast | 0.14 ± 00.00 | 231.46 ± 00.12 | 3.37 ± 00.18 |
| Rust Serde Custom | 0.19 ± 00.00 | 108.54 ± 00.07 | 5.14 ± 00.06 |
| Rust Serde Typed | 0.20 ± 00.00 | 120.33 ± 00.13 | 5.74 ± 00.73 |
| C++ simdjson | 0.31 ± 00.00 | 496.82 ± 00.92 | 8.29 ± 00.46 |
| C++ gason | 0.32 ± 00.00 | 313.18 ± 00.09 | 8.28 ± 00.42 |
| C++ RapidJSON | 0.36 ± 00.00 | 345.17 ± 00.09 | 11.32 ± 00.29 |
| Java | 0.49 ± 00.01 | 569.25 ± 24.45 | 14.19 ± 00.74 |
| Scala | 0.57 ± 00.00 | 461.18 ± 04.73 | 19.27 ± 00.86 |
| Julia JSON3 | 0.83 ± 00.01 | 834.65 ± 00.68 | 22.72 ± 01.15 |
| C++ RapidJSON SAX | 0.89 ± 00.00 | 110.19 ± 00.04 | 23.35 ± 00.63 |
| Go jsoniter | 0.98 ± 00.01 | 225.83 ± 00.13 | 26.63 ± 00.76 |
| Node.js | 0.98 ± 00.00 | 293.57 ± 03.03 | 37.35 ± 01.44 |
| Rust Serde Untyped | 1.04 ± 00.08 | 916.54 ± 00.06 | 27.79 ± 01.44 |
| PyPy | 1.07 ± 00.00 | 405.69 ± 00.23 | 29.33 ± 00.85 |
| Perl Cpanel::JSON::XS | 1.29 ± 00.02 | 517.07 ± 00.03 | 35.33 ± 00.53 |
| Crystal Schema | 1.43 ± 00.01 | 157.33 ± 00.12 | 39.11 ± 01.87 |
| Crystal Pull | 1.44 ± 00.02 | 128.49 ± 00.03 | 40.26 ± 02.61 |
| Clojure | 1.60 ± 00.04 | 1542.98 ± 20.03 | 64.27 ± 01.99 |
| PHP | 1.65 ± 00.02 | 803.43 ± 00.08 | 42.46 ± 01.94 |
| Crystal | 1.74 ± 00.01 | 503.61 ± 00.03 | 59.80 ± 02.46 |
| CPython UltraJSON | 2.06 ± 00.01 | 689.01 ± 01.60 | 66.12 ± 00.57 |
| Go | 2.08 ± 00.01 | 400.68 ± 00.14 | 58.15 ± 01.63 |
| V GCC | 2.08 ± 00.01 | 591.95 ± 00.05 | 54.50 ± 00.48 |
| V Clang | 2.08 ± 00.01 | 592.46 ± 00.02 | 54.88 ± 00.66 |
| Python | 2.33 ± 00.01 | 519.47 ± 00.04 | 63.84 ± 02.25 |
| C++ json-c | 2.70 ± 00.02 | 1756.22 ± 00.06 | 70.98 ± 01.54 |
| Nim GCC | 2.99 ± 00.01 | 908.77 ± 00.05 | 78.37 ± 00.66 |
| Nim Clang | 3.06 ± 00.01 | 909.29 ± 00.04 | 80.44 ± 01.25 |

timotheecour commented 4 years ago

[EDIT]

```sh
git clone --depth 1 https://github.com/mleise/fast.git

gdc -o json_d_gdc_fast -O3 -frelease test_fast.d fast/source/fast/cstring.d fast/source/fast/buffer.d fast/source/fast/json.d fast/source/fast/parsing.d fast/source/fast/intmath.d fast/source/fast/internal/sysdef.di fast/source/fast/internal/helpers.d fast/source/fast/unicode.d fast/source/fast/internal/unicode_tables.d fast/source/std/simd.d
```

As you can see, it does use SIMD, and perhaps other tricks; we need to look into it. In particular: https://github.com/mleise/fast/blob/master/source/fast/json.d

Note: packedjson improves things a bit, but not much (and using parseFile over readFile + parseJson improves a tiny bit more, though that's not even used in the D version):

```nim
import pkg/packedjson
let file = "/tmp/1.json"
let jobj = parseFile(file)
let coordinates = jobj["coordinates"]
let len = float(coordinates.len)

var x = 0.0
var y = 0.0
var z = 0.0

for coord in coordinates:
  x += coord["x"].getFloat
  y += coord["y"].getFloat
  z += coord["z"].getFloat

echo x / len
echo y / len
echo z / len
```

On my machine, this goes from 0:02.63 down to 0:01.59, i.e. a ~1.65x speedup.

We need to gprof/profile this; the gap is not good for Nim.

More importantly, the optimization opportunities (e.g. SIMD) are likely to generalize and benefit other areas of Nim beyond JSON parsing.

Varriount commented 4 years ago

https://core.ac.uk/download/pdf/150449188.pdf

mratsim commented 4 years ago

https://core.ac.uk/download/pdf/150449188.pdf

Wojciech Muła and Daniel Lemire are the same people who wrote the JSON parser, btw.

Varriount commented 4 years ago

Something that might also be of interest is the (unfortunately compiler specific) function multi-versioning that both GCC and Clang provide.

timotheecour commented 4 years ago

But do we need function multi-versioning when we have `when`?

Varriount commented 4 years ago

The mechanism behind function multi-versioning happens at runtime, when the system loader is loading an executable and its libraries.

This allows using a function implementation optimized for a particular instruction set extension, while still providing a fallback for older processors that don't have the instruction set extension.

In contrast, the `when` mechanism works at compile time. You can use it to compile functions optimized for a particular instruction set extension; however, executing those functions on processors that do not support that extension will crash the program.

There are ways to do this in a more compiler-agnostic fashion; however, they involve either building support DLLs (compiling multiple DLLs from one set of code, each targeting a particular set of extensions) or creating object files and linking in some form of runtime dispatch.

mratsim commented 4 years ago

Note: MSVC allows mixing SIMD in binaries and does not require specific flags, so it causes no issues.

Currently, to allow multiple SIMD paths in the same GCC or Clang executable, we have the following options.

Each SIMD target in separate files

The SIMD flag can be passed as a {.localPassC:"-mavx".} pragma for example. This is what I use in Weave: https://github.com/mratsim/weave/blob/v0.3.0/benchmarks/matmul_gemm_blas/gemm_pure_nim/common/gemm_ukernel_avx2.nim#L6

In the past we had to use an undocumented feature of nim.cfg to specify per-file flags, but it caused deployment issues in Laser/Arraymancer.

Multiple SIMD target in the same file

This one is a bit more tricky at the moment until https://github.com/nim-lang/Nim/issues/10682 is solved.

We would then be able to do this:

```nim
when defined(gcc) or defined(clang):
  {.pragma: avx2, codegenDecl: "__attribute__((__target__(\"avx2\"))) $# $#$#".}
else:
  {.pragma: avx2.}

proc bar_generic() =
  echo 1

proc bar_avx2() {.avx2.} =
  echo 1
```

This can already be done, but it overwrites all the N_LIB_PRIVATE and N_INLINE annotations that Nim adds, so you need some macro magic to check whether the proc is exported or inline and rebuild the proper codegenDecl.

Dispatch

Unfortunately, Nim cannot rely on GCC function-attribute dispatch AFAIK, because it requires the functions to have the same name in C. That should be possible via `{.exportc.}`, but Nim complains (to be confirmed; it's been a while).

Alternatively, the functions can share a wrapper that checks the SIMD features supported at runtime before calling the proper implementation.

All three dispatching techniques require the same amount of code duplication.

SIMD

The SIMD PR here (https://github.com/nim-lang/Nim/pull/11816) is well done, and I'm replacing my wrapper of Facebook's cpuinfo with it (cpuinfo being the best lightweight CPU feature detection package, apart from NUMA or hyper-threading detection needs).

c-blake commented 3 years ago

For what it's worth, at least for that kostya benchmark, SIMD is not so critical. See this PR and a test program. I get within 1.1x of D's fast and within 1.4x of Lemire's optimized SIMD on demand (>700 MiB/s on my machine). (A PGO build is very necessary, giving a 1.5x speedup, as is often the case with Nim-generated C.)

The executive summary is that the current parsejson simply does the same work in triplicate. While that last 1.4x might be nice someday, de-triplicating should probably be the initial focus.

treeform commented 3 years ago

I have made a faster JSON serializer and deserializer (in pure Nim, no SIMD): https://github.com/treeform/jsony. Why is it faster?

Currently the Nim standard module first parses or serializes json into JsonNodes and then turns the JsonNodes into your objects with the to() macro. This is slower and creates unnecessary work for the garbage collector. My library skips the JsonNodes and creates the objects you want directly.

Another speedup comes from not using StringStream. Streams have a function dispatch overhead because they must be able to switch between StringStream and FileStream at runtime. Jsony skips that overhead and writes directly to memory buffers.

Parse speed.

```
name ............................... min time      avg time    std dv  times
treeform/jsony .................... 10.121 ms     10.809 ms    ±1.208   x100
planetis-m/eminim ................. 12.006 ms     12.574 ms    ±1.156   x100
nim std/json ...................... 23.282 ms     34.679 ms    ±7.512   x100
```

Serialize speed.

```
name ............................... min time      avg time    std dv  times
treeform/jsony ..................... 4.047 ms      4.172 ms    ±0.225   x100
planetis-m/eminim .................. 7.173 ms      7.324 ms    ±0.253   x100
disruptek/jason ................... 10.220 ms     11.155 ms    ±0.689   x100
nim std/json ...................... 11.526 ms     15.181 ms    ±0.857   x100
```

mratsim commented 3 years ago

Here is the bench on my machine with nim-json-serialization:

*(screenshot of benchmark results)*

However at the moment it brings Chronos and BearSSL as a dependency :/ https://github.com/status-im/nim-json-serialization/issues/25

treeform commented 3 years ago

@mratsim Thanks very much for pointing out the json_serialization library. I studied how it achieves its speed and have implemented what I learned in mine:

Parse speed.

```
name ............................... min time      avg time    std dv  times
treeform/jsony ..................... 6.724 ms     12.047 ms    ±3.694   x100
status-im/nim-json-serialization ... 7.119 ms     14.276 ms    ±2.033   x100
planetis-m/eminim ................. 10.974 ms     18.355 ms    ±3.994   x100
nim std/json ...................... 24.141 ms     38.741 ms    ±5.417   x100
```

Serialize speed.

```
name ............................... min time      avg time    std dv  times
treeform/jsony ..................... 1.531 ms      2.779 ms    ±0.091   x100
status-im/nim-json-serialization ... 2.043 ms      3.448 ms    ±0.746   x100
planetis-m/eminim .................. 5.951 ms      9.305 ms    ±3.210   x100
disruptek/jason ................... 10.312 ms     13.471 ms    ±3.107   x100
nim std/json ...................... 12.551 ms     19.419 ms    ±4.039   x100
```

Lessons learned:

I hope this is useful for any one who wants to implement fast parsers.

zah commented 3 years ago

@treeform, I hope I'm not being too forward, but I'd love to see you contribute to nim-json-serialization. I'm sure you'll be able to deliver similar optimisations there as well, because I haven't even tried to optimise its performance yet.

nim-json-serialization and jsony share the key architectural trait that makes parsing fast: you skip the dynamic JsonNodes and instead directly populate a strongly typed object. I'd say nim-json-serialization has some additional advantages:

1) It's part of the nim-serialization family of libraries conforming to the same standard API. We also have libraries for TOML, ProtoBuf, SSZ and other formats that can all be tested with a single generic test suite defined in nim-serialization.

2) Since we've used the library for a long time in many projects now, it has learned to handle many tricky aspects of Nim, such as case objects, inheritance, `when` statements in the record fields, etc. What's even better is that this knowledge is embedded in the parent nim-serialization library, and as we improve it, all of the formats improve together. The common serialization API allows many advanced features such as skipped or renamed fields and a lot of options when defining custom serializers.

3) nim-json-serialization builds on top of nim-faststreams instead of working with strings only. This makes the library significantly more versatile and you'll get the same stellar performance when working with files, sockets and other input sources.

I've also taken a quick glance at the jsony source code, and I recognize some early mistakes that I also went through while developing nim-json-serialization. For example, if you handle the built-in types with `when` statements instead of overloads, your users will face significantly less noisy error messages when dealing with overload visibility problems.

dom96 commented 3 years ago

> Lessons learned:

A lot of these lessons I would consider "tribal knowledge"; I think it would help the whole community if they made their way into a more widely shared document, perhaps an article on nim-lang.org.

@treeform I would encourage you to do a write-up, but I understand if you don't have the time, so I created an issue on the website repo for this: https://github.com/nim-lang/website/issues/251. If anyone is interested in doing a write-up, please say so in the issue :)

disruptek commented 3 years ago

There was a good article recently about unintuitive decisions Microsoft made with respect to string comparison algos in Windows. Maybe we should see if we can steal something from that when reimplementing strutils.

timotheecour commented 3 years ago

A few good reusable techniques are at play here: https://github.com/samuell/gccontent-benchmark/pull/22

timotheecour commented 3 years ago

https://github.com/nim-lang/Nim/pull/18183 shows how a trivial code change can give a 20x performance speedup (in jsonutils deserialization) by using a template instead of a proc, which allows lazy parameter evaluation in cases like this:

```nim
checkJson ok, $(json.len, num, numMatched, $T, json)
  # The 2nd param `msg` is only evaluated on failure if checkJson is a template, but not if it's a proc
```