timotheecour opened 4 years ago
The best thing would be if we managed to map `set[char]` operations, and loops over them, to SIMD instructions.
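For context, this is the kind of loop in question (a minimal sketch; `skipWs` is an illustrative name):

```nim
const Whitespace = {' ', '\t', '\r', '\n'}

proc skipWs(s: string, i: var int) =
  # Today `in` on a set[char] lowers to a per-byte bitset test; ideally the
  # compiler could turn this loop into SIMD compares over 16/32 bytes at a time.
  while i < s.len and s[i] in Whitespace:
    inc i
```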
If you look at Kostya's benchmarks, the fastest JSON parser seems to be the D one, and it appears to be about 2x faster than simdjson: https://github.com/kostya/benchmarks#json
Language | Time, s | Memory, MiB | Energy, J |
---|---|---|---|
GDC fast | 0.14 ± 00.00 | 231.46 ± 00.12 | 3.37 ± 00.18 |
Rust Serde Custom | 0.19 ± 00.00 | 108.54 ± 00.07 | 5.14 ± 00.06 |
Rust Serde Typed | 0.20 ± 00.00 | 120.33 ± 00.13 | 5.74 ± 00.73 |
C++ simdjson | 0.31 ± 00.00 | 496.82 ± 00.92 | 8.29 ± 00.46 |
C++ gason | 0.32 ± 00.00 | 313.18 ± 00.09 | 8.28 ± 00.42 |
C++ RapidJSON | 0.36 ± 00.00 | 345.17 ± 00.09 | 11.32 ± 00.29 |
Java | 0.49 ± 00.01 | 569.25 ± 24.45 | 14.19 ± 00.74 |
Scala | 0.57 ± 00.00 | 461.18 ± 04.73 | 19.27 ± 00.86 |
Julia JSON3 | 0.83 ± 00.01 | 834.65 ± 00.68 | 22.72 ± 01.15 |
C++ RapidJSON SAX | 0.89 ± 00.00 | 110.19 ± 00.04 | 23.35 ± 00.63 |
Go jsoniter | 0.98 ± 00.01 | 225.83 ± 00.13 | 26.63 ± 00.76 |
Node.js | 0.98 ± 00.00 | 293.57 ± 03.03 | 37.35 ± 01.44 |
Rust Serde Untyped | 1.04 ± 00.08 | 916.54 ± 00.06 | 27.79 ± 01.44 |
PyPy | 1.07 ± 00.00 | 405.69 ± 00.23 | 29.33 ± 00.85 |
Perl Cpanel::JSON::XS | 1.29 ± 00.02 | 517.07 ± 00.03 | 35.33 ± 00.53 |
Crystal Schema | 1.43 ± 00.01 | 157.33 ± 00.12 | 39.11 ± 01.87 |
Crystal Pull | 1.44 ± 00.02 | 128.49 ± 00.03 | 40.26 ± 02.61 |
Clojure | 1.60 ± 00.04 | 1542.98 ± 20.03 | 64.27 ± 01.99 |
PHP | 1.65 ± 00.02 | 803.43 ± 00.08 | 42.46 ± 01.94 |
Crystal | 1.74 ± 00.01 | 503.61 ± 00.03 | 59.80 ± 02.46 |
CPython UltraJSON | 2.06 ± 00.01 | 689.01 ± 01.60 | 66.12 ± 00.57 |
Go | 2.08 ± 00.01 | 400.68 ± 00.14 | 58.15 ± 01.63 |
V GCC | 2.08 ± 00.01 | 591.95 ± 00.05 | 54.50 ± 00.48 |
V Clang | 2.08 ± 00.01 | 592.46 ± 00.02 | 54.88 ± 00.66 |
Python | 2.33 ± 00.01 | 519.47 ± 00.04 | 63.84 ± 02.25 |
C++ json-c | 2.70 ± 00.02 | 1756.22 ± 00.06 | 70.98 ± 01.54 |
Nim GCC | 2.99 ± 00.01 | 908.77 ± 00.05 | 78.37 ± 00.66 |
Nim Clang | 3.06 ± 00.01 | 909.29 ± 00.04 | 80.44 ± 01.25 |
[EDIT] LDC, GDC, and DMD perform worse than Nim (see https://github.com/kostya/benchmarks#json); however, "GDC fast" performs a lot faster. Looking more closely into it (see https://github.com/kostya/benchmarks/blob/master/json/build.sh), that's because of this specialized library: https://github.com/mleise/fast

```
git clone --depth 1 https://github.com/mleise/fast.git
gdc -o json_d_gdc_fast -O3 -frelease test_fast.d fast/source/fast/cstring.d fast/source/fast/buffer.d fast/source/fast/json.d fast/source/fast/parsing.d fast/source/fast/intmath.d fast/source/fast/internal/sysdef.di fast/source/fast/internal/helpers.d fast/source/fast/unicode.d fast/source/fast/internal/unicode_tables.d fast/source/std/simd.d
```

As you can see, it does use SIMD, and perhaps other tricks; we need to look into it, in particular: https://github.com/mleise/fast/blob/master/source/fast/json.d

Compiling it locally with `ldmd2 -O5 -release -inline -boundscheck=off` gives 0.27 s, i.e. a 6.77x speedup over Nim using std/json, but I expect gdc to be even faster according to the benchmark.

packedjson improves things a bit, but not much (and using parseFile instead of readFile + parseJson improves a tiny bit, though the D version doesn't even use that):
```nim
import pkg/packedjson

let file = "/tmp/1.json"
let jobj = parseFile(file)
let coordinates = jobj["coordinates"]
let len = float(coordinates.len)
var x = 0.0
var y = 0.0
var z = 0.0
for coord in coordinates:
  x += coord["x"].getFloat
  y += coord["y"].getFloat
  z += coord["z"].getFloat
echo x / len
echo y / len
echo z / len
```
On my machine, this goes from 0:02.63 down to 0:01.59, i.e. a ~1.65x speedup.
We need to gprof/profile this; the gap is not good for Nim. More importantly, the optimization opportunities (e.g. SIMD) are likely to generalize and benefit other areas of Nim beyond JSON parsing.
Wojciech Mula and Daniel Lemire are the same people who wrote the JSON parser, btw.
Something that might also be of interest is the (unfortunately compiler specific) function multi-versioning that both GCC and Clang provide.
But do we need function multi-versioning when we have `when`?
The mechanism behind function multi-versioning happens at runtime, when the system loader loads an executable and its libraries.
This allows using a function implementation optimized for a particular instruction set extension, while still providing a fallback for older processors that don't have the instruction set extension.
In contrast, the `when` mechanism works at compile time. You can use it to compile functions optimized for a particular instruction set extension; however, executing those functions on architectures that do not support the extension will crash the program.
There are ways to do this in a more compiler-agnostic fashion; however, they involve either creating support DLLs (compiling multiple DLLs from one set of code, each targeting a particular set of extensions) or creating object files and linking in some form of runtime dispatch.
Note: MSVC allows mixing SIMD in binaries and does not require specific flags, so it causes no issue.
Currently, to allow multiple SIMD paths in the same GCC or Clang executable, we have the following options.
The SIMD flag can be passed as a `{.localPassC: "-mavx".}` pragma, for example.
This is what I use in Weave: https://github.com/mratsim/weave/blob/v0.3.0/benchmarks/matmul_gemm_blas/gemm_pure_nim/common/gemm_ukernel_avx2.nim#L6
In the past we had to use an undocumented feature of nim.cfg to specify per-file flags, but it caused deployment issues in Laser/Arraymancer.
This one is a bit more tricky at the moment, until https://github.com/nim-lang/Nim/issues/10682 is solved. We would then be able to do this:
```nim
when defined(gcc) or defined(clang):
  {.pragma: avx2, codegenDecl: "__attribute__((__target__(\"avx2\"))) $# $#$#".}
else:
  {.pragma: avx2.}

proc bar_generic() =
  echo 1

proc bar_avx2() {.avx2.} =
  echo 1
```
This can already be done, but it overwrites the N_LIB_PRIVATE and N_INLINE annotations that Nim adds, so you need some macro magic to check whether the proc is exported or inline and rebuild the proper codegenDecl.
Unfortunately, Nim cannot rely on GCC function-attribute dispatch AFAIK: it requires the functions to have the same name in C. AFAIK that's possible via `{.exportc.}`, but Nim complains (to be confirmed, it's been a while).
Alternatively, the functions can share a wrapper that checks the supported SIMD at runtime before calling the proper one. This can be done via a static enum checked on function entry (say, on JSON parsing and JSON exporting). Example: https://github.com/mratsim/weave/blob/v0.3.0/benchmarks/matmul_gemm_blas/gemm_pure_nim/gemm_weave.nim#L217-L254 with:
```nim
type CPUFeatureX86* = enum
  x86_Generic,
  x86_SSE,
  x86_SSE2,
  x86_SSE4_1,
  x86_AVX,
  x86_AVX_FMA,
  x86_AVX2
```
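A hedged sketch of such a runtime-dispatch wrapper, reusing the enum above (`parseJsonAvx2` and `parseJsonGeneric` are hypothetical placeholders, not Weave's actual API):

```nim
import std/json

# Hypothetical specialized implementations; real ones would live in modules
# compiled with the matching -mavx2 / generic flags.
proc parseJsonAvx2(s: string): JsonNode = parseJson(s)
proc parseJsonGeneric(s: string): JsonNode = parseJson(s)

# In practice this would be set once at startup via cpuid-based detection.
var cpuFeature = x86_Generic

proc parseJsonDispatch(s: string): JsonNode =
  # Route each call to the most specialized implementation the CPU supports.
  case cpuFeature
  of x86_AVX2: parseJsonAvx2(s)
  else: parseJsonGeneric(s)
```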
All three dispatching techniques require the same amount of code duplication.
The SIMD PR here, https://github.com/nim-lang/Nim/pull/11816, is well done and I'm replacing my wrapper of Facebook's cpuinfo with it (cpuinfo being the best lightweight CPU feature detection package, short of needing NUMA or hyper-threading detection).
For what it's worth, at least for that kostya benchmark, SIMD is not so critical. See this PR and a test program. I get within 1.1x of D's fast and within 1.4x of Lemire's SIMD-optimized On Demand parser (>700 MiB/s on my machine). (A PGO build is very necessary: a 1.5x speed-up, as is often the case with Nim-generated C.)
The executive summary is just that the current `parsejson` does the same work in triplicate. While that last 1.4x might be nice someday, de-triplicating should probably be the initial focus.
I have made a faster JSON serializer and deserializer in pure Nim, no SIMD (https://github.com/treeform/jsony). Why is it faster?
Currently the Nim standard module first parses or serializes json into JsonNodes and then turns the JsonNodes into your objects with the to() macro. This is slower and creates unnecessary work for the garbage collector. My library skips the JsonNodes and creates the objects you want directly.
Another speedup comes from not using StringStream. Stream has a function dispatch overhead because it has to be able to switch between StringStream and FileStream at runtime. Jsony skips that overhead and just directly writes to memory buffers.
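To make the difference concrete, here is a minimal sketch (assuming jsony's `fromJson` API and std/json's `to` macro):

```nim
import std/json
import jsony

type Coord = object
  x, y, z: float

let data = """{"x":1.0,"y":2.0,"z":3.0}"""

# std/json: text -> JsonNode tree -> object (extra allocations for the tree).
let viaNodes = data.parseJson().to(Coord)

# jsony: text -> object directly, no intermediate JsonNode tree.
let direct = data.fromJson(Coord)

assert direct == viaNodes
```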
```
name ............................... min time avg time std dv times
treeform/jsony .................... 10.121 ms 10.809 ms ±1.208 x100
planetis-m/eminim ................. 12.006 ms 12.574 ms ±1.156 x100
nim std/json ...................... 23.282 ms 34.679 ms ±7.512 x100
```

```
name ............................... min time avg time std dv times
treeform/jsony ..................... 4.047 ms 4.172 ms ±0.225 x100
planetis-m/eminim .................. 7.173 ms 7.324 ms ±0.253 x100
disruptek/jason ................... 10.220 ms 11.155 ms ±0.689 x100
nim std/json ...................... 11.526 ms 15.181 ms ±0.857 x100
```
Here is the bench on my machine with nim-json-serialization:
However, at the moment it brings in Chronos and BearSSL as dependencies :/ https://github.com/status-im/nim-json-serialization/issues/25
@mratsim Thanks very much for pointing out the json_serialization library. I studied how it goes this fast and have implemented what I learned in mine:
```
name ............................... min time avg time std dv times
treeform/jsony ..................... 6.724 ms 12.047 ms ±3.694 x100
status-im/nim-json-serialization ... 7.119 ms 14.276 ms ±2.033 x100
nim std/json ...................... 24.141 ms 38.741 ms ±5.417 x100
planetis-m/eminim ................. 10.974 ms 18.355 ms ±3.994 x100
```

```
name ............................... min time avg time std dv times
treeform/jsony ..................... 1.531 ms 2.779 ms ±0.091 x100
status-im/nim-json-serialization ... 2.043 ms 3.448 ms ±0.746 x100
planetis-m/eminim .................. 5.951 ms 9.305 ms ±3.210 x100
disruptek/jason ................... 10.312 ms 13.471 ms ±3.107 x100
nim std/json ...................... 12.551 ms 19.419 ms ±4.039 x100
```
Lessons learned:
- It's all about moving memory in an optimal way and not using allocators for strings.
- Not using `parseInt` and `$` was a big win. I had to write code that prints and parses numbers directly into a string at an index. This is something Nim should probably provide. (https://github.com/treeform/jsony/blob/master/src/jsony.nim#L471)
- Comparing a string or a slice is also a lot slower than just comparing character by character. (https://github.com/treeform/jsony/blob/master/src/jsony.nim#L276)
- `.add` on a string can be really slow. It's faster to grow the string only once, then fill it, then cap it off to the right length (see the sketch after this list). (https://github.com/treeform/jsony/blob/master/src/jsony.nim#L524)
- Using static compile-time strings for keys was also a speedup.
- A `template` is always faster than an `{.inline.}` proc, even if they both just return a string.

I hope this is useful for anyone who wants to implement fast parsers.
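A minimal sketch of the grow-once pattern from the `.add` lesson above (`dumpInts` and the buffer sizing are illustrative, not jsony's code):

```nim
proc dumpInts(xs: seq[int]): string =
  # Reserve a worst-case buffer once, fill it in place by index, then cap
  # it off to the real length, instead of many reallocating `add` calls.
  result = newString(xs.len * 21)   # worst case per int: sign + 19 digits + ','
  var i = 0
  for x in xs:
    let s = $x                      # stand-in for a direct int-to-buffer writer
    copyMem(result[i].addr, s[0].unsafeAddr, s.len)
    i += s.len
    result[i] = ','
    inc i
  result.setLen(max(i - 1, 0))      # cap off: drop the trailing comma

echo dumpInts(@[1, 22, -3])         # -> 1,22,-3
```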
@treeform, I hope I'm not too forward, but I'd love to see you contributing to nim-json-serialization. I'm sure you'll be able to deliver similar optimisations there as well, because I haven't even tried to optimise the performance yet.
nim-json-serialization and jsony both share the key architectural trait that makes parsing fast: you skip the dynamic `JsonNode`s and instead directly populate a strongly typed object. I'd say that nim-json-serialization has some additional advantages:
1) It's part of the nim-serialization family of libraries conforming to the same standard API. We also have libraries for TOML, ProtoBuf, SSZ and other formats that can all be tested with a single generic test suite defined in nim-serialization.
2) Since we've used the library for a long time in many projects now, it had to learn to handle many tricky aspects of Nim such as case objects, inheritance, when statements in the record fields, etc. What's even better is that this knowledge is embedded in the parent nim-serialization library and as we improve it, it improves all of the formats together. The common serialization API allows many advanced features such as skipped or renamed fields and a lot of options when defining custom serializers.
3) nim-json-serialization builds on top of nim-faststreams instead of working with strings only. This makes the library significantly more versatile and you'll get the same stellar performance when working with files, sockets and other input sources.
I've also taken a quick glance at the jsony source code and I can recognize some early mistakes that I also went through while developing nim-json-serialization. For example, if you handle the built-in types with `when` statements instead of overloads, your users will face significantly less noisy error messages when dealing with overload visibility problems.
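For illustration, a minimal skeleton of that approach (`dump` and the `dumpCustom` hook are hypothetical names, not nim-json-serialization's API):

```nim
proc dump*[T](s: var string, v: T) =
  # Built-in types are handled inline via `when`; only custom types fall
  # through to overload resolution, keeping user-facing errors short.
  when T is bool:
    s.add(if v: "true" else: "false")
  elif T is SomeNumber:
    s.add $v
  elif T is string:
    s.add '"'
    s.add v
    s.add '"'
  else:
    dumpCustom(s, v)  # hypothetical hook that users overload for their types
```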
> Lessons learned:

A lot of these lessons I would consider "tribal knowledge"; I think it would help the whole community if they made their way into a more widely shared document, perhaps an article on nim-lang.org.
@treeform I would encourage you to do a write-up, but I understand if you don't have the time, so I created an issue on the website repo for this: https://github.com/nim-lang/website/issues/251. If anyone is interested in doing a write-up, please say so in the issue :)
There was a good article recently about unintuitive decisions Microsoft made with respect to string comparison algos in Windows. Maybe we should see if we can steal something from that when reimplementing strutils.
A few good reusable techniques are at play in https://github.com/samuell/gccontent-benchmark/pull/22, e.g. replacing `case c of ...` with an `array[low(char)..high(char), T]` lookup table where possible (a sketch follows the snippet below).

https://github.com/nim-lang/Nim/pull/18183 shows how a trivial code change can give a 20x performance speedup (in jsonutils deserialization) by using a template instead of a proc, which allows lazy parameter evaluation in cases like this:

```nim
checkJson ok, $(json.len, num, numMatched, $T, json)
# The 2nd param `msg` is only evaluated on failure if `checkJson` is a template,
# but not if it's a proc.
```
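For the `case c of ...` to lookup-table replacement mentioned above, a minimal sketch (the classifier and its contents are illustrative):

```nim
type CharKind = enum ckOther, ckDigit, ckQuote, ckBracket

# Built once at compile time; a single array index then replaces a whole
# `case c of '0'..'9': ... of '"': ...` comparison chain per character.
const kindOf = block:
  var t: array[char, CharKind]
  for c in '0'..'9': t[c] = ckDigit
  t['"'] = ckQuote
  for c in {'{', '}', '[', ']'}: t[c] = ckBracket
  t

echo kindOf['7']  # ckDigit
```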
As mentioned by @araq, we could port/adapt/get inspired by https://github.com/lemire/simdjson to improve performance in several places, e.g. std/json or other parsers. Here's the talk: https://www.youtube.com/watch?v=wlvKAT7SZIQ&list=PLSXYOVE4fQ0-wJlM5CrS1e8XP2ZzvpYCe&index=20&t=0s ; highly recommended.

It has a number of distinct, useful ideas to improve performance that could be used (not just for JSON parsing) as reusable components:

- avoiding branching (to avoid branch misprediction) using clever tricks (tiny sketch after this list)
- using SIMD, AVX-512, etc.
- adding performance regression tests (we don't do that in testament, IIRC)
- UTF-8 validation, e.g. `_mm256_subs_epu8(current_bytes, 244)` has no branches; 1 instruction checks 32 bytes at once
- dynamic runtime dispatch (with caching of which function pointer to use) to pick the most optimized function for a given processor/architecture, without having to recompile
- caveats on how to evaluate performance where branch prediction is involved: when using random data with a fixed seed (even with, say, 2000 iterations), the CPU might "learn" to perfectly predict the branches, which won't reflect performance on real data with no fixed pattern
- ARM NEON and x64 processors have instructions to look up 16-byte tables in a vectorized manner (16 values at a time): `pshufb`, `tbl` => could that be used to optimize `set[char]` operations? ([EDIT] fixed from `set[int]`)
- fast number parsing (number parsing is expensive: 10 branch misses per FP number) => https://youtu.be/wlvKAT7SZIQ?list=PLSXYOVE4fQ0-wJlM5CrS1e8XP2ZzvpYCe&t=2368
- etc...
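A tiny flavor of the branch-avoidance idea in plain Nim (whether the final machine code is truly branchless is up to the C backend, so treat this as illustrative):

```nim
proc countDigits(s: string): int =
  # Accumulate the comparison as a 0/1 value instead of branching per char;
  # C compilers typically emit a flag-set + add here rather than a jump.
  for c in s:
    result += ord(c >= '0' and c <= '9')
```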
links