timotheecour opened 4 years ago
The best thing would be if we managed to map `set[char]` operations, and loops over them, to SIMD instructions.
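For context, this is the kind of loop in question (a minimal sketch; `skipWs` is an illustrative name):

```nim
const Whitespace = {' ', '\t', '\r', '\n'}

proc skipWs(s: string, i: var int) =
  # Today `in` on a set[char] lowers to a per-byte bitset test; ideally the
  # compiler could turn this loop into SIMD compares over 16/32 bytes at a time.
  while i < s.len and s[i] in Whitespace:
    inc i
```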
If you look at Kostya's benchmarks, the fastest JSON parser seems to be the D one, and it appears to be about 2x faster than simdjson: https://github.com/kostya/benchmarks#json
Language | Time, s | Memory, MiB | Energy, J |
---|---|---|---|
GDC fast | 0.14 ± 00.00 | 231.46 ± 00.12 | 3.37 ± 00.18 |
Rust Serde Custom | 0.19 ± 00.00 | 108.54 ± 00.07 | 5.14 ± 00.06 |
Rust Serde Typed | 0.20 ± 00.00 | 120.33 ± 00.13 | 5.74 ± 00.73 |
C++ simdjson | 0.31 ± 00.00 | 496.82 ± 00.92 | 8.29 ± 00.46 |
C++ gason | 0.32 ± 00.00 | 313.18 ± 00.09 | 8.28 ± 00.42 |
C++ RapidJSON | 0.36 ± 00.00 | 345.17 ± 00.09 | 11.32 ± 00.29 |
Java | 0.49 ± 00.01 | 569.25 ± 24.45 | 14.19 ± 00.74 |
Scala | 0.57 ± 00.00 | 461.18 ± 04.73 | 19.27 ± 00.86 |
Julia JSON3 | 0.83 ± 00.01 | 834.65 ± 00.68 | 22.72 ± 01.15 |
C++ RapidJSON SAX | 0.89 ± 00.00 | 110.19 ± 00.04 | 23.35 ± 00.63 |
Go jsoniter | 0.98 ± 00.01 | 225.83 ± 00.13 | 26.63 ± 00.76 |
Node.js | 0.98 ± 00.00 | 293.57 ± 03.03 | 37.35 ± 01.44 |
Rust Serde Untyped | 1.04 ± 00.08 | 916.54 ± 00.06 | 27.79 ± 01.44 |
PyPy | 1.07 ± 00.00 | 405.69 ± 00.23 | 29.33 ± 00.85 |
Perl Cpanel::JSON::XS | 1.29 ± 00.02 | 517.07 ± 00.03 | 35.33 ± 00.53 |
Crystal Schema | 1.43 ± 00.01 | 157.33 ± 00.12 | 39.11 ± 01.87 |
Crystal Pull | 1.44 ± 00.02 | 128.49 ± 00.03 | 40.26 ± 02.61 |
Clojure | 1.60 ± 00.04 | 1542.98 ± 20.03 | 64.27 ± 01.99 |
PHP | 1.65 ± 00.02 | 803.43 ± 00.08 | 42.46 ± 01.94 |
Crystal | 1.74 ± 00.01 | 503.61 ± 00.03 | 59.80 ± 02.46 |
CPython UltraJSON | 2.06 ± 00.01 | 689.01 ± 01.60 | 66.12 ± 00.57 |
Go | 2.08 ± 00.01 | 400.68 ± 00.14 | 58.15 ± 01.63 |
V GCC | 2.08 ± 00.01 | 591.95 ± 00.05 | 54.50 ± 00.48 |
V Clang | 2.08 ± 00.01 | 592.46 ± 00.02 | 54.88 ± 00.66 |
Python | 2.33 ± 00.01 | 519.47 ± 00.04 | 63.84 ± 02.25 |
C++ json-c | 2.70 ± 00.02 | 1756.22 ± 00.06 | 70.98 ± 01.54 |
Nim GCC | 2.99 ± 00.01 | 908.77 ± 00.05 | 78.37 ± 00.66 |
Nim Clang | 3.06 ± 00.01 | 909.29 ± 00.04 | 80.44 ± 01.25 |
[EDIT] LDC, GDC, and DMD perform worse than Nim (see https://github.com/kostya/benchmarks#json); however, "GDC fast" performs a lot faster. Looking more closely into it (see https://github.com/kostya/benchmarks/blob/master/json/build.sh), that's because of this specialized library: https://github.com/mleise/fast

```
git clone --depth 1 https://github.com/mleise/fast.git
gdc -o json_d_gdc_fast -O3 -frelease test_fast.d fast/source/fast/cstring.d fast/source/fast/buffer.d fast/source/fast/json.d fast/source/fast/parsing.d fast/source/fast/intmath.d fast/source/fast/internal/sysdef.di fast/source/fast/internal/helpers.d fast/source/fast/unicode.d fast/source/fast/internal/unicode_tables.d fast/source/std/simd.d
```

As you can see, it does use SIMD, and perhaps other tricks; we need to look into it, in particular: https://github.com/mleise/fast/blob/master/source/fast/json.d

Compiling it locally with `ldmd2 -O5 -release -inline -boundscheck=off` gives 0.27 s, i.e. a 6.77x speedup over Nim using std/json, but I expect gdc to be even faster according to the benchmark.

packedjson improves things a bit, but not much (and using parseFile instead of readFile + parseJson improves a tiny bit, though the D version doesn't even use that):
```nim
import pkg/packedjson

let file = "/tmp/1.json"
let jobj = parseFile(file)
let coordinates = jobj["coordinates"]
let len = float(coordinates.len)
var x = 0.0
var y = 0.0
var z = 0.0
for coord in coordinates:
  x += coord["x"].getFloat
  y += coord["y"].getFloat
  z += coord["z"].getFloat
echo x / len
echo y / len
echo z / len
```
On my machine, this goes from 0:02.63 down to 0:01.59, i.e. a ~1.65x speedup.
We need to gprof/profile this; the gap is not good for Nim. More importantly, the optimization opportunities (e.g. SIMD) are likely to generalize and benefit other areas of Nim beyond JSON parsing.
Wojciech Mula and Daniel Lemire are the same people who wrote the JSON parser, btw.
Something that might also be of interest is the (unfortunately compiler specific) function multi-versioning that both GCC and Clang provide.
But do we need function multi-versioning when we have `when`?
The mechanism behind function multi-versioning happens at runtime, when the system loader loads an executable and its libraries.
This allows using a function implementation optimized for a particular instruction set extension, while still providing a fallback for older processors that don't have the instruction set extension.
In contrast, the `when` mechanism works at compile time. You can use it to compile functions optimized for a particular instruction set extension; however, executing those functions on architectures that do not support the extension will crash the program.
There are ways to do this in a more compiler-agnostic fashion; however, they involve either creating support DLLs (compiling multiple DLLs from one set of code, each targeting a particular set of extensions) or creating object files and linking in some form of runtime dispatch.
Note: MSVC allows mixing SIMD in binaries and does not require specific flags, so it causes no issue.
Currently, to allow multiple SIMD paths in the same GCC or Clang executable, we have the following options.
The SIMD flag can be passed as a `{.localPassC: "-mavx".}` pragma, for example.
This is what I use in Weave: https://github.com/mratsim/weave/blob/v0.3.0/benchmarks/matmul_gemm_blas/gemm_pure_nim/common/gemm_ukernel_avx2.nim#L6
In the past we had to use an undocumented feature of nim.cfg to specify per-file flags, but it caused deployment issues in Laser/Arraymancer.
This one is a bit more tricky at the moment, until https://github.com/nim-lang/Nim/issues/10682 is solved. We would then be able to do this:
```nim
when defined(gcc) or defined(clang):
  {.pragma: avx2, codegenDecl: "__attribute__((__target__(\"avx2\"))) $# $#$#".}
else:
  {.pragma: avx2.}

proc bar_generic() =
  echo 1

proc bar_avx2() {.avx2.} =
  echo 1
```
This can already be done, but it overwrites the N_LIB_PRIVATE and N_INLINE annotations that Nim adds, so you need some macro magic to check whether the proc is exported or inline and rebuild the proper codegenDecl.
Unfortunately, Nim cannot rely on GCC function-attribute dispatch AFAIK: it requires the functions to have the same name in C. AFAIK that's possible via `{.exportc.}`, but Nim complains (to be confirmed, it's been a while).
Alternatively, the functions can share a wrapper that checks the supported SIMD at runtime before calling the proper one. This can be done via a static enum checked on function entry (say, on JSON parsing and JSON exporting). Example: https://github.com/mratsim/weave/blob/v0.3.0/benchmarks/matmul_gemm_blas/gemm_pure_nim/gemm_weave.nim#L217-L254 with:
```nim
type CPUFeatureX86* = enum
  x86_Generic,
  x86_SSE,
  x86_SSE2,
  x86_SSE4_1,
  x86_AVX,
  x86_AVX_FMA,
  x86_AVX2
```
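A hedged sketch of such a runtime-dispatch wrapper, reusing the enum above (`parseJsonAvx2` and `parseJsonGeneric` are hypothetical placeholders, not Weave's actual API):

```nim
import std/json

# Hypothetical specialized implementations; real ones would live in modules
# compiled with the matching -mavx2 / generic flags.
proc parseJsonAvx2(s: string): JsonNode = parseJson(s)
proc parseJsonGeneric(s: string): JsonNode = parseJson(s)

# In practice this would be set once at startup via cpuid-based detection.
var cpuFeature = x86_Generic

proc parseJsonDispatch(s: string): JsonNode =
  # Route each call to the most specialized implementation the CPU supports.
  case cpuFeature
  of x86_AVX2: parseJsonAvx2(s)
  else: parseJsonGeneric(s)
```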
All three dispatching techniques require the same amount of code duplication.
The SIMD PR here, https://github.com/nim-lang/Nim/pull/11816, is well done and I'm replacing my wrapper of Facebook's cpuinfo with it (cpuinfo being the best lightweight CPU feature detection package, short of needing NUMA or hyper-threading detection).
For what it's worth, at least for that kostya benchmark, SIMD is not so critical. See this PR and a test program. I get within 1.1x of D's fast and within 1.4x of Lemire's SIMD-optimized On Demand parser (>700 MiB/s on my machine). (A PGO build is very necessary: a 1.5x speed-up, as is often the case with Nim-generated C.)
The executive summary is just that the current `parsejson` does the same work in triplicate. While that last 1.4x might be nice someday, de-triplicating should probably be the initial focus.
I have made a faster JSON serializer and deserializer in pure Nim, no SIMD (https://github.com/treeform/jsony). Why is it faster?
Currently the Nim standard module first parses or serializes json into JsonNodes and then turns the JsonNodes into your objects with the to() macro. This is slower and creates unnecessary work for the garbage collector. My library skips the JsonNodes and creates the objects you want directly.
Another speedup comes from not using StringStream. Stream has a function dispatch overhead because it has to be able to switch between StringStream and FileStream at runtime. Jsony skips that overhead and just directly writes to memory buffers.
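To make the difference concrete, here is a minimal sketch (assuming jsony's `fromJson` API and std/json's `to` macro):

```nim
import std/json
import jsony

type Coord = object
  x, y, z: float

let data = """{"x":1.0,"y":2.0,"z":3.0}"""

# std/json: text -> JsonNode tree -> object (extra allocations for the tree).
let viaNodes = data.parseJson().to(Coord)

# jsony: text -> object directly, no intermediate JsonNode tree.
let direct = data.fromJson(Coord)

assert direct == viaNodes
```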
```
name ............................... min time avg time std dv times
treeform/jsony .................... 10.121 ms 10.809 ms ±1.208 x100
planetis-m/eminim ................. 12.006 ms 12.574 ms ±1.156 x100
nim std/json ...................... 23.282 ms 34.679 ms ±7.512 x100
```

```
name ............................... min time avg time std dv times
treeform/jsony ..................... 4.047 ms 4.172 ms ±0.225 x100
planetis-m/eminim .................. 7.173 ms 7.324 ms ±0.253 x100
disruptek/jason ................... 10.220 ms 11.155 ms ±0.689 x100
nim std/json ...................... 11.526 ms 15.181 ms ±0.857 x100
```
Here is the bench on my machine with nim-json-serialization:
However, at the moment it brings in Chronos and BearSSL as dependencies :/ https://github.com/status-im/nim-json-serialization/issues/25
@mratsim Thanks very much for pointing out the json_serialization library. I studied how it goes this fast and have implemented what I learned in mine:
```
name ............................... min time avg time std dv times
treeform/jsony ..................... 6.724 ms 12.047 ms ±3.694 x100
status-im/nim-json-serialization ... 7.119 ms 14.276 ms ±2.033 x100
nim std/json ...................... 24.141 ms 38.741 ms ±5.417 x100
planetis-m/eminim ................. 10.974 ms 18.355 ms ±3.994 x100
```

```
name ............................... min time avg time std dv times
treeform/jsony ..................... 1.531 ms 2.779 ms ±0.091 x100
status-im/nim-json-serialization ... 2.043 ms 3.448 ms ±0.746 x100
planetis-m/eminim .................. 5.951 ms 9.305 ms ±3.210 x100
disruptek/jason ................... 10.312 ms 13.471 ms ±3.107 x100
nim std/json ...................... 12.551 ms 19.419 ms ±4.039 x100
```
Lessons learned:
- It's all about moving memory in an optimal way and not using allocators for strings.
- Not using `parseInt` and `$` was a big win. I had to write code that prints and parses numbers directly into a string at an index. This is something Nim should probably provide. (https://github.com/treeform/jsony/blob/master/src/jsony.nim#L471)
- Comparing a string or a slice is also a lot slower than just comparing character by character. (https://github.com/treeform/jsony/blob/master/src/jsony.nim#L276)
- `.add` on a string can be really slow. It's faster to grow the string only once, then fill it, then cap it off to the right length (see the sketch after this list). (https://github.com/treeform/jsony/blob/master/src/jsony.nim#L524)
- Using static compile-time strings for keys was also a speedup.
- A `template` is always faster than an `{.inline.}` proc, even if they both just return a string.

I hope this is useful for anyone who wants to implement fast parsers.
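A minimal sketch of the grow-once pattern from the `.add` lesson above (`dumpInts` and the buffer sizing are illustrative, not jsony's code):

```nim
proc dumpInts(xs: seq[int]): string =
  # Reserve a worst-case buffer once, fill it in place by index, then cap
  # it off to the real length, instead of many reallocating `add` calls.
  result = newString(xs.len * 21)   # worst case per int: sign + 19 digits + ','
  var i = 0
  for x in xs:
    let s = $x                      # stand-in for a direct int-to-buffer writer
    copyMem(result[i].addr, s[0].unsafeAddr, s.len)
    i += s.len
    result[i] = ','
    inc i
  result.setLen(max(i - 1, 0))      # cap off: drop the trailing comma

echo dumpInts(@[1, 22, -3])         # -> 1,22,-3
```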
@treeform, I hope I'm not too forward, but I'd love to see you contributing to nim-json-serialization. I'm sure you'll be able to deliver similar optimisations there as well, because I haven't even tried to optimise the performance yet.
nim-json-serialization and jsony both share the key architectural trait that makes parsing fast: you skip the dynamic `JsonNode`s and instead directly populate a strongly typed object. I'd say that nim-json-serialization has some additional advantages:
1) It's part of the nim-serialization family of libraries conforming to the same standard API. We also have libraries for TOML, ProtoBuf, SSZ and other formats that can all be tested with a single generic test suite defined in nim-serialization.
2) Since we've used the library for a long time in many projects now, it had to learn to handle many tricky aspects of Nim such as case objects, inheritance, when statements in the record fields, etc. What's even better is that this knowledge is embedded in the parent nim-serialization library and as we improve it, it improves all of the formats together. The common serialization API allows many advanced features such as skipped or renamed fields and a lot of options when defining custom serializers.
3) nim-json-serialization builds on top of nim-faststreams instead of working with strings only. This makes the library significantly more versatile and you'll get the same stellar performance when working with files, sockets and other input sources.
I've also taken a quick glance at the jsony source code and I can recognize some early mistakes that I also went through while developing nim-json-serialization. For example, if you handle the built-in types with `when` statements instead of overloads, your users will face significantly less noisy error messages when dealing with overload visibility problems.
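For illustration, a minimal skeleton of that approach (`dump` and the `dumpCustom` hook are hypothetical names, not nim-json-serialization's API):

```nim
proc dump*[T](s: var string, v: T) =
  # Built-in types are handled inline via `when`; only custom types fall
  # through to overload resolution, keeping user-facing errors short.
  when T is bool:
    s.add(if v: "true" else: "false")
  elif T is SomeNumber:
    s.add $v
  elif T is string:
    s.add '"'
    s.add v
    s.add '"'
  else:
    dumpCustom(s, v)  # hypothetical hook that users overload for their types
```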
> Lessons learned:

A lot of these lessons I would consider "tribal knowledge"; I think it would help the whole community if they made their way into a more widely shared document, perhaps an article on nim-lang.org.
@treeform I would encourage you to do a write-up, but I understand if you don't have the time, so I created an issue on the website repo for this: https://github.com/nim-lang/website/issues/251. If anyone is interested in doing a write-up, please say so in the issue :)
There was a good article recently about unintuitive decisions Microsoft made with respect to string comparison algos in Windows. Maybe we should see if we can steal something from that when reimplementing strutils.
A few good reusable techniques are at play in https://github.com/samuell/gccontent-benchmark/pull/22, e.g. replacing `case c of ...` with an `array[low(char)..high(char), T]` lookup table where possible (a sketch follows the snippet below).

https://github.com/nim-lang/Nim/pull/18183 shows how a trivial code change can give a 20x performance speedup (in jsonutils deserialization) by using a template instead of a proc, which allows lazy parameter evaluation in cases like this:

```nim
checkJson ok, $(json.len, num, numMatched, $T, json)
# The 2nd param `msg` is only evaluated on failure if `checkJson` is a template,
# but not if it's a proc.
```
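For the `case c of ...` to lookup-table replacement mentioned above, a minimal sketch (the classifier and its contents are illustrative):

```nim
type CharKind = enum ckOther, ckDigit, ckQuote, ckBracket

# Built once at compile time; a single array index then replaces a whole
# `case c of '0'..'9': ... of '"': ...` comparison chain per character.
const kindOf = block:
  var t: array[char, CharKind]
  for c in '0'..'9': t[c] = ckDigit
  t['"'] = ckQuote
  for c in {'{', '}', '[', ']'}: t[c] = ckBracket
  t

echo kindOf['7']  # ckDigit
```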
As mentioned by @araq, we could port/adapt/get inspired by https://github.com/lemire/simdjson to improve performance in several places, e.g. std/json or other parsers. Here's the talk: https://www.youtube.com/watch?v=wlvKAT7SZIQ&list=PLSXYOVE4fQ0-wJlM5CrS1e8XP2ZzvpYCe&index=20&t=0s ; highly recommended.

It has a number of distinct, useful ideas to improve performance that could be used (not just for JSON parsing) as reusable components:

- avoiding branching (to avoid branch misprediction) using clever tricks (tiny sketch after this list)
- using SIMD, AVX-512, etc.
- adding performance regression tests (we don't do that in testament, IIRC)
- UTF-8 validation, e.g. `_mm256_subs_epu8(current_bytes, 244)` has no branches; 1 instruction checks 32 bytes at once
- dynamic runtime dispatch (with caching of which function pointer to use) to pick the most optimized function for a given processor/architecture, without having to recompile
- caveats on how to evaluate performance where branch prediction is involved: when using random data with a fixed seed (even with, say, 2000 iterations), the CPU might "learn" to perfectly predict the branches, which won't reflect performance on real data with no fixed pattern
- ARM NEON and x64 processors have instructions to look up 16-byte tables in a vectorized manner (16 values at a time): `pshufb`, `tbl` => could that be used to optimize `set[char]` operations? ([EDIT] fixed from `set[int]`)
- fast number parsing (number parsing is expensive: 10 branch misses per FP number) => https://youtu.be/wlvKAT7SZIQ?list=PLSXYOVE4fQ0-wJlM5CrS1e8XP2ZzvpYCe&t=2368
- etc...
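A tiny flavor of the branch-avoidance idea in plain Nim (whether the final machine code is truly branchless is up to the C backend, so treat this as illustrative):

```nim
proc countDigits(s: string): int =
  # Accumulate the comparison as a 0/1 value instead of branching per char;
  # C compilers typically emit a flag-set + add here rather than a jump.
  for c in s:
    result += ord(c >= '0' and c <= '9')
```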
links