nodejs / llparse

Generating parsers in LLVM IR
http://llparse.org
Other
584 stars 30 forks source link

Runtime SSE4.2 dispatch #26

Closed zbjornson closed 4 years ago

zbjornson commented 5 years ago

Allows SSE4.2 to be used at runtime from a single build (i.e. Node.js's build, which AFAIK doesn't globally enable SSE4.2 at compile time, would let SSE4.2 be used at runtime). https://godbolt.org/z/i_6qgP demos the concept with a bunch of compilers.

This is sort of awful but wanted something concrete to discuss.

My initial goal was to get llparse to emit a tableLookup function that itself could be multiversioned, so there's no (non-inlinable) fn call inside of a loop. However, llparse seems to be node0-oriented, whereas runtime dispatching requires control over generation of functions: notice in the godbolt link that both a fn declaration and definition are required for each CPU feature set (base, sse4.2), plus a dispatcher function. So for now, there's an ugly, tiny pcmpestri wrapper function that gets called from the loop.

The impact of having the call in a loop isn't as bad as I expected when SSE4.2 is available, but it's heavy when SSE4.2 is not. Using a static int to cache the no-sse4.2 return status inside the lookup function improves the no-SSE4.2 result some (+~200 MB/s), but impacts the SSE4.2 results (-~100 MB/s). (edit I think the big hit to the non-SSE4.2 perf is because that branch is being checked for every single character. That might be possible to improve with branch hints/code reorg, although that itself is hard to do with llparse.)

Test (mb/s) This, SSE This, base master, SSE master, base
url loose 1060.27 1363.60 1579.95 1584.91
url strict 1588.11 1503.11 1511.61 1572.31
http L: "seanmonstar/httparse" 1452.94 326.26 1623.52 1019.67
http S: "seanmonstar/httparse" 1503.82 336.68 1536.25 1070.23
http L: "nodejs/http-parser" 1093.67 300.00 1246.01 909.21
http S: "nodejs/http-parser" 1118.91 347.50 1136.70 919.46

Figuring out a clean approach to dispatching would be good. There are other CPU features that could be used in the future for e.g. fast CRLF searching.

0node, as in AST node 1called findRanges1 so that a findRanges2 could be made that has two pcmpestris to reduce overhead

indutny commented 5 years ago

Hello!

Thanks for testing it. The results are very interesting!

It looks like what I was afraid of has manifested itself very clearly in benchmarks. The dynamic dispatch takes a lot of time to do. The apparently problem is that the check always runs.

JIT compilers usually solve this by hot-patching the call-sites to short-circuit them into either of supported modes of operation (SSE or no-SSE), but this is very problematic for C language.

I'm not sure if I have any constructive ideas or suggestions on how to move forward with this experiment. Sorry!

zbjornson commented 5 years ago

I take it there's no Node subclass or something that always emits a function instead of emitting source that's inlined into another function? This can be made ~zero-cost if there is.

Do you have plans for how to make the pcmpestri change benefit "normal" Node.js users, or will it be limited to users who build their own Node.js with custom -m flags?

zbjornson commented 5 years ago

(It looks like helpers/match-sequence.ts emits a standalone function, for example. Making a helpers/table-lookup.ts could work if that pattern is reusable.)