Open ngrodzitski opened 1 year ago
Here are the details behind the change: https://gist.github.com/ngrodzitski/f9de8fb60ae44e7c5455f9b78a2e4134
@anonrig @RafaelGSS I'm not familiar with SSE. Can you please chime in here?
@ShogunPanda I'll have a look when I am back in Montreal.
This is very reasonable algorithmically. @ngrodzitski is using vectorized classification... See section "3.1.2 Vectorized Classification" in Parsing Gigabytes of JSON per Second. It is used extensively within simdutf, for example. The current implementation is specific to x64: one might want to extend it to other architectures like NEON (though it is not a limitation specific to this PR).
My question at this stage is whether any of this matters when built in Node.js because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may not be called (ever).
If I am correct, then this work might be in vain, as far as Node.js is concerned.
So before I review more carefully the PR, I'd be interested in knowing whether we are making sure that this code is actually being put to good use.
If not, then that's what we should fix first. (It is easily fixable, thankfully.)
@ngrodzitski : are your benchmarks relevant to Node.js, or did you manually compile the code with your own flags?
> My question at this stage is whether any of this matters when built in Node.js because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may not be called (ever).
That's correct. That would need to be done in `deps/llhttp` first so we could technically include it in Node.js, but I'm almost sure it would need a new dist creation (or, at the very least, updating the current build flow) to have this support.
I assume this is a huge effort, right @richardlau?
@lemire You are right, the source from which I learned this technique is simdjson. I've only added some nice automation with Z3 so that this works for a generic LUT (when one is possible at all).
I can't say for Node.js, but I've recently ported restinio (which initially used the old nodejs/http_parser) to llhttp, so I've given llhttp a closer look, which eventually brought me here: my interest is that llhttp is fast (and correct, of course).
So for the performance check I use the llhttp repo, which has a simple benchmark in it.
Here is how I did it:
So my data (20% and 11%) is for llhttp.
> My question at this stage is whether any of this matters when built in Node.js because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may not be called (ever).
>
> That's correct. That would need to be done in `deps/llhttp` first so we could technically include it in Node.js, but I'm almost sure it would need a new dist creation (or, at the very least, updating the current build flow) to have this support. I assume this is a huge effort, right @richardlau?
I'm not entirely sure what is being asked. I'm not very familiar with llhttp and how it is generated using llparse -- for Node.js the gyp files used to build llhttp are:
I meant, in case we want to support sse4.2, we'll need to build llhttp with that support in Node.js, and considering this might be an arch-specific build, it would need to be a different distributed binary in the release section, right? If so, it would be hard to support, I guess, mostly due to the maintenance burden.
I don't think we'd need an arch-specific build -- I think usually the dependency (in this case llhttp) would have to detect and select the code to use at runtime. I don't know how easy that will be to do in llhttp, so "huge effort" may or may not be correct (although I'd expect most of the effort to be on the llhttp side rather than the Node.js side). Searching for sse4.2 in Node.js shows hits in base64 and zlib (although https://github.com/nodejs/node/commit/ae115d68e010cf8860e15a5e8259e7dc520b26a5 commented this out for zlib).
And to follow up, the sse4.2 code would need to be in a separate compilation unit (i.e. .c file).
@richardlau Yes. You definitely do not want to pass an architecture-specific flag to the compiler when building Node.js, unless you want to build your own private version.
> the sse4.2 code would need to be in a separate compilation unit (i.e. .c file).
That's one way to do it. If you look at simdutf (which is already in Node.js), you will find that there is just one compilation unit (literally just one .cpp).
In any case, it is solvable here as well if we care.
> If so, it should be hard to support I guess - mostly due to the maintenance burden.
See simdutf for a reference on how you can make it easy to build (just one .cpp or .c file would do).
@RafaelGSS Do you have a good http benchmark already that could be used if one wants to demonstrate benefits?
> @RafaelGSS Do you have a good http benchmark already that could be used if one wants to demonstrate benefits?
I'm not sure about the impacts of this optimization, but I believe https://github.com/nodejs/node/blob/main/benchmark/http/simple.js should be enough if that's a general parsing optimization. Using framework benchmarks should be good as well (see: https://github.com/fastify/benchmarks)
This commit optimizes the scan of non-state-changing bytes using SSE2 instructions.
A `_mm_cmpestri` operation appears to be quite slow compared to an alternative approach: apply `_mm_shuffle_epi8` to the low and high nibbles of the input and bitwise-AND the results, which yields 16 bytes of LUT lookups in one go (it also involves a handful of other SSE operations, all of which have good latency/throughput properties). The resulting 16 bytes of LUT output can then be analyzed (also vectorized) to get the index of the first byte (if any) that changes the state: that is the first byte whose LUT result is zero.
The tricky part here is the following:

To find `A` and `B` satisfying the above conditions, the Z3 library is used. The npm package that wraps Z3 for use in TypeScript is not particularly friendly to the author of this change, so another package (synckit) was required to handle the async API of the Z3 wrapper.

Using llhttp as a benchmark framework, this change shows the following improvements:
For messages heavier on header fields, the numbers might be even more convincing.