Open ngrodzitski opened 1 year ago
Here are the details behind the change: https://gist.github.com/ngrodzitski/f9de8fb60ae44e7c5455f9b78a2e4134
@anonrig @RafaelGSS I'm not familiar with SSE. Can you please chime in here?
@ShogunPanda I'll have a look when I am back in Montreal.
This is very reasonable algorithmically. @ngrodzitski is using vectorized classification... See section "3.1.2 Vectorized Classification" in Parsing Gigabytes of JSON per Second. It is used extensively within simdutf, for example. The current implementation is specific to x64: one might want to extend it to other architectures like NEON (though it is not a limitation specific to this PR).
My question at this stage is whether any of this matters when built in Node.js because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may not be called (ever).
If I am correct, then this work might be in vain, as far as Node.js is concerned.
So before I review more carefully the PR, I'd be interested in knowing whether we are making sure that this code is actually being put to good use.
If not, then that's what we should fix first. (It is easily fixable, thankfully.)
@ngrodzitski : are your benchmarks relevant to Node.js, or did you manually compile the code with your own flags?
> My question at this stage is whether any of this matters when built in Node.js because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may not be called (ever).
That's correct. That would need to be done in `deps/llhttp` first so we could technically include it in Node.js, but I'm almost sure it would need a new dist creation (or, at the very least, updating the current build flow) to have this support.
I assume this is a huge effort, right @richardlau?
@lemire You are right, the source from which I learned this technique is simdjson. I've only added some nice automation with Z3 so that this works for a generic LUT (when one is possible at all).
I can't say for Node.js, but I've recently ported restinio (which initially used the old nodejs/http_parser) to llhttp, so I've given llhttp a closer look, which eventually brought me here: my interest is that llhttp is fast (and correct, of course).
So for the performance check I use the llhttp repo, which has a simple benchmark in it.
Here is how I did it:
So my data (20% and 11%) is for llhttp.
> My question at this stage is whether any of this matters when built in Node.js because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may not be called (ever).
>
> That's correct. That would need to be done in `deps/llhttp` first so we could technically include it in Node.js, but I'm almost sure it would need a new dist creation (or, at the very least, updating the current build flow) to have this support. I assume this is a huge effort, right @richardlau?
I'm not entirely sure what is being asked. I'm not very familiar with llhttp and how it is generated using llparse -- for Node.js the gyp files used to build llhttp are:
I meant, in case we want to support sse4.2, we'll need to build llhttp with that support in Node.js, and considering this might be an arch-specific build, it would need to be a different distributed binary in the release section, right? If so, it would be hard to support, I guess, mostly due to the maintenance burden.
I don't think we'd need an arch-specific build -- I think usually the dependency (in this case llhttp) would have to detect and select the code to use at runtime. I don't know how easy that will be to do in llhttp, so "huge effort" may or may not be correct (although I'd expect most of the effort to be on the llhttp side rather than the Node.js side). Searching for sse4.2 in Node.js shows hits in base64 and zlib (although https://github.com/nodejs/node/commit/ae115d68e010cf8860e15a5e8259e7dc520b26a5 commented this out for zlib).
And to follow up, the sse4.2 code would need to be in a separate compilation unit (i.e. .c file).
@richardlau Yes. You definitely do not want to pass an architecture-specific flag to the compiler when building Node.js, unless you want to build your own private version.
> the sse4.2 code would need to be in a separate compilation unit (i.e. .c file).
That's one way to do it. If you look at simdutf (which is already in Node.js), you will find that there is just one compilation unit (literally just one .cpp).
In any case, it is solvable here as well if we care.
> If so, it should be hard to support I guess - mostly due to the maintenance burden.
See simdutf for a reference on how you can make it easy to build (just one .cpp or .c file would do).
@RafaelGSS Do you have a good http benchmark already that could be used if one wants to demonstrate benefits?
> @RafaelGSS Do you have a good http benchmark already that could be used if one wants to demonstrate benefits?
I'm not sure about the impacts of this optimization, but I believe https://github.com/nodejs/node/blob/main/benchmark/http/simple.js should be enough if that's a general parsing optimization. Using framework benchmarks should be good as well (see: https://github.com/fastify/benchmarks)
This commit optimizes the scan of non-state-changing bytes using SSE2 instructions.
A `_mm_cmpestri` operation appears to be quite slow compared to an alternative approach: apply `_mm_shuffle_epi8` to the low and high nibbles of the input and bitwise-AND the results, which yields 16 bytes of LUT lookups in one go (it also involves a handful of other SSE operations, all of which have good latency/throughput properties). The resulting 16 bytes of LUT output can then be analyzed (also vectorized) to get the index of the first byte (if any) that changes the state: that is the first byte whose LUT result is zero.
The tricky part here is the following:

To find `A` and `B` satisfying the above conditions, the Z3 library is used. The npm package that wraps Z3 for use in TypeScript is not particularly friendly to the author of this change, so another package (synckit) was required to handle the async API of the Z3 wrapper.

Using llhttp as a benchmark framework, this change shows the following improvements:
For messages heavier on header fields, the numbers might be even more convincing.