lucab opened this pull request 5 months ago
Note: cursory benches for this on M1 / arm64 show a ~10% perf regression vs main.
@seanmonstar I can take a deeper look at this before we reconsider landing it. I haven't looked into it in depth, but there could be gains for `avx2` SIMD from this approach, because it avoids reprocessing an entire/last block on a SIMD miss and lets the SWAR or byte-wise code pick up where the SIMD path left off.
Since `avx2` operates on 32B/256-bit blocks, it possibly "misses" a lot on header values, short URLs, etc.
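To make the "miss" behaviour concrete, here is a simplified scalar model of block-based scanning; this is a hand-written sketch with assumed names, not httparse's actual SIMD code. The block path only advances in whole 32-byte steps, so any block containing a delimiter (and any input shorter than one block) falls through to the byte-wise loop:

```rust
const BLOCK: usize = 32; // AVX2 operates on 32-byte (256-bit) lanes

fn scan(bytes: &[u8], is_token: impl Fn(u8) -> bool) -> usize {
    let mut i = 0;
    // Block path: consume a block only if every byte in it matches.
    while i + BLOCK <= bytes.len() {
        if bytes[i..i + BLOCK].iter().all(|&b| is_token(b)) {
            i += BLOCK;
        } else {
            break; // miss: the whole block falls through to the byte loop
        }
    }
    // Byte-wise tail: also handles inputs shorter than one block, which
    // is why short header values never benefit from the block path.
    while i < bytes.len() && is_token(bytes[i]) {
        i += 1;
    }
    i
}
```

A 9-byte value like `text/html` never enters the block path at all, and a value that ends mid-block gets its last block scanned twice.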
So a variant of this is definitely worth considering, but I think it should be rigorously benchmarked, given that the numbers shared in #156 were unclear... @seanmonstar it's probably worth adding CI benches (which can be noisy, but probably not meaningfully so for these purely compute-bound flows).
Also, a big factor here could be `avx2` being inlined vs dispatched at runtime (when generically compiled for x64); the inlined version compiled for avx2 will obviously have less overhead.
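As a rough illustration of that overhead (assumed names; not httparse's actual dispatch code): with runtime dispatch the specialised implementation is selected once and then always invoked through a function pointer, which the compiler cannot inline into the caller's hot loop.

```rust
use std::sync::OnceLock;

type ScanFn = fn(&[u8]) -> usize;

// Portable fallback; a real dispatcher would also have an
// AVX2-specialised variant to select from.
fn scan_fallback(bytes: &[u8]) -> usize {
    bytes.iter().take_while(|&&b| b != b':').count()
}

fn scan_dispatched(bytes: &[u8]) -> usize {
    static CHOSEN: OnceLock<ScanFn> = OnceLock::new();
    // A real dispatcher would consult is_x86_feature_detected!("avx2")
    // here; this sketch always picks the fallback.
    let f = CHOSEN.get_or_init(|| scan_fallback as ScanFn);
    // Indirect call: opaque to the optimizer, so no inlining.
    f(bytes)
}
```

When the crate is compiled directly for an AVX2 target (e.g. `-C target-cpu`), the call can instead be a plain, inlinable function with none of this indirection.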
@lucab Also, are all the `pub` => `pub(crate)` changes necessary?
> @seanmonstar probably worth adding CI benches (which can be noisy, but probably not meaningfully so for these purely compute-bound flows)
Did a first pass at CI benches: https://github.com/seanmonstar/httparse/pull/179, this PR should show a measurable delta
@AaronO from https://github.com/seanmonstar/httparse/pull/156#issuecomment-1980339794:
> There is a minor cleanup bundled in this PR (marking several functions as `pub(crate)`) which I did in order to make sure I wasn't changing public APIs. I can split that to a dedicated PR if you prefer.
@lucab Yes, I think that should be split out. Especially since the original PR made substantial (AVX2) perf claims, it's best if we can focus on that and remove unrelated "noise"; the alt refactor should be judged on its own merits. The focus here should be demonstrating and explaining clear perf benefits, and from the benchmarks (local [arm64] & CI) that appears unclear.
@lucab Modern CPUs with speculative execution can counterintuitively be faster "processing the same bytes twice" if both execution paths are independent. There can also be tradeoffs impacting inline-ability if you increase register usage in the hot loop, etc...
I touched this code a while back so I'm not 100% sure, but I think I explored your change (if I'm distilling it correctly to returning partial matches) and concluded (based off local M1 benches at the time) that it wasn't a net positive.
For example, your original PR #156 explains the improvement with:
> This refactors all SIMD modules in order to make the value-matching logic self-contained. Thus, all bytes-cursor manipulations are now grouped and performed once at the end, outside of SIMD logic.
I should double-check, but IIRC minimizing the number of times you update the bytes-cursor sounds like a plausible improvement, yet in practice it impacts lowering. What actually happens here is that the "stack-allocated" `Bytes` struct gets allocated into 3 registers (start, cursor, end); we're always reading from cursor, and with your code I think it gets lowered as basepointer+offset, or ends up requiring an extra register for that inner loop.
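A hypothetical sketch of the two cursor-update styles under discussion (assumed names; not httparse's real `Bytes` API). Style 1 mutates the cursor inside the matching loop, keeping it live in a register; style 2 is the self-contained variant that returns a match length for the caller to apply once at the end:

```rust
struct Bytes<'a> {
    slice: &'a [u8],
    pos: usize, // the cursor
}

impl<'a> Bytes<'a> {
    // Style 1: the matcher advances the cursor itself as it goes, so the
    // cursor is read and written inside the inner loop.
    fn advance_while(&mut self, pred: impl Fn(u8) -> bool) {
        while self.pos < self.slice.len() && pred(self.slice[self.pos]) {
            self.pos += 1;
        }
    }

    // Style 2: the matcher is self-contained and returns a partial-match
    // length; the caller applies it to the cursor once, afterwards.
    fn match_len(&self, pred: impl Fn(u8) -> bool) -> usize {
        self.slice[self.pos..].iter().take_while(|&&b| pred(b)).count()
    }
}
```

Both compute the same result; the question raised above is purely how each shape lowers to registers and addressing modes in the hot loop.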
Long story short, we should really distill this perf improvement hypothesis to its essentials and fully test/understand that.
@lucab Ok, looking at the CI benches there does appear to be a beyond-noise improvement for `sse42` (not `avx2`, but possibly I missed something in the CI benches).
It smells like it might boil down to the inline-ability of `sse42`. I'd need to double check; I'll take a deeper look tomorrow.
In part it's a perf bug vs purely raising the ceiling: especially for small values, `sse42` was substantially underperforming `swar`, possibly due to (not-)inlining (if `sse42` is called non-inlined in the loop you pay some painful register-init costs), or it could be register allocation or something else.
@AaronO Ack, I'll split the `pub` cleanup out, as I was originally offering to.
I've marked this PR as a draft, as I think we aren't going to land this as-is anyway.
For context, I wasn't really looking at perf improvements when I opened this, but it happened as a local side effect:
My goal was to (eventually) use the `bytes` crate in this library, but then I realized that the required SIMD-related groundwork was actually providing performance improvements on its own.
This reapplies https://github.com/seanmonstar/httparse/pull/156 with an additional commit on top in order to fix https://github.com/seanmonstar/httparse/issues/172.