Drastically improving throughput on larger inputs (3x+ for large URIs or header-values)
There are 2 optimizations in this PR:
Removing two unnecessary instructions when computing trailizing_zeros / bytes-validated.
We don't need to or the upper half of the register with xFF we can instead compute trailing-zeros on the meaningful bits by using eax (u32) instead of rax (u64) and ax (u16) instead of eax (u32) for AVX2 and SSE4.2 respectively.
Correctly scoping target_feature pragmas to allow SIMD validators to be inlined, so when looped we benefit from greater register reuse etc... See:
Drastically improving throughput on larger inputs (3x+ for large URIs or header-values)
There are 2 optimizations in this PR:
or
the upper half of the register withxFF
we can instead compute trailing-zeros on the meaningful bits by usingeax
(u32
) instead ofrax
(u64
) andax
(u16
) instead ofeax
(u32
) for AVX2 and SSE4.2 respectively.target_feature
pragmas to allow SIMD validators to be inlined, so when looped we benefit from greater register reuse etc... See:Benchmarks
Summary table
(Disclaimer: aggregated by ChatGPT, which "computed" the ratio rows which aren't exactly correct but close enough)
Raw benches