Closed: AaronO closed this PR 1 year ago
Thanks for the PR! That's cool to see such an improvement! I wonder what it is, is it removing the increment on each access? I intuitively would have assumed this would be slower in the longer requests, since it's copying a lot of the bytes into a stack array...
(I also notice that using const generics increases the MSRV significantly.)
I'll tweak it to avoid bumping MSRV (const generics aren't really core).
Regarding performance, the previous code probably took ~7 instructions per byte (bounds checking, modifying `pos` for each input byte, etc.). With the new code, the 8 bytes of input essentially get loaded into a single 64-bit register, followed by a single equality check (I believe rustc/LLVM actually codegens 3 compares here: one for the shared prefix of the match arms and then a 1-byte cmp for each remaining branch).
The current code definitely isn't optimal; the PR started as a "drive-by" cleanup. I'll possibly do a performance-centric follow-up PR.
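To make the idea concrete, here is a minimal sketch (not the actual httparse code, and the function name and return type are illustrative) of parsing the HTTP version by comparing a fixed-size 8-byte chunk in one shot, instead of advancing a position counter per byte:

```rust
// Hedged sketch: compare the version prefix as an 8-byte array so the
// compiler can load it as a single 64-bit value and emit a handful of
// compares, rather than 7+ instructions per byte.
fn parse_version(buf: &[u8]) -> Option<u8> {
    // Reinterpret the first 8 bytes as a fixed-size array reference;
    // returns None if the buffer is too short.
    let eight: &[u8; 8] = buf.get(..8)?.try_into().ok()?;
    match eight {
        b"HTTP/1.1" => Some(1),
        b"HTTP/1.0" => Some(0),
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_version(b"HTTP/1.1 200 OK"), Some(1));
    assert_eq!(parse_version(b"HTTP/1.0 200 OK"), Some(0));
    assert_eq!(parse_version(b"HTTP"), None);
    println!("ok");
}
```

Because the two accepted arms share the 7-byte prefix `HTTP/1.`, the optimizer can check the prefix once and then branch on the final byte, which matches the "shared prefix + 1-byte cmp per branch" codegen described above.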
I briefly played around with other perf improvements, and have so far achieved a ~20-25% improvement over master (benched on an M1 Air, so without SIMD, though NEON might be worth exploring).
## master
test req/req ... bench: 241 ns/iter (+/- 2)
test req_short/req_short ... bench: 32 ns/iter (+/- 1)
test resp/resp ... bench: 228 ns/iter (+/- 3)
test resp_short/resp_short ... bench: 40 ns/iter (+/- 0)
## PR #120 + other tweaks
test req/req ... bench: 181 ns/iter (+/- 2)
test req_short/req_short ... bench: 25 ns/iter (+/- 4)
test resp/resp ... bench: 183 ns/iter (+/- 6)
test resp_short/resp_short ... bench: 32 ns/iter (+/- 0)
(using --output-format=bencher for brevity)
I have a few other ideas for further improvements, happy to throw up PRs if you're interested
Improved throughput by +50-60% relative to master:
req/req     time:   [147.23 ns 147.42 ns 147.65 ns]
            thrpt:  [4.5414 GiB/s 4.5487 GiB/s 4.5543 GiB/s]
    change:
            time:   [-38.665% -38.497% -38.343%] (p = 0.00 < 0.05)
            thrpt:  [+62.187% +62.592% +63.039%]
            Performance has improved.
resp/resp   time:   [153.49 ns 153.92 ns 154.62 ns]
            thrpt:  [4.2103 GiB/s 4.2295 GiB/s 4.2413 GiB/s]
    change:
            time:   [-32.746% -32.537% -32.220%] (p = 0.00 < 0.05)
            thrpt:  [+47.535% +48.229% +48.691%]
            Performance has improved.
I don't have much time today to explore further, but a 50% reduction in time, and thus a 2x increase in throughput, might be possible.
This is phenomenal work, thank you!
@seanmonstar we'd love to try this change in our new server that uses httparse; any chance you could release a new version?
Yep!
> Improved throughput by +50-60% relative to master:
@bartlomieju This doesn't include all the improvements mentioned in the latest comment, those were additional improvements I've implemented locally.
This PR mainly optimizes parse_version, shaving ~10% off req_short; my local changes optimize header & URL parsing (since all the other parts are roughly constant size) and achieve a -40% reduction in time, or +60% in throughput.
I've tested further changes that allow me to achieve over 5 GiB/s, roughly 2x faster than the previous master.
Ah, makes sense. @seanmonstar I'm fine waiting another few days for more changes like this!
Though, it doesn't cost much to publish a release. I can publish one now, and merge more improvements later. Or if you nearly have them ready, I can hold off, whichever you prefer.
@seanmonstar thanks, if that's not a big deal then I'd kindly ask to release now :)
Also improves performance, mainly parse_version; we observe a ~10% improvement on req_short.