Closed: AaronO closed this PR 1 year ago
Thanks for the PR! That's cool to see such an improvement! I wonder what it is, is it removing the increment on each access? I intuitively would have assumed this would be slower in the longer requests, since it's copying a lot of the bytes into a stack array...
(I also notice that using const generics increases the MSRV significantly.)
I'll tweak it to avoid bumping MSRV (const generics aren't really core).
Regarding performance, the previous code probably took ~7 instructions per byte (bounds checking, modifying `pos` for each input byte, etc.). With the new code, the 8 bytes of input essentially get loaded into a single 64-bit register, followed by a single equality check (I believe rustc/LLVM actually codegens 3 compares here: one for the shared prefix of the match arms and then a 1-byte cmp for each remaining branch).
The current code definitely isn't optimal; the PR started as a "drive-by" cleanup. I'll possibly do a performance-centric follow-up PR.
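To make the idea concrete, here is a minimal sketch (not the actual httparse code, and the function name and return type are illustrative) of parsing the HTTP version by comparing a fixed-size 8-byte chunk in one shot, instead of advancing a position counter per byte:

```rust
// Hedged sketch: compare the version prefix as an 8-byte array so the
// compiler can load it as a single 64-bit value and emit a handful of
// compares, rather than 7+ instructions per byte.
fn parse_version(buf: &[u8]) -> Option<u8> {
    // Reinterpret the first 8 bytes as a fixed-size array reference;
    // returns None if the buffer is too short.
    let eight: &[u8; 8] = buf.get(..8)?.try_into().ok()?;
    match eight {
        b"HTTP/1.1" => Some(1),
        b"HTTP/1.0" => Some(0),
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_version(b"HTTP/1.1 200 OK"), Some(1));
    assert_eq!(parse_version(b"HTTP/1.0 200 OK"), Some(0));
    assert_eq!(parse_version(b"HTTP"), None);
    println!("ok");
}
```

Because the two accepted arms share the 7-byte prefix `HTTP/1.`, the optimizer can check the prefix once and then branch on the final byte, which matches the "shared prefix + 1-byte cmp per branch" codegen described above.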
I briefly played around with other perf improvements, and have so far achieved a ~20-25% improvement over master (benched on an M1 Air, so without SIMD, though NEON might be worth exploring).
## master
test req/req ... bench: 241 ns/iter (+/- 2)
test req_short/req_short ... bench: 32 ns/iter (+/- 1)
test resp/resp ... bench: 228 ns/iter (+/- 3)
test resp_short/resp_short ... bench: 40 ns/iter (+/- 0)
## PR #120 + other tweaks
test req/req ... bench: 181 ns/iter (+/- 2)
test req_short/req_short ... bench: 25 ns/iter (+/- 4)
test resp/resp ... bench: 183 ns/iter (+/- 6)
test resp_short/resp_short ... bench: 32 ns/iter (+/- 0)
(using --output-format=bencher for brevity)
I have a few other ideas for further improvements, happy to throw up PRs if you're interested
Improved throughput by +50-60% relative to master:
req/req     time:   [147.23 ns 147.42 ns 147.65 ns]
            thrpt:  [4.5414 GiB/s 4.5487 GiB/s 4.5543 GiB/s]
    change:
            time:   [-38.665% -38.497% -38.343%] (p = 0.00 < 0.05)
            thrpt:  [+62.187% +62.592% +63.039%]
            Performance has improved.
resp/resp   time:   [153.49 ns 153.92 ns 154.62 ns]
            thrpt:  [4.2103 GiB/s 4.2295 GiB/s 4.2413 GiB/s]
    change:
            time:   [-32.746% -32.537% -32.220%] (p = 0.00 < 0.05)
            thrpt:  [+47.535% +48.229% +48.691%]
            Performance has improved.
I don't have much time today to explore further, but a 50% reduction in time, and thus a 2x increase in throughput, might be possible.
This is phenomenal work, thank you!
@seanmonstar we'd love to try this change in our new server that uses httparse; any chance you could release a new version?
Yep!
> Improved throughput by +50-60% relative to master:
@bartlomieju This doesn't include all the improvements mentioned in the latest comment, those were additional improvements I've implemented locally.
This PR mainly optimizes parse_version, shaving ~10% off req_short; my local changes optimize header & URL parsing (since all the other parts are roughly constant size) and achieve a -40% reduction in time, or +60% in throughput.
I've tested further changes that allow me to achieve over 5 GiB/s, roughly 2x faster than the previous master.
Ah, makes sense. @seanmonstar I'm fine waiting another few days for more changes like this!
Though, it doesn't cost much to publish a release. I can publish one now, and merge more improvements later. Or if you nearly have them ready, I can hold off, whichever you prefer.
@seanmonstar thanks, if that's not a big deal then I'd kindly ask to release now :)
Also improves performance, mainly parse_version; we observe a ~10% improvement on req_short.