Closed AaronO closed 1 year ago
This is awesome! Do you know how new of a compiler it requires? The MSRV CI job is grumpy 🤷 .
This is awesome! Do you know how new of a compiler it requires? The MSRV CI job is grumpy 🤷 .
Sorry for the delayed response, I've been traveling. As @paolobarbolini pointed out the new MSRV would be 1.44, though we could avoid an MSRV bump and stick to 1.36 with some extra or less-idomatic code (see latest commit)
@seanmonstar It would be great to land #122 (the benchmarks & baseline) first, then I can polish this a little more before merging and releasing
@seanmonstar Update here:
With the latest changes this PR nearly doubles throughput on ARM, with no real downsides.
On x86_64 non-SIMD is now roughly on-par with SIMD in fact faster for smaller reqs/resps.
test req/req ... bench: 219 ns/iter (+/- 1)
test req_short/req_short ... bench: 28 ns/iter (+/- 0)
test resp/resp ... bench: 208 ns/iter (+/- 2)
test resp_short/resp_short ... bench: 35 ns/iter (+/- 0)
test req/req ... bench: 112 ns/iter (+/- 1)
test req_short/req_short ... bench: 26 ns/iter (+/- 0)
test resp/resp ... bench: 117 ns/iter (+/- 8)
test resp_short/resp_short ... bench: 28 ns/iter (+/- 0)
(Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz)
test req/req ... bench: 309 ns/iter (+/- 12)
test req_short/req_short ... bench: 66 ns/iter (+/- 5)
test resp/resp ... bench: 326 ns/iter (+/- 14)
test resp_short/resp_short ... bench: 66 ns/iter (+/- 5)
test req/req ... bench: 297 ns/iter (+/- 18)
test req_short/req_short ... bench: 67 ns/iter (+/- 3)
test resp/resp ... bench: 317 ns/iter (+/- 100)
test resp_short/resp_short ... bench: 66 ns/iter (+/- 7)
test req/req ... bench: 491 ns/iter (+/- 25)
test req_short/req_short ... bench: 65 ns/iter (+/- 2)
test resp/resp ... bench: 490 ns/iter (+/- 19)
test resp_short/resp_short ... bench: 82 ns/iter (+/- 3)
test req/req ... bench: 318 ns/iter (+/- 16)
test req_short/req_short ... bench: 54 ns/iter (+/- 2)
test resp/resp ... bench: 324 ns/iter (+/- 13)
test resp_short/resp_short ... bench: 62 ns/iter (+/- 3)
We're eagerly waiting for this PR in Deno 🙏
Revisiting this, I took a look at the SIMD code and noticed inefficiencies on larger inputs and with a few tweaks see 2-3x improvements for header-values/URIs greater than 128b, will submit another PR.
128b+ header-values are more common in production than you would expect (and obviously we would want to scale gracefully anyway), I don't have hard data on real-world distributions but anecdotally checked my Chrome headers on Google.com, Cookie is 1024b+ and a handful are 128b+
(Disclaimer: aggregated by ChatGPT, which "computed" the ratio rows which aren't exactly correct but close enough)
Test | 128b | 256b | 512b | 1024b | 2048b | 4096b |
---|---|---|---|---|---|---|
Before | ||||||
Header | 38 | 66 | 123 | 263 | 484 | 946 |
URI | 19 | 44 | 116 | 237 | 465 | 937 |
After | ||||||
Header | 30 | 39 | 55 | 88 | 193 | 300 |
URI | 12 | 20 | 35 | 65 | 127 | 270 |
Improvement | ||||||
Header Ratio | ~1.5x | ~1.5x | ~2.0x | ~3.0x | ~2.5x | ~3.0x |
URI Ratio | ~1.5x | ~2.0x | ~3.5x | ~3.5x | ~3.5x | ~3.5x |
before:
test header/value_128b ... bench: 38 ns/iter (+/- 3)
test header/value_256b ... bench: 66 ns/iter (+/- 0)
test header/value_512b ... bench: 123 ns/iter (+/- 2)
test header/value_1024b ... bench: 263 ns/iter (+/- 13)
test header/value_2048b ... bench: 484 ns/iter (+/- 19)
test header/value_4096b ... bench: 946 ns/iter (+/- 7)
test uri/uri_128b ... bench: 19 ns/iter (+/- 3)
test uri/uri_256b ... bench: 44 ns/iter (+/- 1)
test uri/uri_512b ... bench: 116 ns/iter (+/- 1)
test uri/uri_1024b ... bench: 237 ns/iter (+/- 3)
test uri/uri_2048b ... bench: 465 ns/iter (+/- 3)
test uri/uri_4096b ... bench: 937 ns/iter (+/- 58)
after:
test header/value_128b ... bench: 30 ns/iter (+/- 1)
test header/value_256b ... bench: 39 ns/iter (+/- 1)
test header/value_512b ... bench: 55 ns/iter (+/- 2)
test header/value_1024b ... bench: 88 ns/iter (+/- 4)
test header/value_2048b ... bench: 193 ns/iter (+/- 49)
test header/value_4096b ... bench: 300 ns/iter (+/- 4)
test uri/uri_128b ... bench: 12 ns/iter (+/- 3)
test uri/uri_256b ... bench: 20 ns/iter (+/- 0)
test uri/uri_512b ... bench: 35 ns/iter (+/- 1)
test uri/uri_1024b ... bench: 65 ns/iter (+/- 4)
test uri/uri_2048b ... bench: 127 ns/iter (+/- 2)
test uri/uri_4096b ... bench: 270 ns/iter (+/- 36)
@seanmonstar I would consider this PR good to land now, I'll start another PR with the SIMD improvements.
I have identified 2-3 other improvements that will help with "latency" and a cleanup, will try to get to those time permitting.
@seanmonstar The SIMD-related PR is up here: https://github.com/seanmonstar/httparse/pull/131
Small update, I've got a first-pass implementation of NEON support (based off this #123 and #132), the bench results are promising (M1 Air):
test req/req ... bench: 110 ns/iter (+/- 1)
test req_short/req_short ... bench: 26 ns/iter (+/- 0)
test resp/resp ... bench: 116 ns/iter (+/- 0)
test resp_short/resp_short ... bench: 28 ns/iter (+/- 0)
test uri/uri_1b ... bench: 1 ns/iter (+/- 0)
test uri/uri_2b ... bench: 2 ns/iter (+/- 0)
test uri/uri_4b ... bench: 3 ns/iter (+/- 0)
test uri/uri_8b ... bench: 2 ns/iter (+/- 0)
test uri/uri_16b ... bench: 2 ns/iter (+/- 0)
test uri/uri_32b ... bench: 2 ns/iter (+/- 0)
test uri/uri_64b ... bench: 4 ns/iter (+/- 0)
test uri/uri_128b ... bench: 7 ns/iter (+/- 0)
test uri/uri_256b ... bench: 14 ns/iter (+/- 0)
test uri/uri_512b ... bench: 26 ns/iter (+/- 0)
test uri/uri_1024b ... bench: 62 ns/iter (+/- 0)
test uri/uri_2048b ... bench: 112 ns/iter (+/- 3)
test uri/uri_4096b ... bench: 217 ns/iter (+/- 6)
test header/name_1b ... bench: 9 ns/iter (+/- 0)
test header/name_2b ... bench: 10 ns/iter (+/- 0)
test header/name_4b ... bench: 9 ns/iter (+/- 1)
test header/name_8b ... bench: 9 ns/iter (+/- 0)
test header/name_16b ... bench: 11 ns/iter (+/- 0)
test header/name_32b ... bench: 14 ns/iter (+/- 0)
test header/name_64b ... bench: 20 ns/iter (+/- 0)
test header/name_128b ... bench: 33 ns/iter (+/- 0)
test header/name_256b ... bench: 59 ns/iter (+/- 4)
test header/name_512b ... bench: 109 ns/iter (+/- 1)
test header/name_1024b ... bench: 220 ns/iter (+/- 2)
test header/name_2048b ... bench: 424 ns/iter (+/- 13)
test header/name_4096b ... bench: 832 ns/iter (+/- 10)
test header/value_1b ... bench: 9 ns/iter (+/- 0)
test header/value_2b ... bench: 11 ns/iter (+/- 0)
test header/value_4b ... bench: 13 ns/iter (+/- 0)
test header/value_8b ... bench: 9 ns/iter (+/- 0)
test header/value_16b ... bench: 10 ns/iter (+/- 0)
test header/value_32b ... bench: 10 ns/iter (+/- 0)
test header/value_64b ... bench: 13 ns/iter (+/- 3)
test header/value_128b ... bench: 15 ns/iter (+/- 0)
test header/value_256b ... bench: 20 ns/iter (+/- 0)
test header/value_512b ... bench: 33 ns/iter (+/- 0)
test header/value_1024b ... bench: 57 ns/iter (+/- 0)
test header/value_2048b ... bench: 119 ns/iter (+/- 0)
test header/value_4096b ... bench: 224 ns/iter (+/- 2)
test header/count_1 ... bench: 9 ns/iter (+/- 0)
test header/count_2 ... bench: 17 ns/iter (+/- 0)
test header/count_4 ... bench: 32 ns/iter (+/- 0)
test header/count_8 ... bench: 60 ns/iter (+/- 1)
test header/count_16 ... bench: 141 ns/iter (+/- 2)
test header/count_32 ... bench: 281 ns/iter (+/- 11)
test header/count_64 ... bench: 556 ns/iter (+/- 5)
test header/count_128 ... bench: 1112 ns/iter (+/- 27)
test version/http10 ... bench: 0 ns/iter (+/- 0)
test version/http11 ... bench: 0 ns/iter (+/- 0)
test version/partial ... bench: 1 ns/iter (+/- 0)
test method/get ... bench: 0 ns/iter (+/- 0)
test method/head ... bench: 2 ns/iter (+/- 0)
test method/post ... bench: 0 ns/iter (+/- 0)
test method/put ... bench: 2 ns/iter (+/- 0)
test method/delete ... bench: 3 ns/iter (+/- 0)
test method/connect ... bench: 3 ns/iter (+/- 3)
test method/options ... bench: 3 ns/iter (+/- 0)
test method/trace ... bench: 2 ns/iter (+/- 0)
test method/patch ... bench: 2 ns/iter (+/- 0)
test method/custom ... bench: 3 ns/iter (+/- 0)
I implemented a vectorized header-name validator but without other changes it's a net-negative since in practice header-names are often less than 16 bytes.
Benches with vectorized header-name:
test req/req ... bench: 174 ns/iter (+/- 23)
test req_short/req_short ... bench: 34 ns/iter (+/- 0)
test resp/resp ... bench: 187 ns/iter (+/- 3)
test resp_short/resp_short ... bench: 33 ns/iter (+/- 0)
test header/name_1b ... bench: 10 ns/iter (+/- 0)
test header/name_2b ... bench: 10 ns/iter (+/- 0)
test header/name_4b ... bench: 10 ns/iter (+/- 0)
test header/name_8b ... bench: 10 ns/iter (+/- 0)
test header/name_16b ... bench: 10 ns/iter (+/- 3)
test header/name_32b ... bench: 11 ns/iter (+/- 25)
test header/name_64b ... bench: 13 ns/iter (+/- 0)
test header/name_128b ... bench: 16 ns/iter (+/- 0)
test header/name_256b ... bench: 23 ns/iter (+/- 5)
test header/name_512b ... bench: 36 ns/iter (+/- 0)
test header/name_1024b ... bench: 63 ns/iter (+/- 4)
test header/name_2048b ... bench: 128 ns/iter (+/- 2)
test header/name_4096b ... bench: 237 ns/iter (+/- 8)
test header/count_1 ... bench: 10 ns/iter (+/- 0)
test header/count_2 ... bench: 18 ns/iter (+/- 0)
test header/count_4 ... bench: 41 ns/iter (+/- 0)
test header/count_8 ... bench: 121 ns/iter (+/- 4)
test header/count_16 ... bench: 248 ns/iter (+/- 5)
test header/count_32 ... bench: 498 ns/iter (+/- 10)
test header/count_64 ... bench: 1003 ns/iter (+/- 18)
test header/count_128 ... bench: 2006 ns/iter (+/- 28)
I do think there's room for improvement, specifically the handoff between SIMD and non-SIMD is not ideal right now. I'll investigate that if we can land these other PRs first
@seanmonstar NEON PR is up: https://github.com/seanmonstar/httparse/pull/133
The PRs mostly build off each other and can be landed/reviewed in series.
A 1st pass at improving non-SIMD perf, to land after https://github.com/seanmonstar/httparse/pull/122
On aarch64/M1 we observe a ~5x asymptotic improvement in uri parsing and ~2.5x in header names & values.
We also observe a -40% time reduction in the
req/req
bench.I briefly benched it on x86_64, we see a -20% time reduction in the
req/req
bench there for non-SIMD, needs more testing/tweaks on x86_64 and possibly only enabling these fastpaths if SIMD is offBench results
Prev baseline (from #122)