seanmonstar / httparse

A push parser for the HTTP 1.x protocol in Rust.
https://docs.rs/httparse
Apache License 2.0

perf: fix SIMD-inlining #131

Closed AaronO closed 1 year ago

AaronO commented 1 year ago

Drastically improves throughput on larger inputs (3x+ for large URIs or header values).

There are 2 optimizations in this PR:

  1. Removing two unnecessary instructions when computing trailing_zeros / bytes-validated. We don't need to OR the upper half of the register with 0xFF; we can instead compute trailing zeros on just the meaningful bits by using eax (u32) instead of rax (u64) for AVX2, and ax (u16) instead of eax (u32) for SSE4.2.
  2. Correctly scoping target_feature pragmas so that the SIMD validators can be inlined; when they are called in a loop we then benefit from greater register reuse, etc.
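
To make the two changes concrete, here is a minimal sketch rather than the PR's actual diff. All names (`tz_sse42_old`, `valid_prefix`, the `sse42` module) and the toy "a space ends the token" rule are hypothetical, and the sketch assumes that a set mask bit marks a byte that failed validation. The first four functions compare the old and new trailing-zeros computation on a plain `i32` mask of the kind the movemask intrinsics return; the rest shows `#[target_feature]` placed only on the outer loop, with the per-block validator marked `#[inline(always)]`.

```rust
// Hypothetical illustration; none of these names come from httparse.

/// SSE4.2, old style: widen the 16-bit mask to u32 and OR the unused upper
/// half with ones so that an all-zero mask yields 16.
pub fn tz_sse42_old(mask: i32) -> u32 {
    (0xffff_0000u32 | mask as u32).trailing_zeros()
}

/// SSE4.2, new style: count on the 16 meaningful bits directly.
pub fn tz_sse42_new(mask: i32) -> u32 {
    (mask as u16).trailing_zeros()
}

/// AVX2, old style: widen the 32-bit mask to u64 and OR the upper half.
pub fn tz_avx2_old(mask: i32) -> u32 {
    (0xffff_ffff_0000_0000u64 | mask as u32 as u64).trailing_zeros()
}

/// AVX2, new style: count on the 32 meaningful bits directly.
pub fn tz_avx2_new(mask: i32) -> u32 {
    (mask as u32).trailing_zeros()
}

#[cfg(target_arch = "x86_64")]
mod sse42 {
    use std::arch::x86_64::*;

    /// Per-block validator. It carries no #[target_feature] attribute, so
    /// #[inline(always)] can fold it into the feature-enabled loop below.
    #[inline(always)]
    unsafe fn valid_prefix_16(ptr: *const u8) -> usize {
        let block = _mm_loadu_si128(ptr as *const __m128i);
        // Toy rule (an assumption of this sketch): a space is "invalid".
        let bad = _mm_cmpeq_epi8(block, _mm_set1_epi8(b' ' as i8));
        (_mm_movemask_epi8(bad) as u16).trailing_zeros() as usize
    }

    /// The feature is enabled once, on the outer loop. Returns how many
    /// bytes were validated using full 16-byte blocks.
    #[target_feature(enable = "sse4.2")]
    pub unsafe fn valid_prefix_blocks(buf: &[u8]) -> usize {
        let mut i = 0;
        while buf.len() - i >= 16 {
            let n = valid_prefix_16(buf.as_ptr().add(i));
            i += n;
            if n < 16 {
                return i; // stopped on an invalid byte
            }
        }
        i
    }
}

/// Number of leading non-space bytes; uses SSE4.2 when it is available.
pub fn valid_prefix(buf: &[u8]) -> usize {
    let mut i = 0;
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("sse4.2") {
            i = unsafe { sse42::valid_prefix_blocks(buf) };
        }
    }
    // Scalar tail, which is also the fallback path on other targets.
    while i < buf.len() && buf[i] != b' ' {
        i += 1;
    }
    i
}
```

For example, `valid_prefix(b"GET /index.html")` counts the bytes before the first space. On targets without SSE4.2 the same call simply runs the scalar loop, so the old and new mask helpers can be compared on any machine.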

Benchmarks

Summary table

(Disclaimer: aggregated by ChatGPT, which "computed" the ratio rows which aren't exactly correct but close enough)

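| Test (ns/iter) | 128b | 256b | 512b | 1024b | 2048b | 4096b |
|---|---|---|---|---|---|---|
| **Before** | | | | | | |
| Header | 38 | 66 | 123 | 263 | 484 | 946 |
| URI | 19 | 44 | 116 | 237 | 465 | 937 |
| **After** | | | | | | |
| Header | 30 | 39 | 55 | 88 | 193 | 300 |
| URI | 12 | 20 | 35 | 65 | 127 | 270 |
| **Improvement** | | | | | | |
| Header Ratio | ~1.5x | ~1.5x | ~2.0x | ~3.0x | ~2.5x | ~3.0x |
| URI Ratio | ~1.5x | ~2.0x | ~3.5x | ~3.5x | ~3.5x | ~3.5x |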
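
Since the ratio rows are only approximate, the exact speedups can be recomputed from the ns/iter figures in the table. The snippet below is a small sketch for doing that; the function and constant names are made up for illustration, and the numbers are copied from the Before/After rows.

```rust
/// Speedup per input size: before / after, both in ns/iter.
pub fn speedups(before: &[f64], after: &[f64]) -> Vec<f64> {
    before.iter().zip(after).map(|(b, a)| b / a).collect()
}

// Columns: 128b, 256b, 512b, 1024b, 2048b, 4096b.
pub const HEADER_BEFORE: [f64; 6] = [38.0, 66.0, 123.0, 263.0, 484.0, 946.0];
pub const HEADER_AFTER: [f64; 6] = [30.0, 39.0, 55.0, 88.0, 193.0, 300.0];
pub const URI_BEFORE: [f64; 6] = [19.0, 44.0, 116.0, 237.0, 465.0, 937.0];
pub const URI_AFTER: [f64; 6] = [12.0, 20.0, 35.0, 65.0, 127.0, 270.0];

/// Formats one row of ratios, e.g. for printing next to the table.
pub fn format_row(label: &str, ratios: &[f64]) -> String {
    let cells: Vec<String> = ratios.iter().map(|r| format!("{:.2}x", r)).collect();
    format!("{} {}", label, cells.join(" "))
}
```

Calling `speedups(&HEADER_BEFORE, &HEADER_AFTER)` gives roughly 1.27x at 128b rising to about 3.15x at 4096b, and the URI row runs from about 1.58x to about 3.66x, so the rounded rows overstate the small-input header gains slightly but match the large-input trend.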

Raw benches

before:
test header/value_128b ... bench:           38 ns/iter (+/- 3)
test header/value_256b ... bench:           66 ns/iter (+/- 0)
test header/value_512b ... bench:           123 ns/iter (+/- 2)
test header/value_1024b ... bench:          263 ns/iter (+/- 13)
test header/value_2048b ... bench:          484 ns/iter (+/- 19)
test header/value_4096b ... bench:          946 ns/iter (+/- 7)

test uri/uri_128b ... bench:          19 ns/iter (+/- 3)
test uri/uri_256b ... bench:          44 ns/iter (+/- 1)
test uri/uri_512b ... bench:         116 ns/iter (+/- 1)
test uri/uri_1024b ... bench:         237 ns/iter (+/- 3)
test uri/uri_2048b ... bench:         465 ns/iter (+/- 3)
test uri/uri_4096b ... bench:         937 ns/iter (+/- 58)

after:
test header/value_128b ... bench:           30 ns/iter (+/- 1)
test header/value_256b ... bench:           39 ns/iter (+/- 1)
test header/value_512b ... bench:           55 ns/iter (+/- 2)
test header/value_1024b ... bench:          88 ns/iter (+/- 4)
test header/value_2048b ... bench:          193 ns/iter (+/- 49)
test header/value_4096b ... bench:          300 ns/iter (+/- 4)

test uri/uri_128b ... bench:          12 ns/iter (+/- 3)
test uri/uri_256b ... bench:          20 ns/iter (+/- 0)
test uri/uri_512b ... bench:          35 ns/iter (+/- 1)
test uri/uri_1024b ... bench:          65 ns/iter (+/- 4)
test uri/uri_2048b ... bench:         127 ns/iter (+/- 2)
test uri/uri_4096b ... bench:         270 ns/iter (+/- 36)