seanmonstar / httparse

A push parser for the HTTP 1.x protocol in Rust.
https://docs.rs/httparse
Apache License 2.0

cleanup: SIMD runtime detection #132

Closed AaronO closed 1 year ago

AaronO commented 1 year ago

Also cleanup, builds off #131

The reduced overhead shows up in uri parsing for small inputs (where the overhead is relatively significant), and it compounds in header/count, which accumulates the cost of jumping in and out of SIMD.

header/count

         1   2   4    8   16   32    64   128
Before  22  39  77  144  283  578  1092  2159
After   21  37  71  135  271  568  1025  2034

uri

        1b  2b  4b  8b  16b  32b  64b  128b  256b  512b  1024b  2048b  4096b
Before   7   8   9  11    8    6    7    11    19    34     67    127    270
After    5   5   7   9    6    5    6     9    20    31     60    119    255

seanmonstar commented 1 year ago

cc @Noah-Kennedy

AaronO commented 1 year ago

A small enum in lieu of a function pointer is marginally better thanks to branch prediction, as observed on header/count:

test header/count_1 ... bench:          21 ns/iter (+/- 5)
test header/count_2 ... bench:          35 ns/iter (+/- 5)
test header/count_4 ... bench:          66 ns/iter (+/- 2)
test header/count_8 ... bench:         130 ns/iter (+/- 53)
test header/count_16 ... bench:         259 ns/iter (+/- 80)
test header/count_32 ... bench:         499 ns/iter (+/- 43)
test header/count_64 ... bench:         978 ns/iter (+/- 195)
test header/count_128 ... bench:        1938 ns/iter (+/- 116)
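
For readers comparing the two dispatch styles mentioned above, here is a minimal sketch. It is not httparse's actual code: `Backend`, `scan_scalar`, `scan_wide`, `scan_via_ptr`, and `scan_via_enum` are hypothetical names, and `scan_wide` is a plain-Rust stand-in for a real SIMD routine. It only illustrates selecting an implementation through a small enum `match` instead of an indirect call through a function pointer.

```rust
// Hypothetical sketch: names are illustrative, not httparse internals.

/// The set of available implementations, chosen once at runtime.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Backend {
    Scalar,
    Wide,
}

/// Scalar path: number of leading bytes before the first space.
fn scan_scalar(bytes: &[u8]) -> usize {
    bytes.iter().take_while(|&&b| b != b' ').count()
}

/// Stand-in for a SIMD path (assumption: real code would use intrinsics).
/// Skips whole 8-byte chunks without a space, then finishes with the scalar path.
fn scan_wide(bytes: &[u8]) -> usize {
    let mut i = 0;
    for chunk in bytes.chunks_exact(8) {
        if chunk.contains(&b' ') {
            break;
        }
        i += 8;
    }
    i + scan_scalar(&bytes[i..])
}

/// Function-pointer dispatch: the call target is only known at runtime,
/// so this is an indirect call.
fn scan_via_ptr(f: fn(&[u8]) -> usize, bytes: &[u8]) -> usize {
    f(bytes)
}

/// Enum dispatch: a `match` over a small, closed set of variants,
/// where each arm is a direct call the compiler can see through.
fn scan_via_enum(backend: Backend, bytes: &[u8]) -> usize {
    match backend {
        Backend::Scalar => scan_scalar(bytes),
        Backend::Wide => scan_wide(bytes),
    }
}
```

Both styles return the same result for the same input; any difference is in call overhead, which is why it would only be visible in a tight benchmark such as header/count.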

AaronO commented 1 year ago

@seanmonstar Squashed to a single commit, "cleanup: simd runtime detection", since it's more of a cleanup than a perf improvement now that we've reverted to the atomic. The atomic shouldn't be an issue in absolute terms, but I would rather fine-tune the overhead of runtime feature detection in a separate PR.

seanmonstar commented 1 year ago

I know when I originally added SIMD support to this crate, the is_x86_feature_detected! macro did not get inlined, so the function call was slower than caching in an atomic locally. Inline attributes were later added, so it could be that the cache is no longer worth keeping. Would be good to measure.
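
As an illustration of the caching pattern described here, the following is a minimal sketch, not the crate's actual implementation. The constants, the `FEATURE` static, and the `detect` / `detect_uncached` functions are hypothetical names, and the choice of probing AVX2 before SSE4.2 is an assumption made for the example. The only real API used is the standard library's `is_x86_feature_detected!` macro and `AtomicU8`.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical encoding of the cached detection result.
const UNINIT: u8 = 0;
const NONE: u8 = 1;
const SSE42: u8 = 2;
const AVX2: u8 = 3;

// Process-wide cache; starts out as "not yet detected".
static FEATURE: AtomicU8 = AtomicU8::new(UNINIT);

/// Ask the standard library directly (no local cache).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detect_uncached() -> u8 {
    if is_x86_feature_detected!("avx2") {
        AVX2
    } else if is_x86_feature_detected!("sse4.2") {
        SSE42
    } else {
        NONE
    }
}

/// On other architectures there is nothing to detect in this sketch.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detect_uncached() -> u8 {
    NONE
}

/// Cached detection: one relaxed atomic load on the fast path.
/// A race between threads is harmless because every thread computes
/// and stores the same value.
fn detect() -> u8 {
    let cached = FEATURE.load(Ordering::Relaxed);
    if cached != UNINIT {
        return cached;
    }
    let detected = detect_uncached();
    FEATURE.store(detected, Ordering::Relaxed);
    detected
}
```

Whether `detect` is still faster than calling `detect_uncached` on every parse is exactly the open question in this thread; the sketch only shows the shape of the two alternatives that would need to be benchmarked.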

AaronO commented 1 year ago

> I know when I originally added SIMD support to this crate, the is_x86_feature_detected! macro did not get inlined, so the function call was slower than caching in an atomic locally. Inline attributes were later added, so it could be that the cache is no longer worth keeping. Would be good to measure.

I did assembly dumps and it is inlined. It still requires more fine-tuning and analysis, which I think would be best addressed in its own PR.