cleanup: SIMD runtime detection

seanmonstar / httparse

A push parser for the HTTP 1.x protocol in Rust.

https://docs.rs/httparse

Apache License 2.0

cleanup: SIMD runtime detection #132

Closed AaronO closed 1 year ago

AaronO commented 1 year ago

Also cleanup, builds off #131

The reduced overhead shows up in uri parsing for smaller inputs (where the overhead is relatively significant), and it compounds in header/count, which accumulates the overhead of jumping in and out of SIMD for every header.

header/count (ns/iter)

	1	2	4	8	16	32	64	128
Before	22	39	77	144	283	578	1092	2159
After	21	37	71	135	271	568	1025	2034

uri

	1b	2b	4b	8b	16b	32b	64b	128b	256b	512b	1024b	2048b	4096b
Before	7	8	9	11	8	6	7	11	19	34	67	127	270
After	5	5	7	9	6	5	6	9	20	31	60	119	255

seanmonstar commented 1 year ago

cc @Noah-Kennedy

AaronO commented 1 year ago

A small enum in lieu of a function pointer is marginally better thanks to branch prediction, as observed on header/count:

test header/count_1 ... bench:          21 ns/iter (+/- 5)
test header/count_2 ... bench:          35 ns/iter (+/- 5)
test header/count_4 ... bench:          66 ns/iter (+/- 2)
test header/count_8 ... bench:         130 ns/iter (+/- 53)
test header/count_16 ... bench:         259 ns/iter (+/- 80)
test header/count_32 ... bench:         499 ns/iter (+/- 43)
test header/count_64 ... bench:         978 ns/iter (+/- 195)
test header/count_128 ... bench:        1938 ns/iter (+/- 116)
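
For readers unfamiliar with the enum versus function pointer trade-off mentioned above, here is a minimal sketch. It is not the code from this PR: `Backend`, `count_spaces_scalar`, and `count_spaces_wide` are hypothetical names, and the "wide" routine is a plain chunked loop standing in for a real SIMD path. The idea is that a `match` on a small enum compiles to direct calls behind a predictable branch, while a function pointer forces an indirect call.

```rust
/// Hypothetical scalar routine: counts ASCII spaces one byte at a time.
fn count_spaces_scalar(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| b == b' ').count()
}

/// Hypothetical stand-in for a SIMD routine: walks 8-byte chunks,
/// then hands the tail to the scalar routine.
fn count_spaces_wide(bytes: &[u8]) -> usize {
    let mut chunks = bytes.chunks_exact(8);
    let mut n = 0;
    for chunk in chunks.by_ref() {
        n += chunk.iter().filter(|&&b| b == b' ').count();
    }
    n + count_spaces_scalar(chunks.remainder())
}

/// Small enum naming the selected implementation (illustrative only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Backend {
    Scalar,
    Wide,
}

impl Backend {
    /// Enum dispatch: each arm is a direct call the compiler can inline,
    /// and the branch itself is easy to predict once the backend is fixed.
    #[inline]
    fn count_spaces(self, bytes: &[u8]) -> usize {
        match self {
            Backend::Scalar => count_spaces_scalar(bytes),
            Backend::Wide => count_spaces_wide(bytes),
        }
    }
}

/// Function pointer dispatch, shown for comparison: callers go through
/// an indirect call that usually cannot be inlined.
type CountFn = fn(&[u8]) -> usize;

fn as_fn_ptr(backend: Backend) -> CountFn {
    match backend {
        Backend::Scalar => count_spaces_scalar,
        Backend::Wide => count_spaces_wide,
    }
}
```

Both forms return the same results. Whether the enum is actually faster depends on the workload and the target, so it is worth benchmarking as was done above.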

AaronO commented 1 year ago

@seanmonstar Squashed to a single commit, "cleanup: simd runtime detection", since it's more of a cleanup than a perf improvement now that we have reverted to the atomic. The atomic shouldn't be an issue in absolute terms, but I would rather fine-tune the overhead of runtime feature detection in a separate PR.

seanmonstar commented 1 year ago

I know when I originally added SIMD support to this crate, the is_x86_feature_detected! macro did not get inlined, so the function call was slower than caching in an atomic locally. Inline attributes were later added, so it could be that the cache is no longer worth keeping. Would be good to measure.

AaronO commented 1 year ago

> I know when I originally added SIMD support to this crate, the is_x86_feature_detected! macro did not get inlined, so the function call was slower than caching in an atomic locally. Inline attributes were later added, so it could be that the cache is no longer worth keeping. Would be good to measure.

I did assembly dumps and it is inlined. It still requires more fine-tuning and analysis, which I think would be best addressed in its own PR.
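
As an illustration of the "cache the detection result in an atomic" approach discussed in this thread, here is a minimal sketch. It is not the crate's actual code: the constants, `CACHED`, `detect_uncached`, and `detect` are hypothetical names, and the choice of "avx2" and "sse4.2" as the probed features is an assumption made for the example. Only `is_x86_feature_detected!` and `AtomicU8` come from the standard library.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical encoding of the detected feature level.
const UNKNOWN: u8 = 0;
const SCALAR: u8 = 1;
const SSE42: u8 = 2;
const AVX2: u8 = 3;

// Process-wide cache; UNKNOWN means detection has not run yet.
static CACHED: AtomicU8 = AtomicU8::new(UNKNOWN);

/// Asks the standard library which features the CPU supports.
/// On non-x86 targets the cfg block is compiled out and SCALAR is returned.
fn detect_uncached() -> u8 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return AVX2;
        }
        if is_x86_feature_detected!("sse4.2") {
            return SSE42;
        }
    }
    SCALAR
}

/// Returns the cached level, running detection on first use.
/// Relaxed ordering is enough here: a racing thread would at worst
/// repeat the detection and store the same value.
fn detect() -> u8 {
    match CACHED.load(Ordering::Relaxed) {
        UNKNOWN => {
            let level = detect_uncached();
            CACHED.store(level, Ordering::Relaxed);
            level
        }
        level => level,
    }
}
```

After the first call, `detect()` costs one relaxed atomic load plus a branch. Whether that still beats calling the macro directly, now that it is inlined, is the open question raised above and would need to be measured.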