Closed · AaronO closed this 1 year ago
cc @Noah-Kennedy
A small enum in lieu of a function pointer is marginally faster thanks to branch prediction, observed on header/count (a sketch of the idea follows the numbers below):
```
test header/count_1   ... bench:    21 ns/iter (+/- 5)
test header/count_2   ... bench:    35 ns/iter (+/- 5)
test header/count_4   ... bench:    66 ns/iter (+/- 2)
test header/count_8   ... bench:   130 ns/iter (+/- 53)
test header/count_16  ... bench:   259 ns/iter (+/- 80)
test header/count_32  ... bench:   499 ns/iter (+/- 43)
test header/count_64  ... bench:   978 ns/iter (+/- 195)
test header/count_128 ... bench:  1938 ns/iter (+/- 116)
```
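For context, here is a minimal sketch of the pattern as I understand it; all names are hypothetical stand-ins, not the PR's actual code. Dispatching through a small `Copy` enum keeps the call direct and predictable, while a `fn` pointer forces an indirect call the optimizer cannot see through:

```rust
// Sketch of enum dispatch vs. function-pointer dispatch.
// All names here are hypothetical, not httparse's internals.
#[derive(Clone, Copy)]
enum Matcher {
    /// Fallback scalar/SWAR path.
    Swar,
    /// Runtime-detected SSE4.2 path.
    Sse42,
}

impl Matcher {
    #[inline]
    fn match_bytes(self, bytes: &[u8]) -> usize {
        // The variant is fixed after feature detection, so this branch is
        // essentially free once the predictor has seen it, and both arms
        // can be inlined, unlike an indirect call.
        match self {
            Matcher::Swar => match_swar(bytes),
            Matcher::Sse42 => match_sse42(bytes),
        }
    }
}

// Stand-in implementations so the sketch compiles.
fn match_swar(bytes: &[u8]) -> usize {
    bytes.len()
}
fn match_sse42(bytes: &[u8]) -> usize {
    bytes.len()
}

// The alternative being replaced: dispatch through a function pointer,
// an opaque indirect call the optimizer cannot devirtualize here.
fn match_via_ptr(f: fn(&[u8]) -> usize, bytes: &[u8]) -> usize {
    f(bytes)
}
```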
@seanmonstar Squashed to a single commit, `cleanup: simd runtime detection`, since it's more of a cleanup than a perf improvement: we reverted to the atomic (which shouldn't be an issue in absolute terms, but I'd rather fine-tune minimizing the overhead of runtime feature detection in a separate PR).
I know when I originally added SIMD support to this crate, the `is_x86_feature_detected!` macro did not get inlined, so the function call was slower than caching the result in an atomic locally. Inline attributes were added later, so it could be that the cache is no longer worth keeping. Would be good to measure.
> I know when I originally added SIMD support to this crate, the `is_x86_feature_detected!` macro did not get inlined, so the function call was slower than caching the result in an atomic locally. Inline attributes were added later, so it could be that the cache is no longer worth keeping. Would be good to measure.
I did assembly dumps and it is inlined. It still requires more fine-tuning and analysis, which I think would be best addressed in its own PR.
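For reference, a minimal sketch of the atomic-cache pattern being discussed; the constants and function names here are mine, not the crate's actual code:

```rust
// Cache the result of runtime CPU feature detection in an atomic so
// repeat calls are a single relaxed load plus a predictable branch.
use std::sync::atomic::{AtomicU8, Ordering};

const UNKNOWN: u8 = 0;
const AVAILABLE: u8 = 1;
const UNAVAILABLE: u8 = 2;

static SSE42: AtomicU8 = AtomicU8::new(UNKNOWN);

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[inline]
fn sse42_available() -> bool {
    match SSE42.load(Ordering::Relaxed) {
        AVAILABLE => true,
        UNAVAILABLE => false,
        _ => {
            // First call: run detection once and cache the answer.
            let detected = is_x86_feature_detected!("sse4.2");
            SSE42.store(
                if detected { AVAILABLE } else { UNAVAILABLE },
                Ordering::Relaxed,
            );
            detected
        }
    }
}
```

For what it's worth, std's runtime detection also caches its results in a global, so once `is_x86_feature_detected!` inlines properly, a local cache like the one above may be largely redundant, which is exactly what measuring should settle.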
Also a cleanup; builds off #131.
We can see the overhead improvements in URI parsing for smaller values (where the overhead is relatively significant), and we can see it compound in header/count, which accumulates the overhead of jumping in and out of SIMD.
(benchmark charts: header/count and uri)