AaronO closed this 2 months ago
@seanmonstar @lucab I'll bench this against #175 for completeness' sake, but this is substantially simpler and more focused, should not regress aarch64 etc., and should be faster overall.
TLDR: this demonstrates #181 provides the bulk of #175's benefits, with trivial/minimal focused changes on the core sse42/avx2 interplay issue.
> critcmp -t=5 pr-175 pr-181
group pr-175 pr-181
----- ------ ------
header/count_128 1.18 1513.1±353.72ns 485.3 MB/sec 1.00 1277.3±64.41ns 574.9 MB/sec
header/name_1024b 1.00 310.5±7.39ns 3.1 GB/sec 1.08 335.8±10.04ns 2.9 GB/sec
header/name_4096b 1.00 1145.4±22.27ns 3.3 GB/sec 1.11 1266.2±35.40ns 3.0 GB/sec
header/value_1024b 1.12 60.6±7.16ns 15.8 GB/sec 1.00 54.0±6.61ns 17.8 GB/sec
header/value_2048b 1.19 111.3±13.59ns 17.2 GB/sec 1.00 93.9±5.99ns 20.4 GB/sec
header/value_4096b 1.17 228.6±54.54ns 16.7 GB/sec 1.00 194.8±52.47ns 19.6 GB/sec
method/custom 1.07 4.7±0.25ns 3.8 GB/sec 1.00 4.4±0.23ns 4.1 GB/sec
method/delete 1.08 4.7±0.46ns 3.7 GB/sec 1.00 4.4±0.31ns 4.0 GB/sec
method/head 1.13 3.9±0.61ns 4.1 GB/sec 1.00 3.4±0.15ns 4.6 GB/sec
method/patch 1.07 5.1±1.45ns 3.3 GB/sec 1.00 4.8±1.25ns 3.5 GB/sec
req_short/req_short 1.00 49.4±1.69ns 1312.6 MB/sec 1.14 56.2±8.88ns 1154.1 MB/sec
resp/resp 1.00 177.3±10.63ns 3.7 GB/sec 1.11 196.1±4.99ns 3.3 GB/sec
resp_short/resp_short 1.00 56.4±9.72ns 1539.2 MB/sec 1.32 74.2±42.30ns 1169.3 MB/sec
uri/uri_1024b 1.09 38.2±4.97ns 25.0 GB/sec 1.00 35.2±1.76ns 27.1 GB/sec
uri/uri_2048b 1.18 76.4±2.88ns 25.0 GB/sec 1.00 64.8±1.26ns 29.4 GB/sec
uri/uri_4096b 1.08 141.6±8.17ns 26.9 GB/sec 1.00 131.3±4.65ns 29.1 GB/sec
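For anyone reading the critcmp output above for the first time: the number in front of each timing is that baseline's time relative to the fastest baseline for the same benchmark, so 1.00 marks the winner and 1.18 means roughly 18% slower. A minimal sketch of that arithmetic, where `relative_ratios` is a hypothetical helper for illustration and not part of critcmp:

```rust
/// Hypothetical helper (not critcmp's code): ratio of each baseline's mean
/// time to the fastest baseline in the same row. 1.00 means "fastest".
pub fn relative_ratios(times_ns: &[f64]) -> Vec<f64> {
    // Smallest time in the row is the reference point.
    let best = times_ns.iter().cloned().fold(f64::INFINITY, f64::min);
    times_ns.iter().map(|t| t / best).collect()
}
```

Feeding it the header/count_128 means from the table (1513.1 ns for pr-175, 1277.3 ns for pr-181) gives the 1.18 and 1.00 shown in that row. Rows where the printed times are rounded to one decimal may not reproduce the printed ratio exactly.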
@lucab @seanmonstar I think we should land this: no regressions on aarch64, and a focused change on the core problem. The other improvements #175 provided can be explored in focused follow-ups.
For good measure, I compared master and #181 built with target-cpu=native (basically exercising simd::avx2, bypassing simd::runtime) to test the perf upper bound on x64:
> critcmp -t=5 main-native pr-181-native
group main-native pr-181-native
----- ----------- -------------
header/count_001 1.00 18.6±2.29ns 410.1 MB/sec 1.20 22.4±3.86ns 341.3 MB/sec
header/count_004 1.00 46.8±14.86ns 529.5 MB/sec 1.20 56.3±22.04ns 440.4 MB/sec
header/count_008 1.74 123.7±24.41ns 385.6 MB/sec 1.00 71.0±5.33ns 671.9 MB/sec
header/count_016 1.24 194.6±54.63ns 480.2 MB/sec 1.00 157.5±6.81ns 593.2 MB/sec
header/count_032 1.21 367.6±78.38ns 503.4 MB/sec 1.00 303.4±10.46ns 609.9 MB/sec
header/count_064 1.13 664.8±107.64ns 553.7 MB/sec 1.00 589.3±15.97ns 624.7 MB/sec
header/count_128 1.17 1388.5±303.20ns 528.9 MB/sec 1.00 1185.1±59.27ns 619.6 MB/sec
header/name_0004b 1.00 19.0±2.72ns 552.0 MB/sec 1.09 20.7±3.19ns 507.3 MB/sec
header/name_0064b 1.00 36.3±2.35ns 1863.8 MB/sec 1.08 39.2±9.02ns 1728.0 MB/sec
header/name_0128b 1.05 57.0±3.22ns 2.2 GB/sec 1.00 54.0±2.86ns 2.3 GB/sec
header/name_0256b 1.00 98.6±5.44ns 2.5 GB/sec 1.06 104.9±21.57ns 2.3 GB/sec
header/name_0512b 1.07 191.4±16.16ns 2.5 GB/sec 1.00 179.5±5.83ns 2.7 GB/sec
header/name_1024b 1.07 358.0±16.83ns 2.7 GB/sec 1.00 333.3±14.38ns 2.9 GB/sec
header/name_4096b 1.15 1450.0±259.57ns 2.6 GB/sec 1.00 1257.4±46.92ns 3.0 GB/sec
header/value_0004b 1.00 19.4±2.06ns 539.9 MB/sec 1.15 22.4±6.94ns 468.5 MB/sec
header/value_0008b 1.00 19.0±2.06ns 753.7 MB/sec 1.37 26.0±7.92ns 550.0 MB/sec
header/value_0256b 1.00 50.9±9.02ns 4.8 GB/sec 1.13 57.6±56.91ns 4.3 GB/sec
method/custom 1.00 4.7±0.27ns 3.7 GB/sec 1.15 5.4±1.09ns 3.2 GB/sec
method/delete 1.00 4.7±0.21ns 3.8 GB/sec 1.11 5.2±1.16ns 3.4 GB/sec
method/patch 1.00 4.6±1.08ns 3.6 GB/sec 1.06 4.9±1.50ns 3.4 GB/sec
method/post 1.33 2.9±0.48ns 5.5 GB/sec 1.00 2.2±0.08ns 7.3 GB/sec
method/trace 1.00 4.5±0.97ns 3.8 GB/sec 1.07 4.8±1.40ns 3.5 GB/sec
resp/resp 1.07 217.5±12.68ns 3.0 GB/sec 1.00 204.0±5.14ns 3.2 GB/sec
uri/uri_0001b 1.80 5.6±0.24ns 340.5 MB/sec 1.00 3.1±0.14ns 612.8 MB/sec
uri/uri_0002b 1.83 6.8±1.30ns 418.4 MB/sec 1.00 3.7±0.18ns 765.2 MB/sec
uri/uri_0004b 1.46 8.4±0.24ns 569.1 MB/sec 1.00 5.7±0.80ns 833.5 MB/sec
uri/uri_0008b 1.81 5.6±0.35ns 1524.5 MB/sec 1.00 3.1±0.06ns 2.7 GB/sec
uri/uri_0016b 1.55 7.8±2.70ns 2.0 GB/sec 1.00 5.1±1.61ns 3.1 GB/sec
uri/uri_0032b 1.71 5.9±0.26ns 5.2 GB/sec 1.00 3.5±0.29ns 8.9 GB/sec
uri/uri_0064b 1.56 6.8±0.34ns 8.9 GB/sec 1.00 4.4±0.30ns 13.9 GB/sec
uri/uri_0128b 1.25 11.9±0.46ns 10.1 GB/sec 1.00 9.5±0.41ns 12.6 GB/sec
uri/uri_0256b 1.27 28.0±1.08ns 8.5 GB/sec 1.00 22.1±1.66ns 10.8 GB/sec
uri/uri_0512b 1.18 64.6±12.17ns 7.4 GB/sec 1.00 54.6±1.71ns 8.7 GB/sec
uri/uri_2048b 1.00 350.4±9.02ns 5.4 GB/sec 1.16 406.1±4.22ns 4.7 GB/sec
version/partial 1.00 3.0±0.10ns 2.2 GB/sec 1.05 3.1±0.07ns 2.1 GB/sec
TLDR: 2-line change => 2x faster req/req (doesn't raise the ceiling substantially, but fixes the perf issue of the generic x64 build, which uses runtime dispatch).
This has massive implications for the perf of the default simd::runtime::* path (x64 generic build), improving how the code is lowered/inlined. (Falling back to SSE4.2 for a handful of bytes was wasteful.)
Should supersede #175, #156.
Benchmarks run on a GH CodeSpace (4-core / 16GB; 4 cores of a 64-core AMD EPYC 7763 host CPU).
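To illustrate the runtime-dispatch point above, here is a rough sketch of the shape being described: detect AVX2 once at the call site, process 32-byte blocks with AVX2, then finish the last few bytes with a plain scalar loop rather than a second SSE4.2 pass. This assumes that is roughly what the change amounts to. All names (`match_prefix`, `match_avx2`, `match_scalar`, `is_token_byte`) and the simplified byte class are hypothetical and are not httparse's actual code.

```rust
// Illustrative sketch only: names and the byte class are hypothetical,
// not httparse's actual implementation.

/// Simplified accepted-byte check: visible ASCII (0x21..=0x7e).
fn is_token_byte(b: u8) -> bool {
    b > 0x20 && b < 0x7f
}

/// Scalar path: length of the leading run of accepted bytes.
fn match_scalar(bytes: &[u8]) -> usize {
    bytes.iter().take_while(|&&b| is_token_byte(b)).count()
}

/// AVX2 path: scan 32 bytes at a time, then hand the tail to the scalar loop.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn match_avx2(bytes: &[u8]) -> usize {
    use std::arch::x86_64::*;
    let mut i = 0;
    while i + 32 <= bytes.len() {
        // SAFETY: i + 32 <= len, so the unaligned 32-byte load is in bounds.
        let valid = unsafe {
            let chunk = _mm256_loadu_si256(bytes.as_ptr().add(i) as *const __m256i);
            // Signed compare: bytes >= 0x80 are negative, so they fail `> 0x20`.
            let gt = _mm256_cmpgt_epi8(chunk, _mm256_set1_epi8(0x20));
            // 0x7f (DEL) passes the compare above, so mask it out separately.
            let del = _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0x7f));
            (_mm256_movemask_epi8(gt) as u32) & !(_mm256_movemask_epi8(del) as u32)
        };
        if valid != u32::MAX {
            // First zero bit in the mask is the first rejected byte.
            return i + (!valid).trailing_zeros() as usize;
        }
        i += 32;
    }
    // Tail of fewer than 32 bytes: scalar loop, no SSE4.2 pass.
    i + match_scalar(&bytes[i..])
}

/// Runtime dispatch: use AVX2 when the CPU reports it, otherwise scalar.
pub fn match_prefix(bytes: &[u8]) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just checked at runtime.
            return unsafe { match_avx2(bytes) };
        }
    }
    match_scalar(bytes)
}
```

In this sketch, `match_prefix(b"/index.html HTTP/1.1")` stops at the space and returns 11 on either path. Building with target-cpu=native would let the compiler pick the AVX2 path statically, which is the upper-bound comparison made in the second table.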