seanmonstar / httparse

A push parser for the HTTP 1.x protocol in Rust.
https://docs.rs/httparse
Apache License 2.0

use SSE4.2 and AVX2 instructions when available #40

Closed seanmonstar closed 6 years ago

seanmonstar commented 6 years ago

This takes over where #38 left off (huge thanks @kamyuentse), and gets it working on Rust 1.27.0 while keeping our minimum compiler version (1.10.0!). It might seem a little weird, so I'll explain how it does both runtime and compile-time detection for maximum performance.

Runtime detection

The SIMD support stabilized in Rust 1.27 includes is_x86_feature_detected!, which allows checking whether a given target feature is enabled. Internally, it uses the unstable cfg(target_feature), but it can also query the CPU at runtime. As of 1.27, the runtime check isn't inlined, which meant that adding SIMD support was actually slower than leaving it disabled.

A patch to the stdsimd crate has already landed to inline the checks, but in the meantime, httparse uses its own inlined cache. After querying the macros once, the detected feature set is stored in a local atomic, and checking that atomic results in an overall speed improvement!
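As a rough illustration of the cache described above, here is a minimal sketch. The constant and function names (`UNINIT`, `FEATURE`, `cached_feature`, and so on) are hypothetical and are not httparse's actual internals; only `is_x86_feature_detected!` and `AtomicUsize` come from the standard library.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical encoding of the cached result (not httparse's real values).
pub const UNINIT: usize = 0;
pub const NONE: usize = 1;
pub const SSE42: usize = 2;
pub const AVX2: usize = 3;

// One process-wide cache slot; starts out "not yet queried".
static FEATURE: AtomicUsize = AtomicUsize::new(UNINIT);

// The slow path: ask the standard library which features the CPU has.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detect() -> usize {
    if is_x86_feature_detected!("avx2") {
        AVX2
    } else if is_x86_feature_detected!("sse4.2") {
        SSE42
    } else {
        NONE
    }
}

// On other architectures there is nothing to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detect() -> usize {
    NONE
}

/// The fast path: a single relaxed atomic load that the compiler can inline.
/// Detection runs only while the slot still holds `UNINIT`.
#[inline]
pub fn cached_feature() -> usize {
    let cached = FEATURE.load(Ordering::Relaxed);
    if cached != UNINIT {
        return cached;
    }
    let detected = detect();
    // A racing thread would store the same value, so a plain store is enough.
    FEATURE.store(detected, Ordering::Relaxed);
    detected
}
```

A parser entry point would call `cached_feature()` and branch on the result to pick the AVX2, SSE4.2, or scalar code path.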

However, using this cache interferes slightly with optimizations the compiler could do when compiling with -C target-cpu=native. That's because the macro internally uses cfg(target_feature), and when that is set, the entire branch can be eliminated.

Compile-time detection

So, we already have a win with runtime detection. This PR also adds support for compile-time detection, even though it isn't stable in Rust 1.27! It takes advantage of the fact that cargo exposes a CARGO_CFG_TARGET_FEATURE environment variable to build scripts.

So, the new build script also looks for that environment variable, and if it detects that someone is compiling with certain features we can use (either sse4.2 or avx2), that information is emitted as custom httparse cfg options.
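The build-script step could look roughly like the sketch below. The cfg names `example_simd_sse42` and `example_simd_avx2` and the helper names are made up for illustration; the PR's real cfg names are not given here. The sketch relies on two documented cargo behaviors: `CARGO_CFG_TARGET_FEATURE` holds a comma-separated feature list, and printing `cargo:rustc-cfg=NAME` from a build script enables `#[cfg(NAME)]` in the crate.

```rust
use std::env;

/// Maps a comma-separated target-feature list to the `cargo:rustc-cfg`
/// lines to print. The cfg names are hypothetical placeholders.
pub fn cfg_lines(target_features: &str) -> Vec<String> {
    let mut lines = Vec::new();
    for feature in target_features.split(',') {
        match feature.trim() {
            "sse4.2" => lines.push("cargo:rustc-cfg=example_simd_sse42".to_string()),
            "avx2" => lines.push("cargo:rustc-cfg=example_simd_avx2".to_string()),
            _ => {} // features this sketch does not care about
        }
    }
    lines
}

/// Intended to be called from `fn main()` in build.rs.
pub fn emit_simd_cfgs() {
    // The variable is absent when the script is run outside of cargo.
    let features = env::var("CARGO_CFG_TARGET_FEATURE").unwrap_or_default();
    for line in cfg_lines(&features) {
        println!("{}", line);
    }
}
```

Keeping the parsing in a pure function (`cfg_lines`) makes it easy to test without setting environment variables.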

Then, the compilation of httparse will use a version that doesn't use our cached feature detection, and just uses is_x86_feature_detected! directly. Since the build script already saw that the feature is enabled, this will in most cases mean the branch is eliminated entirely.
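To make the switch between the two paths concrete, here is a hedged sketch of how a custom cfg could select the implementation. `example_simd_sse42` is a hypothetical cfg name standing in for whatever the build script emits, and `sse42_available` and `detect_sse42` are illustrative names, not httparse functions. Newer toolchains may warn about a cfg name they were not told to expect.

```rust
/// Queries the standard macro directly (x86/x86_64 only).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[inline]
fn detect_sse42() -> bool {
    is_x86_feature_detected!("sse4.2")
}

/// Other architectures never have SSE4.2.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
#[inline]
fn detect_sse42() -> bool {
    false
}

/// Compile-time path: chosen only when the (hypothetical) build-script cfg
/// is set. No cache sits in front of the macro, so the compiler is free to
/// fold the check when the feature is statically enabled.
#[cfg(example_simd_sse42)]
#[inline]
pub fn sse42_available() -> bool {
    detect_sse42()
}

/// Runtime path: the default. The answer is cached in a local atomic.
#[cfg(not(example_simd_sse42))]
#[inline]
pub fn sse42_available() -> bool {
    use std::sync::atomic::{AtomicUsize, Ordering};
    // 0 = not yet queried, 1 = unavailable, 2 = available
    static CACHE: AtomicUsize = AtomicUsize::new(0);
    match CACHE.load(Ordering::Relaxed) {
        0 => {
            let found = detect_sse42();
            CACHE.store(if found { 2 } else { 1 }, Ordering::Relaxed);
            found
        }
        cached => cached == 2,
    }
}
```

Callers use `sse42_available()` the same way in both configurations; only the body chosen at compile time differs.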

Both runtime and compile-time detection in httparse can be disabled, though that is currently meant only for testing (to be able to run the tests with all the various parsing methods in CI).

Benchmark improvements

Pre-1.27 (or when SIMD is explicitly configured off):

bench_httparse       ... bench:    529 ns/iter (+/- 13) = 1328 MB/s
bench_httparse_short ... bench:    66 ns/iter (+/- 1) = 1030 MB/s
bench_pico           ... bench:    492 ns/iter (+/- 11) = 1428 MB/s
bench_pico_short     ... bench:    72 ns/iter (+/- 3) = 944 MB/s 

1.27 with runtime detection (and my CPU has SSE4.2):

bench_httparse       ... bench:    451 ns/iter (+/- 16) = 1558 MB/s
bench_httparse_short ... bench:    70 ns/iter (+/- 8) = 971 MB/s
bench_pico           ... bench:    492 ns/iter (+/- 11) = 1428 MB/s
bench_pico_short     ... bench:    72 ns/iter (+/- 3) = 944 MB/s 

1.27 when setting -C target-cpu=native (and my CPU has SSE4.2):

bench_httparse       ... bench:    405 ns/iter (+/- 23) = 1735 MB/s
bench_httparse_short ... bench:    62 ns/iter (+/- 1) = 1096 MB/s
bench_pico           ... bench:    492 ns/iter (+/- 11) = 1428 MB/s
bench_pico_short     ... bench:    72 ns/iter (+/- 3) = 944 MB/s 

Takeaways

kamyuentse commented 6 years ago

@seanmonstar Awesome! Just posting a pic related to the approach we used here; maybe it will be helpful later.

[image: simd]