Open errantmind opened 3 years ago
I've also been experimenting with getting inlining to work across the FFI, and succeeded using Rust's `linker-plugin-lto`, clang-12, and lld-12. This improved the benchmarks for pico a little more and put both pico benchmarks in the lead, with the full pico benchmark hitting ~2900 MB/s vs httparse at 1751 MB/s on my ancient laptop.
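For reference, the cross-language LTO setup described here generally looks like the following. This is a sketch, not the exact config used above; the target triple and the `-12` tool names are assumptions for a Debian-style install, so adjust for your toolchain:

```toml
# ~/.cargo/config.toml -- sketch of cross-language (Rust <-> C) LTO
[target.x86_64-unknown-linux-gnu]
rustflags = [
    "-Clinker-plugin-lto",        # emit LLVM bitcode so LTO can inline across the FFI
    "-Clinker=clang-12",
    "-Clink-arg=-fuse-ld=lld",
]
```

The C side must then be compiled with the same clang major version and `-flto=thin` so the bitcode formats match.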
Ah yea good point. Originally httparse didn't have SIMD support either, so it was more similar.
I haven't looked all that far into it, but I'm interested in your thoughts on why Pico is faster. Is it doing some memory management tricks or something? I'm working on a pet project and am trying to figure out if I should just write it in C, or if there is a way to get comparable results with unsafe Rust.
How do you run the Rust benchmarks? Do you set the target CPU so it doesn't have to do runtime checks? https://rust-lang.github.io/packed_simd/perf-guide/target-feature/rustflags.html
I run these flags globally in my config.toml:

```toml
rustflags = ["-Ctarget-cpu=native", "-Ctarget-feature=+sse4.2"]
```
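For illustration, this is the kind of runtime dispatch that baking `-Ctarget-feature=+sse4.2` in at compile time lets the compiler resolve statically. A minimal sketch, not httparse's actual dispatch code:

```rust
// Returns which code path a runtime dispatcher would pick on this CPU.
fn pick_path() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        // Without -Ctarget-feature, SIMD-capable code must branch on a
        // runtime CPUID check like this one.
        if is_x86_feature_detected!("sse4.2") {
            return "sse4.2";
        }
    }
    "scalar"
}

fn main() {
    // With the feature guaranteed by the target, the compiler can prove
    // the branch and inline the SIMD path unconditionally.
    println!("dispatch: {} path", pick_path());
}
```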
I'm going to dump some info here for reproducibility purposes
The speed improvements came primarily from two areas, both of which involved modifying the underlying Pico bindings crate (adding `-msse4` and `-flto=thin` to the cc compile command).

Steps to reproduce:

1. Run `rustc --version --verbose`. Host information needed later.
2. `sudo apt-get install clang-11 lld-11`
3. `export CC=/usr/bin/clang-11` (modify this location as needed by your dist)
4. Edit `~/.cargo/config.toml`. Use the host information above. For me this is:

   ```toml
   [target.x86_64-unknown-linux-gnu]
   rustflags = [
       "-Ctarget-cpu=native",
       "-Clink-arg=-fuse-ld=lld",
       "-Clinker=clang-11",
   ]
   ```

5. `cargo clean && rm Cargo.lock`
6. `cargo bench` in the benchmark crate

Full cc command from the Pico bindings crate:
```rust
cc::Build::new()
    .file("extern/picohttpparser/picohttpparser.c")
    .opt_level_str("fast")
    .flag("-funroll-loops")
    .flag("-msse4")
    .flag("-flto=thin")
    .flag("-march=native")
    .compile("libpicohttpparser.a");
```
Updated the above comment, as the steps it described were incorrect. The above steps now work as expected. Here are the results of my latest test:
Alright, the adventure is coming to an end with this final update:
- Rust passes the `linker-plugin-lto` flag automatically when building certain kinds of crates, like my sys crate in this example. This can be verified by passing verbose (i.e. `cargo build --release --verbose`).
- Not all dependencies need the `linker-plugin-lto` flag, just the sys crate (which is automatic). If all dependencies are built with `linker-plugin-lto`, there is actually a loss of about 5% performance.
- The global config (`~/.cargo/config.toml`) is overwritten by setting `RUSTFLAGS` or by a project-level config (`<project>/.cargo/config.toml`).
- `clang-12` is significantly (~5%) faster than `clang-11` for the pico tests (for some unknown reason). `clang-13` (dev build), so far, is not significantly faster than `clang-12`.
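As a side note on that config precedence, a project-level override can be sketched like this; the flags shown are examples, and `<project>` stands for your own crate root:

```toml
# <project>/.cargo/config.toml -- takes precedence over ~/.cargo/config.toml;
# note that setting the RUSTFLAGS environment variable replaces the rustflags
# from both files rather than merging with them
[target.x86_64-unknown-linux-gnu]
rustflags = ["-Ctarget-cpu=native", "-Clinker-plugin-lto"]
```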
Hello, first, thanks for making this tool.
I wanted to point out that your benchmark is a bit unfair, as it compares httparse with sse4 against picohttpparser without sse4. The reason picohttpparser doesn't have sse4 is that your dependency 'pico-sys' does not compile picohttpparser with sse4 enabled.
Your benchmark showed a ~60% improvement in performance for 'bench_pico' once sse4 was enabled in the underlying crate.
I forked the underlying crate 'pico-sys' and made a few modifications, if you want to verify my results:
https://github.com/errantmind/rust-pico-sys