Performance of `zlib-rs` compared to `zlib-ng`

c410-f3r commented 3 weeks ago

The discussion feature is not enabled in this repository so I will just post here.

In hopes of removing C-related dependencies from one of my projects, I measured the performance of zlib-rs and zlib-ng in a testing suite called "fuzzingclient" from the well-known https://github.com/crossbario/autobahn-python.

It turns out that zlib-ng was roughly 80% faster, scoring a simple arithmetic mean of 1450 against 2608 for zlib-rs. Most tests showed similar results, but some had very large discrepancies.

All files are available at https://filebin.net/p1h63q4zcy12s9hb.

folkertdev commented 3 weeks ago

I'll need to look into this in detail of course, but what sort of system are you on? This determines which SIMD instructions are used, and those really dominate performance.

c410-f3r commented 3 weeks ago

Tests were run with --release only. Not sure if other things like panic = 'abort' or lto = true will make a difference.

$ uname -a
Linux pc 6.8.0-35-generic #35-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 15:51:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/cpuinfo 
model name  : AMD Ryzen 9 5900X 12-Core Processor
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap

Congratulations on the project!

c410-f3r commented 3 weeks ago

Erhhh... Sorry...

You mentioned SIMD and I remembered that RUSTFLAGS='-C target-cpu=native' was missing.
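A minimal sketch of how to check which SIMD features a build was compiled with, which is what `-C target-cpu=native` changes. The function name `compile_time_simd_features` and the selection of features are illustrative assumptions, not part of zlib-rs or wtx.

```rust
/// Returns the subset of a few x86 SIMD-related features that were enabled
/// at compile time (for example through `-C target-cpu=native`).
/// The feature list is an illustrative selection, not what zlib-rs checks.
fn compile_time_simd_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    // `cfg!` is evaluated by the compiler, so this reflects build flags,
    // not the CPU the binary happens to run on.
    if cfg!(target_feature = "sse2") {
        enabled.push("sse2");
    }
    if cfg!(target_feature = "sse4.2") {
        enabled.push("sse4.2");
    }
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    if cfg!(target_feature = "pclmulqdq") {
        enabled.push("pclmulqdq");
    }
    enabled
}
```

On a default x86_64 build this should list only the baseline features such as `sse2`; with `target-cpu=native` on a CPU like the one above, `avx2` and the others would be expected to appear as well.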

Benchmarks were run again and the results are... good! zlib-rs scored 1378 and zlib-ng scored 1369, technically tied.
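For reference, the gap in both runs can be worked out with a small helper. This is a sketch that assumes the scores are durations where lower is better, as the thread implies; `percent_slower` is a hypothetical name.

```rust
/// Percentage by which `other` exceeds `baseline`.
/// Assumes both values are durations where lower is better.
fn percent_slower(baseline: f64, other: f64) -> f64 {
    (other / baseline - 1.0) * 100.0
}
```

With the first run's means, `percent_slower(1450.0, 2608.0)` is about 79.9, and with the second run's, `percent_slower(1369.0, 1378.0)` is about 0.66, which supports calling the second result a tie.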

Apart from target-cpu=native, the following profile settings were also applied.

[profile.benchmark]
codegen-units = 1
debug = false
debug-assertions = false
incremental = false
inherits = "release"
lto = true
opt-level = 3
overflow-checks = false
panic = 'abort'
rpath = false
strip = "symbols"

Files are available at https://filebin.net/uzqhwdqu2iuqix0j.

folkertdev commented 3 weeks ago

Excellent! Yes, I have RUSTFLAGS='-C target-cpu=native' in my .cargo/config.toml, so I often forget about it. We have so far mostly been benchmarking with large files (several MB), so we might get some value out of looking at smaller inputs. Also, it is surprising that performance takes such a hit: we should still do runtime SIMD detection that gets cached. It will be slower, but it should not be 2x slower.
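Cached runtime SIMD detection of the kind described here can be sketched roughly as follows. This is not the zlib-rs implementation: `has_avx2`, `sum_bytes`, and the two summing routines are hypothetical, and the "wide" path is a plain-Rust stand-in rather than real AVX2 intrinsics.

```rust
use std::sync::OnceLock;

// Ask the CPU at runtime whether AVX2 is available (x86 targets only).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detect_avx2() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

// On other architectures there is no AVX2 to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detect_avx2() -> bool {
    false
}

/// Returns whether AVX2 is available, running the detection at most once
/// and serving every later call from the cached value.
fn has_avx2() -> bool {
    static HAS_AVX2: OnceLock<bool> = OnceLock::new();
    *HAS_AVX2.get_or_init(detect_avx2)
}

/// Scalar fallback: sum the bytes one at a time.
fn sum_scalar(data: &[u8]) -> u64 {
    data.iter().map(|&b| u64::from(b)).sum()
}

/// Stand-in for a vectorized routine: processes 32-byte chunks,
/// but uses ordinary Rust instead of AVX2 intrinsics.
fn sum_wide(data: &[u8]) -> u64 {
    data.chunks(32)
        .map(|chunk| chunk.iter().map(|&b| u64::from(b)).sum::<u64>())
        .sum()
}

/// Dispatches to one of the two routines based on the cached detection.
fn sum_bytes(data: &[u8]) -> u64 {
    if has_avx2() {
        sum_wide(data)
    } else {
        sum_scalar(data)
    }
}
```

The per-call cost of this pattern is a cached load plus a branch, which is why runtime detection is expected to be somewhat slower than a build with the features enabled at compile time, but not dramatically so.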

How did you actually run these benchmarks? Is that an easy process to reproduce?

c410-f3r commented 3 weeks ago

Hum... It is a WebSocket benchmark suite, so network-related things can mess up profiling attempts.

But since you insist :)

git clone https://github.com/c410-f3r/wtx
cd wtx
.scripts/autobahn-fuzzingclient.sh
xdg-open .scripts/autobahn/reports/fuzzingclient/index.html

In wtx/Cargo.toml, change flate2 = { default-features = false, features = ["zlib-rs"], optional = true, version = "1.0" } to flate2 = { default-features = false, features = ["zlib-ng"], optional = true, version = "1.0" } or any other backend.
In .scripts/autobahn-fuzzingclient.sh, change -v .scripts/autobahn/fuzzingclient-min.json:/fuzzingclient.json:ro to -v .scripts/autobahn/fuzzingclient.json:/fuzzingclient.json:ro to test all compression-related cases.

That is a lot of work. You probably shouldn't bother now that we know how well zlib-rs performs compared to zlib-ng in this particular scenario.
