rust-lang / flate2-rs

DEFLATE, gzip, and zlib bindings for Rust
https://docs.rs/flate2
Apache License 2.0
891 stars 158 forks

Extremely slow performance in debug mode with default backend #297

Closed edmorley closed 1 year ago

edmorley commented 2 years ago

Hi!

In a particular project, I use flate2 to decompress a ~50MB gzipped tarfile.

Whilst in production the project will be built in release mode, the integration tests are performed using debug builds, and when iterating locally when developing, I use debug builds too.

In addition, due to the nature of the project (a Cloud Native Buildpack that's targeting x86_64 Linux), these integration tests and any manual testing have to run inside an x86_64 Docker container. After recently obtaining a new MacBook Pro M1 Max (which has to use Docker's qemu emulation for x86_64 Docker images), I was surprised to see the integration tests take considerably longer than they used to on my much older machine.

Investigating, it turns out that when using the default flate2 backend of miniz_oxide with the testcase below, debug builds are dramatically slower than release builds (see the results table).

In contrast, when using the zlib or zlib-ng-compat backends, debug builds are only 2-4x slower than release builds.

Whilst debug builds are expected to be slower than release builds, I was quite surprised that they were 30-60x slower for this crate using the default backend.

I'm presuming there's not much that can be done to improve the performance of miniz_oxide in debug builds; however, I was wondering whether it would be worth mentioning the performance difference in this crate's docs, particularly given that: (a) switching backends makes such a difference here, and (b) the docs currently suggest that the default backend is mostly "good enough" (otherwise I would have tried another backend sooner):

There’s various tradeoffs associated with each implementation, but in general you probably won’t have to tweak the defaults.

(from https://docs.rs/flate2/latest/flate2/#implementation)

It was only later that I noticed the section in the README (which isn't shown on docs.rs) that seems to imply the zlib-ng backend is actually faster: https://github.com/rust-lang/flate2-rs#backends

Testcase:

use flate2::read::GzDecoder;
use std::fs::File;

fn main() -> Result<(), std::io::Error> {
    // Archive is from:
    // https://heroku-buildpack-python.s3.amazonaws.com/heroku-20/runtimes/python-3.10.3.tar.gz
    let archive = File::open("python-3.10.3.tar.gz")?;
    let mut destination = tempfile::tempfile()?;
    let mut decoder = GzDecoder::new(archive);
    std::io::copy(&mut decoder, &mut destination)?;

    Ok(())
}

Cargo.toml:

[package]
name = "testcase-flate2-debug"
version = "0.1.0"
edition = "2021"

[dependencies]
# For default backend
flate2 = "1.0.22"
# For alternate backends
# flate2 = { version = "1.0.22", features = ["zlib-ng-compat"], default-features = false }
# flate2 = { version = "1.0.22", features = ["zlib"], default-features = false }
tempfile = "3.3.0"

Results:

| Backend | Architecture | Wall time (release build) | Wall time (debug build) | Debug slowdown |
|---|---|---|---|---|
| miniz_oxide (default) | Native ARM64 | 0.69s | 21.55s | 31x |
| miniz_oxide (default) | AMD64 under qemu | 3.41s | 207s | 60x |
| zlib | Native ARM64 | 0.65s | 1.26s | 1.9x |
| zlib | AMD64 under qemu | 2.19s | 9.22s | 4.2x |
| zlib-ng-compat | Native ARM64 | 0.55s | 1.43s | 2.6x |
| zlib-ng-compat | AMD64 under qemu | ??? | ??? | ??? |

(The missing timings for zlib-ng-compat under qemu are due to cross-compilation of zlib-ng currently failing: https://github.com/rust-lang/libz-sys/issues/93)

oyvindln commented 2 years ago

Yeah, Rust in debug mode is going to be much, much slower than anything written in C, due to the nature of the languages. (And I'm not sure whether the system zlib will even be built in debug/no-optimization mode.)

Turning on the first level of optimizations in debug mode may help a fair bit. There may also be other workarounds, such as avoiding compiling all dependencies in debug mode, or using different optimization levels for the main project versus its dependencies, but I'm not sure.

edmorley commented 2 years ago

I wasn't able to get perf working inside a QEMU'd Docker container (due to "PERF_FLAG_FD_CLOEXEC not implemented" errors), so unfortunately I couldn't profile the 207s worst case.

However, this is a flamegraph for a native ARM64 debug build (the 21.55s entry in the table above). (It has to be downloaded for the interactivity to work; that is disabled when hosted on GitHub.)

flamegraph-debug-native-arm64

As can be seen, 77% of the profile is spent in Adler32::compute(): https://github.com/jonas-schievink/adler/blob/a94f525f62698d699d1fb3cc9112db8c35662b16/src/algo.rs#L5-L107

Of that, 60% of the total profile is within the implementation of AddAssign<Self> for U32X4 (used from Adler32::compute()): https://github.com/jonas-schievink/adler/blob/a94f525f62698d699d1fb3cc9112db8c35662b16/src/algo.rs#L124-L130
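For reference, Adler-32 (defined in RFC 1950) is just two running sums modulo 65521. A minimal scalar sketch looks like the following; note this is illustrative only, and not the `adler` crate's actual implementation, which processes chunks through the `U32X4` wrapper linked above so that release builds can vectorize the loop:

```rust
// Minimal scalar Adler-32 sketch (illustrative only, not the `adler` crate's
// chunked implementation). `a` is 1 plus the sum of all bytes; `b` is the sum
// of the intermediate values of `a`; both are kept modulo 65521.
fn adler32(data: &[u8]) -> u32 {
    const MOD: u32 = 65521; // largest prime below 2^16
    let (mut a, mut b) = (1u32, 0u32);
    for &byte in data {
        a = (a + u32::from(byte)) % MOD;
        b = (b + a) % MOD;
    }
    (b << 16) | a
}

fn main() {
    assert_eq!(adler32(b""), 1); // checksum of empty input is defined as 1
    assert_eq!(adler32(b"Wikipedia"), 0x11E6_0398); // well-known test vector
    println!("ok");
}
```

In a release build a per-byte loop like this (or the crate's chunked variant) gets vectorized and its checks optimized away, whereas a debug build executes every add and modulo individually, which would be consistent with this function dominating the flamegraph.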

messense commented 2 years ago

You can override opt-level for certain crates in debug mode (see https://doc.rust-lang.org/cargo/reference/profiles.html#overrides); adding the following to Cargo.toml should make it faster.

[profile.dev.package.miniz_oxide]
opt-level = 3
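If the adler crate (where most of the profile above is spent) should be covered too, the same profiles chapter also documents a wildcard form that optimizes every dependency while keeping your own crate's code unoptimized; a sketch:

```toml
# Optimize all dependencies in dev builds, but keep the workspace's own
# code unoptimized for fast compiles and debugging.
[profile.dev.package."*"]
opt-level = 3
```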

JohnTitor commented 1 year ago

Closing as this is more of a Rust issue rather than a flate2-specific one.