rust-lang / flate2-rs

DEFLATE, gzip, and zlib bindings for Rust
https://docs.rs/flate2
Apache License 2.0

question: Slowdown after upgrading from 1.0.26 to 1.0.28 #395

Closed · ivpavici closed this issue 10 months ago

ivpavici commented 10 months ago

Hello!

This will be more of a question, sorry in advance if something is missing, but I'm not sure what details might help!

Anyway, we're not sure why we're experiencing a slowdown (more than 2x) while running tests in our project locally after bumping flate2 from 1.0.26 to 1.0.28. Everything else in the project stays the same!

Cargo.toml: flate2 = { version = "1.0.28" }

Cargo.lock:

[[package]]
name = "flate2"
version = "1.0.28"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "46303f565772937ffe1d394a4fac6f411c6013172fadde9dcdb1e147a086940e"
dependencies = [
 "crc32fast",
 "miniz_oxide",
]

The problem is that when we run the same tests on GitHub Actions there is no slowdown between these versions!

I'm personally running the tests on Ubuntu under WSL2, and I can reproduce this behavior every time I switch between versions!

### More information needed
- [ ] Consider trying different backends to see if it reproduces there as well
- [ ] Bisect with local clone to find offending commit
- [ ] See if a fix is possible

Byron commented 10 months ago

This configuration means that miniz_oxide, a pure Rust implementation, is used as the backend. The miniz_oxide dependency hasn't changed in a while, so the cause is likely either something in flate2 itself or a difference in the Rust compiler (unless you say that wasn't upgraded either).
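For comparison, flate2 selects its backend via cargo features, so one quick check is to disable the default features and enable an alternative backend. A minimal sketch of what that could look like in Cargo.toml (assuming the zlib-ng feature available in recent flate2 releases; this snippet is not from the thread itself):

```toml
# Sketch: opt out of the default miniz_oxide (pure Rust) backend and try
# zlib-ng instead, to check whether the slowdown reproduces there as well.
[dependencies]
flate2 = { version = "1.0.28", default-features = false, features = ["zlib-ng"] }
```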

In any case, I recommend trying 1.0.27 as well, and once the first good and the first bad tag are known, I'd use a local clone of flate2 and set it up to be used with

[patch."crates-io"]
flate2 = { path = "/path/to/clone" }

Then git bisect can be used in the flate2 clone to find the commit that introduced the issue. With zlib-ng I didn't notice any issue on my side, but miniz_oxide is certainly different enough.

Since this issue can't be reproduced here, I recommend closing it and, once you find a fix that works for you, considering submitting it as a PR.

ivpavici commented 10 months ago

Thank you @Byron!

We have identified why there is a slowdown locally but not on GitHub Actions: locally I ran my project with just cargo run, and that doesn't seem to affect version 1.0.26.

On GitHub, however, the package is built with the --release flag (and runs inside Docker). This is where 1.0.28 differs: it is 2 to 3 times slower when run with plain cargo run instead of cargo run --release.

Locally, running both versions with --release results in the same execution time:

cargo run --release with 1.0.26 Time: 268.679 s, estimated 944 s

cargo run --release with 1.0.28 Time: 267.24 s, estimated 269 s

Byron commented 10 months ago

Thanks for digging in and for solving the puzzle :). I am glad it isn't anything more serious.

jongiddy commented 10 months ago

A good candidate for the cause of the slowdown is #373. That PR explicitly prioritized correctness over performance, so it is reassuring that the release-mode times show no change.

Hexta commented 9 months ago

Hey, unfortunately #373 caused issues in release mode as well. One related issue is https://github.com/vectordotdev/vector/issues/19981.

Byron commented 9 months ago

Thanks for bringing this to my attention!

I took a look at the flamegraph in https://github.com/vectordotdev/vector/issues/19981 and noticed that the runtime is dominated by calls to an unaligned memset function. This is probably caused by the resize(cap, 0) call, which fills the output vec with zeroes and which didn't happen before the change in #373. However, it also suggests that the output vector isn't reused as much as it perhaps should be, or that its length is repeatedly set back below its capacity, so the zeroes are written over and over again.
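To illustrate that pattern, here is a standalone sketch (not flate2's actual code): a Vec whose length is repeatedly truncated and then resized back up to its capacity gets the unused part zero-filled, i.e. memset, on every iteration.

```rust
// Standalone sketch of the re-zeroing pattern; not flate2 code, just the
// Vec behaviour that would explain a memset-dominated flamegraph.
fn main() {
    let cap = 32 * 1024;
    let mut buf: Vec<u8> = Vec::with_capacity(cap);
    for _ in 0..10_000 {
        // Zero-fills the vector up to `cap`; this is where memset shows up.
        buf.resize(cap, 0);
        // ...compression writes only a handful of bytes into `buf`...
        // The length is then set back below capacity, so the next
        // `resize` has to zero-fill the same region all over again.
        buf.truncate(16);
    }
}
```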

Something I couldn't make sense of is this claim in the vector issue:

> After version v0.34.0, the memory allocation system call count is very high when using GzEncoder (I also confirm that this call stack occurs when I use zlib compression rather than gzip)

The change in #373 definitely does not allocate (or deallocate). The only change really is that it can fill unused capacity with zeroes the first time.

Thus I think this is more likely an issue with how output vectors are used during compression in vector, one that has now become more prominent. This comment seems to indicate something similar as well.

I'd wait and see how vector ends up solving the issue, in order to figure out whether anything can be done here to prevent issues of the same kind in the future.

Byron commented 9 months ago

The current implementation indeed suffers from a performance degradation when exposed to many small writes during compression.

The problem is that the internal buffer is created with 32 KiB of capacity, but with each small write it is memset to its full capacity, only to be truncated right after to what was actually written.

These calls accumulate into something very costly. The usual solution is to avoid many small writes by buffering them, so in a way leaving this as-is seems like a net positive: it reveals problematic usage patterns while being easy to fix in the caller's code, as sketched below.
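A minimal sketch of that buffering workaround on the caller's side, assuming the writer-based gzip API (flate2::write::GzEncoder): wrapping the encoder in a std::io::BufWriter coalesces many small writes into larger ones before they reach the encoder's internal buffer.

```rust
use std::io::{BufWriter, Write};

use flate2::write::GzEncoder;
use flate2::Compression;

// Sketch: batch many small writes in a BufWriter so the encoder sees
// fewer, larger writes and its internal buffer is re-zeroed less often.
fn compress_small_chunks(chunks: &[&[u8]]) -> std::io::Result<Vec<u8>> {
    let encoder = GzEncoder::new(Vec::new(), Compression::default());
    let mut writer = BufWriter::new(encoder);
    for chunk in chunks {
        writer.write_all(chunk)?;
    }
    // Flush the BufWriter, get the encoder back, and finish the gzip stream.
    let encoder = writer.into_inner().map_err(|e| e.into_error())?;
    encoder.finish()
}
```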

Shatur commented 6 months ago

I also noticed the performance regression when updating from 1.0.27 to 1.0.28. In my case release mode doesn't help, and it's ~100 times slower...