
gzp spawns one fewer thread than there are CPUs, which hurts performance #11

Closed Shnatsel closed 3 years ago

Shnatsel commented 3 years ago

I've run some tests comparing crabz to pigz using the benchmarking setup described in the crabz readme. On a 4-core system with no hyperthreading, crabz was measurably slower.

I've profiled both using perf and it turned out that crabz spends the vast majority of the time in zlib compression, so parallelization overhead is not an issue. However, crabz only spawned 3 threads performing compression while pigz spawned 4 compression threads. After passing -p3 to pigz so that it would only spawn 3 compression threads, the compression time became identical to crabz.

I suspect this is also why you're not seeing any parallelization gains on dual-core systems.

Technical details

crabz profile: https://share.firefox.dev/3zeVRxN
pigz profile: https://share.firefox.dev/2WeYe4V

crabz installed via cargo install crabz on a clean Ubuntu 20.04 installation, pigz installed via apt.

$ hyperfine 'crabz -c 3 < /media/elementary/ssd/large-file.txt' 'pigz -3 < /media/elementary/ssd/large-file.txt'
Benchmark #1: crabz -c 3 < /media/elementary/ssd/large-file.txt
  Time (mean ± σ):      4.642 s ±  0.351 s    [User: 11.326 s, System: 0.196 s]
  Range (min … max):    4.312 s …  5.465 s    10 runs

Benchmark #2: pigz -3 < /media/elementary/ssd/large-file.txt
  Time (mean ± σ):      3.884 s ±  0.253 s    [User: 14.307 s, System: 0.167 s]
  Range (min … max):    3.556 s …  4.248 s    10 runs

Summary
  'pigz -3 < /media/elementary/ssd/large-file.txt' ran
    1.20 ± 0.12 times faster than 'crabz -c 3 < /media/elementary/ssd/large-file.txt'
sstadick commented 3 years ago

Thanks for the detailed issue!

I think this comes down to accounting and the fact that crabz has not had threading tuned at all.

Looking at pigz, it oversubscribes threads (https://github.com/madler/pigz/blob/b6da942b9ca15eb9149837f07b2b3b6ff21d9845/pigz.c#L2206): it will spawn as many compression threads as there are cores, plus a writer thread and the main thread.

gzp accounts for the writer thread, subtracting 1 from the number of threads it is allowed to spawn for compression. I'm not sure what the best way to represent that is, or whether just hiding the writer thread from gzp's thread accounting would be more helpful to end users, since then you can oversubscribe like pigz.

Do you have any thoughts on what would make the most sense for gzp as an end user?

What I'm thinking at the moment is that I'll change the documentation / function names so that num_threads explicitly sets the number of compression threads, and note that an additional thread is used for writing. I think that will better match users' expectations when setting the number of threads.
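For illustration, that accounting might look like the sketch below (hypothetical names, not gzp's actual API; assumes the num_cpus crate is available):

// Hypothetical sketch of the proposed accounting: num_threads counts
// compression threads only, and the writer thread is an extra on top.
struct CompressorBuilder {
    num_threads: usize,
}

impl CompressorBuilder {
    fn new() -> Self {
        // pigz-style default: one compression thread per logical CPU,
        // oversubscribing by the writer and main threads.
        Self { num_threads: num_cpus::get() }
    }

    /// Explicitly sets the number of *compression* threads.
    fn num_threads(mut self, n: usize) -> Self {
        self.num_threads = n.max(1);
        self
    }

    /// Total OS threads this accounting would spawn: the compression
    /// threads plus the dedicated writer thread.
    fn total_threads(&self) -> usize {
        self.num_threads + 1
    }
}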

This branch of crabz is using a branch of gzp where I've made the above changes and also added an option to explicitly set the number of compression threads. If you still have your env set up and don't mind giving it a try, I'd be curious to see if that evens out the performance.

Shnatsel commented 3 years ago

The threading_like_pigz branch is currently identical to main; could you make sure you've pushed the changes to GitHub?

The profile linked above shows that the main thread and writer thread together do not occupy an entire core, so one core out of 4 ends up being mostly idle in my configuration. crc32 and writing are very fast, it seems. I believe it's best to match the number of compression threads to the number of CPUs, like pigz does already.

sstadick commented 3 years ago

Ah! Sorry about that, changes have been pushed.

Shnatsel commented 3 years ago

I don't have that exact setup anymore, but I've tried it on the same machine with a different Linux OS and the results are inconclusive: having more threads seems to help at higher compression levels but hinder at lower ones. Benchmarks, with the same Shakespeare file repeated 100 times:

shnatsel@shnatsel-desktop ~/Code> hyperfine 'crabz/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt' 'crabz-new/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):     18.458 s ±  0.240 s    [User: 45.843 s, System: 0.236 s]
  Range (min … max):   17.991 s … 18.732 s    10 runs

Benchmark #2: crabz-new/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):     16.890 s ±  0.217 s    [User: 48.440 s, System: 0.264 s]
  Range (min … max):   16.672 s … 17.254 s    10 runs

Summary
  'crabz-new/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt' ran
    1.09 ± 0.02 times faster than 'crabz/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt'
shnatsel@shnatsel-desktop ~/Code> hyperfine 'crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt' 'crabz-new/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      4.289 s ±  0.060 s    [User: 11.144 s, System: 0.191 s]
  Range (min … max):    4.135 s …  4.353 s    10 runs

Benchmark #2: crabz-new/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      4.781 s ±  0.074 s    [User: 12.626 s, System: 0.214 s]
  Range (min … max):    4.674 s …  4.871 s    10 runs

Summary
  'crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt' ran
    1.11 ± 0.02 times faster than 'crabz-new/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt'

pigz 2.4 from Ubuntu 18.04 repos is still faster:

shnatsel@shnatsel-desktop ~/Code> hyperfine 'crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt' 'pigz -3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      4.342 s ±  0.102 s    [User: 11.203 s, System: 0.179 s]
  Range (min … max):    4.168 s …  4.466 s    10 runs

Benchmark #2: pigz -3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      3.491 s ±  0.041 s    [User: 13.664 s, System: 0.130 s]
  Range (min … max):    3.444 s …  3.569 s    10 runs

Summary
  'pigz -3 < /media/shnatsel/ssd/large-file.txt' ran
    1.24 ± 0.03 times faster than 'crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt'
Shnatsel commented 3 years ago

Perhaps the difference in performance comes down to the differences in the underlying zlib implementation? Is there a flag for crabz that I could use to force using a single compression thread?

sstadick commented 3 years ago

Weird. Here are my results, not in a thread-limited environment, but limiting with flags:

❯ hyperfine './target/release/crabz -c 3 -p 4 < ../gzp/bench-data/shakespeare.txt' 'pigz -3 -p 4 < ../gzp/bench-data/shakespeare.txt'
Benchmark #1: ./target/release/crabz -c 3 -p 4 < ../gzp/bench-data/shakespeare.txt
  Time (mean ± σ):      2.043 s ±  0.020 s    [User: 7.577 s, System: 0.156 s]
  Range (min … max):    2.017 s …  2.069 s    10 runs

Benchmark #2: pigz -3 -p 4 < ../gzp/bench-data/shakespeare.txt
  Time (mean ± σ):      2.981 s ±  0.014 s    [User: 12.209 s, System: 0.226 s]
  Range (min … max):    2.957 s …  3.007 s    10 runs

Summary
  './target/release/crabz -c 3 -p 4 < ../gzp/bench-data/shakespeare.txt' ran
    1.46 ± 0.02 times faster than 'pigz -3 -p 4 < ../gzp/bench-data/shakespeare.txt'

That's crabz off the same branch I linked before; I pushed one new commit that allows gzp to go down to 1 thread. There is now a -p flag for crabz that specifies the number of compression threads it can use (run cargo update to clear the git cache for cargo).

Results for -p1

Benchmark #1: ./target/release/crabz -c 3 -p 1 < ../gzp/bench-data/shakespeare.txt
  Time (mean ± σ):      6.771 s ±  0.058 s    [User: 7.207 s, System: 0.126 s]
  Range (min … max):    6.681 s …  6.851 s    10 runs

Benchmark #2: pigz -3 -p 1 < ../gzp/bench-data/shakespeare.txt
  Time (mean ± σ):     11.420 s ±  0.158 s    [User: 11.336 s, System: 0.079 s]
  Range (min … max):   11.079 s … 11.605 s    10 runs

Summary
  './target/release/crabz -c 3 -p 1 < ../gzp/bench-data/shakespeare.txt' ran
    1.69 ± 0.03 times faster than 'pigz -3 -p 1 < ../gzp/bench-data/shakespeare.txt'

Even giving both crabz and pigz all my threads (32), on compression levels 3 and 9 crabz is 10-20% faster.

The default zlib library for crabz / gzp is zlib-ng; switching the feature flag to deflate_zlib or deflate_rust leads to no performance change with -p 4 for me.

So to get apples-to-apples (ish? does pigz link to system zlib?) zlib, change the gzp dep in crabz to:

gzp = { git = "https://github.com/sstadick/gzp", branch = "feature/allow_oversubscribed_writer", default-features = false, features = ["deflate_zlib"] }

I can't imagine that it matters that much, but what version of Rust are you running? How did you install pigz? I am also running 2.4, installed via apt on Ubuntu 20.04.

Shnatsel commented 3 years ago

Right now I'm comparing with pigz 2.4 installed via apt on Ubuntu 18.04.

For crabz I use a git checkout and then cargo build --release. I was using the Cargo.lock from the crabz repo up until now, but had to run cargo update to pull in the new version of gzp.

> rustc --version --verbose 
rustc 1.52.1 (9bc8c42bb 2021-05-09)
binary: rustc
commit-hash: 9bc8c42bb2f19e745a63f3445f1ac248fb015e53
commit-date: 2021-05-09
host: x86_64-unknown-linux-gnu
release: 1.52.1
LLVM version: 12.0.0

Overcommit seems to help my 4-core system, but just barely:

> hyperfine 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' 'crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      4.706 s ±  0.069 s    [User: 12.399 s, System: 0.201 s]
  Range (min … max):    4.619 s …  4.796 s    10 runs

Benchmark #2: crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      4.894 s ±  0.072 s    [User: 11.943 s, System: 0.198 s]
  Range (min … max):    4.804 s …  5.025 s    10 runs

Summary
  'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' ran
    1.04 ± 0.02 times faster than 'crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt'

In single-threaded mode crabz seems to run much faster than pigz:

> hyperfine 'crabz-new/target/release/crabz -p1 -c3 < /media/shnatsel/ssd/large-file.txt' 'pigz -p1 -3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p1 -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      8.735 s ±  0.153 s    [User: 9.703 s, System: 0.217 s]
  Range (min … max):    8.609 s …  9.120 s    10 runs

Benchmark #2: pigz -p1 -3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):     12.968 s ±  0.294 s    [User: 12.896 s, System: 0.066 s]
  Range (min … max):   12.757 s … 13.517 s    10 runs

Summary
  'crabz-new/target/release/crabz -p1 -c3 < /media/shnatsel/ssd/large-file.txt' ran
    1.48 ± 0.04 times faster than 'pigz -p1 -3 < /media/shnatsel/ssd/large-file.txt'

But pigz overtakes crabz when using 4 threads:

> hyperfine 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' 'pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      4.682 s ±  0.106 s    [User: 12.331 s, System: 0.196 s]
  Range (min … max):    4.512 s …  4.848 s    10 runs

Benchmark #2: pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      3.526 s ±  0.058 s    [User: 13.691 s, System: 0.142 s]
  Range (min … max):    3.457 s …  3.643 s    10 runs

Summary
  'pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt' ran
    1.33 ± 0.04 times faster than 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt'

Removing overcommit hurts performance slightly in the case of crabz and significantly in the case of pigz:

> hyperfine 'crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt' 'pigz -p3 -3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      4.835 s ±  0.107 s    [User: 11.788 s, System: 0.193 s]
  Range (min … max):    4.642 s …  4.987 s    10 runs

Benchmark #2: pigz -p3 -3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      4.743 s ±  0.117 s    [User: 14.478 s, System: 0.173 s]
  Range (min … max):    4.586 s …  4.955 s    10 runs

Summary
  'pigz -p3 -3 < /media/shnatsel/ssd/large-file.txt' ran
    1.02 ± 0.03 times faster than 'crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt'

I've tried the Rust backend instead of zlib-ng and saw the exact same performance in both single-threaded and multi-threaded mode.

So I guess the actionable takeaways are:

  1. The overhead of crabz appears to be higher than that of pigz on my system, since the difference between 3 and 4 threads is so pronounced for pigz but barely exists for crabz. A profile of where crabz spends the time can be found here.
  2. Perhaps gzp should default to the 100% safe Rust backend for flate2, since performance is the same anyway.

I'll test a dual-core system next and see if 1 or 2 threads works best there.

sstadick commented 3 years ago

I just pushed a new commit to gzp. I realized that when I "fixed" num_threads to mean just compression threads, I didn't re-adjust the queue sizes, which are all based on the number of threads. So the queues were allowing very little buffer to build up. Instead of using num_threads as the queue size, I've made the queues 2*num_threads, which is the same as pigz.

This gave an appreciable performance bump on my system.

Building after cargo update should pull it in.
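For reference, the sizing change amounts to something like this sketch (assumed names, using a flume bounded channel since that's what gzp used at the time):

// Sketch only: size the block queue at 2 * num_threads, as pigz does,
// instead of the previous capacity of num_threads.
fn make_queue<T>(num_threads: usize) -> (flume::Sender<T>, flume::Receiver<T>) {
    // With capacity == num_threads, almost no buffer of pending blocks
    // could build up ahead of the compressors; doubling it gives the
    // reader room to stay ahead.
    flume::bounded(2 * num_threads)
}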

I agree on point 2, though. Now that work is getting to the compressors faster, I want to re-test whether the zlib library makes a difference, but if the gap is narrow enough I'd rather have an all-Rust backend.

Thanks for sharing the profile info, looking at that now.

Shnatsel commented 3 years ago

On my quad-core Ryzen overcommit is a toss-up. However, preliminary results indicate that having 2 compression threads on a dual-core system increases performance dramatically. I'll post the full dual-core results shortly.

Full timings from the quad-core Ryzen with the buffer size changes:

> hyperfine -w3 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' 'crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt' 'crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt' 'pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt'; hyperfine 'crabz-new/target/release/crabz -p4 -c9 < /media/shnatsel/ssd/large-file.txt' 'crabz-new/target/release/crabz -p3 -c9 < /media/shnatsel/ssd/large-file.txt' 'crabz/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt' 'pigz -p4 -9 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      4.396 s ±  0.061 s    [User: 12.319 s, System: 0.168 s]
  Range (min … max):    4.305 s …  4.483 s    10 runs

Benchmark #2: crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      4.291 s ±  0.158 s    [User: 11.156 s, System: 0.192 s]
  Range (min … max):    4.112 s …  4.616 s    10 runs

Benchmark #3: crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      4.386 s ±  0.096 s    [User: 11.316 s, System: 0.174 s]
  Range (min … max):    4.269 s …  4.572 s    10 runs

Benchmark #4: pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      3.527 s ±  0.016 s    [User: 13.820 s, System: 0.136 s]
  Range (min … max):    3.496 s …  3.552 s    10 runs

Summary
  'pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt' ran
    1.22 ± 0.05 times faster than 'crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt'
    1.24 ± 0.03 times faster than 'crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt'
    1.25 ± 0.02 times faster than 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p4 -c9 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):     16.543 s ±  0.262 s    [User: 48.818 s, System: 0.196 s]
  Range (min … max):   16.094 s … 16.908 s    10 runs

Benchmark #2: crabz-new/target/release/crabz -p3 -c9 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):     17.457 s ±  0.174 s    [User: 44.624 s, System: 0.207 s]
  Range (min … max):   17.154 s … 17.742 s    10 runs

Benchmark #3: crabz/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):     17.493 s ±  0.129 s    [User: 44.671 s, System: 0.197 s]
  Range (min … max):   17.338 s … 17.697 s    10 runs

Benchmark #4: pigz -p4 -9 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):     15.614 s ±  0.070 s    [User: 61.647 s, System: 0.168 s]
  Range (min … max):   15.521 s … 15.754 s    10 runs

Summary
  'pigz -p4 -9 < /media/shnatsel/ssd/large-file.txt' ran
    1.06 ± 0.02 times faster than 'crabz-new/target/release/crabz -p4 -c9 < /media/shnatsel/ssd/large-file.txt'
    1.12 ± 0.01 times faster than 'crabz-new/target/release/crabz -p3 -c9 < /media/shnatsel/ssd/large-file.txt'
    1.12 ± 0.01 times faster than 'crabz/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt'
Shnatsel commented 3 years ago

Having 2 compression threads instead of 1 seems to be greatly beneficial on a dual-core system.

On a dual-core AMD Stoney Ridge system, crabz with 2 compression threads beats pigz by a large margin:

hyperfine -w3 'target/release/crabz -p1 -c3 < ~/shakespeare_50_times.txt' 'target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' 'pigz -p2 -3 < ~/shakespeare_50_times.txt'; hyperfine 'target/release/crabz -p1 -c9 < ~/shakespeare_50_times.txt' 'target/release/crabz -p2 -c9 < ~/shakespeare_50_times.txt' 'pigz -p2 -9 < ~/shakespeare_50_times.txt'
Benchmark #1: target/release/crabz -p1 -c3 < ~/shakespeare_50_times.txt
  Time (mean ± σ):     12.761 s ±  0.250 s    [User: 13.368 s, System: 0.323 s]
  Range (min … max):   12.391 s … 12.958 s    10 runs

Benchmark #2: target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt
  Time (mean ± σ):      8.483 s ±  0.060 s    [User: 15.180 s, System: 0.456 s]
  Range (min … max):    8.415 s …  8.604 s    10 runs

Benchmark #3: pigz -p2 -3 < ~/shakespeare_50_times.txt
  Time (mean ± σ):     10.901 s ±  0.010 s    [User: 21.267 s, System: 0.363 s]
  Range (min … max):   10.885 s … 10.914 s    10 runs

Summary
  'target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' ran
    1.29 ± 0.01 times faster than 'pigz -p2 -3 < ~/shakespeare_50_times.txt'
    1.50 ± 0.03 times faster than 'target/release/crabz -p1 -c3 < ~/shakespeare_50_times.txt'
Benchmark #1: target/release/crabz -p1 -c9 < ~/shakespeare_50_times.txt
  Time (mean ± σ):     55.027 s ±  0.527 s    [User: 55.533 s, System: 0.335 s]
  Range (min … max):   54.150 s … 56.303 s    10 runs

Benchmark #2: target/release/crabz -p2 -c9 < ~/shakespeare_50_times.txt
  Time (mean ± σ):     35.056 s ±  0.305 s    [User: 60.079 s, System: 0.564 s]
  Range (min … max):   34.556 s … 35.766 s    10 runs

Benchmark #3: pigz -p2 -9 < ~/shakespeare_50_times.txt
  Time (mean ± σ):     50.521 s ±  1.243 s    [User: 98.581 s, System: 0.398 s]
  Range (min … max):   49.524 s … 52.373 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  'target/release/crabz -p2 -c9 < ~/shakespeare_50_times.txt' ran
    1.44 ± 0.04 times faster than 'pigz -p2 -9 < ~/shakespeare_50_times.txt'
    1.57 ± 0.02 times faster than 'target/release/crabz -p1 -c9 < ~/shakespeare_50_times.txt'
Shnatsel commented 3 years ago

Here's a profile of the latest code on my 4-core machine with 4 threads: https://share.firefox.dev/2WnspHl

I've also enabled debug info in release mode to make the profile more detailed.

sstadick commented 3 years ago

Weird. I'm not sure what else to try at the moment to figure out why -p3 is faster than -p4 on your quad-core. It's encouraging that on the dual-core your numbers look more like what I've been seeing.

Shnatsel commented 3 years ago

-p3 is only faster at low compression levels; for high compression, -p4 is faster.

As to why, I see that pigz reports more user time than crabz:

> time target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt > /dev/null
Aug 25 22:17:52.560  INFO crabz: Compressing with 4 threads at compression level 3.
11.87user 0.15system 0:04.40elapsed 273%CPU (0avgtext+0avgdata 10076maxresident)k
0inputs+0outputs (0major+3898minor)pagefaults 0swaps
> time target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt > /dev/null
Aug 25 22:18:03.095  INFO crabz: Compressing with 3 threads at compression level 3.
10.11user 0.19system 0:03.71elapsed 277%CPU (0avgtext+0avgdata 8580maxresident)k
0inputs+0outputs (0major+7666minor)pagefaults 0swaps
> time pigz -p3 -3 < /media/shnatsel/ssd/large-file.txt > /dev/null
13.81user 0.23system 0:04.49elapsed 312%CPU (0avgtext+0avgdata 4632maxresident)k
0inputs+0outputs (0major+712minor)pagefaults 0swaps
> time pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt > /dev/null
13.60user 0.10system 0:03.45elapsed 397%CPU (0avgtext+0avgdata 5376maxresident)k
0inputs+0outputs (0major+878minor)pagefaults 0swaps

This indicates that crabz spends some of its time idling and cannot achieve 100% CPU utilization. This is typically caused by parallel tasks being bottlenecked on something single-threaded, e.g. I/O, checksumming, or straight up lock contention.

sstadick commented 3 years ago

That makes sense... and is a flaw.

gzp waits until num_threads buffers are available, then processes them all at once. So if IO is slow, and gzp can process the buffers faster than the reader can provide another num_threads buffers, it will spin. This is unlike pigz, which processes items as they come. There may be a middle ground yet.
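Roughly the difference, as a sketch (illustrative Rust, not the actual gzp or pigz code):

// Illustrative contrast between the two strategies, with compression jobs
// arriving on a flume channel.
fn batching(rx: &flume::Receiver<Vec<u8>>, num_threads: usize) {
    loop {
        // Wait until num_threads buffers have accumulated, then hand the
        // whole batch out at once; workers idle while the batch refills.
        let batch: Vec<Vec<u8>> = rx.iter().take(num_threads).collect();
        if batch.is_empty() {
            break;
        }
        for block in batch {
            compress(&block);
        }
    }
}

fn streaming(rx: &flume::Receiver<Vec<u8>>) {
    // pigz-style: pull the next buffer as soon as a worker is free, so no
    // one waits for a full batch to build up.
    for block in rx.iter() {
        compress(&block);
    }
}

fn compress(_block: &[u8]) { /* deflate one block */ }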

sstadick commented 3 years ago

I just pushed another set of changes to gzp to make it process values as they come instead of trying to buffer them into a queue. I expect a small performance hit on very fast IO systems, but it should greatly improve things on the 4-core system (I hope!).

Shnatsel commented 3 years ago

Oh yeah, that did the trick for the quad-core! All 4 compression threads are utilized now, and performance is either on par with pigz (for -c3) or 17% better (for -c9). Detailed timings: https://gist.github.com/Shnatsel/3128036e67fd8647787df281422cc73a

Shnatsel commented 3 years ago

The dual-core system took a noticeable hit, but still beats pigz:

> hyperfine -w3 'crabz/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' 'crabz-streaming/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' 'pigz -p2 -3 < ~/shakespeare_50_times.txt';
Benchmark #1: crabz/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt
  Time (mean ± σ):      7.991 s ±  0.067 s    [User: 14.270 s, System: 0.429 s]
  Range (min … max):    7.870 s …  8.088 s    10 runs

Benchmark #2: crabz-streaming/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt
  Time (mean ± σ):     10.066 s ±  0.229 s    [User: 14.074 s, System: 0.348 s]
  Range (min … max):    9.793 s … 10.354 s    10 runs

Benchmark #3: pigz -p2 -3 < ~/shakespeare_50_times.txt
  Time (mean ± σ):     11.388 s ±  0.028 s    [User: 21.496 s, System: 0.381 s]
  Range (min … max):   11.354 s … 11.430 s    10 runs

Summary
  'crabz/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' ran
    1.26 ± 0.03 times faster than 'crabz-streaming/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt'
    1.43 ± 0.01 times faster than 'pigz -p2 -3 < ~/shakespeare_50_times.txt'

Also, why do you use flume as your mpmc channel when you already have crossbeam-channel pulled in as a dependency by rayon? That way you have two channel implementations in your binary.

Shnatsel commented 3 years ago

Dual-core profiles:
Before: https://share.firefox.dev/3jiJtHk
After: https://share.firefox.dev/3kmteIM

Shnatsel commented 3 years ago

The "after" profile shows the time spent in Flume in the checksumming thread go up from 1s to 1.8s, so I wonder if crossbeam-deque would perform better under the high contention? Since it's already in the binary because of rayon, it's probably worth a shot.

sstadick commented 3 years ago

flume was a holdover from the initial versions of gzp that were built on tokio. I removed it and went to just crossbeam, and saw no real performance change.

I did try one more thing though, which stripped out rayon entirely. I think it should bring back that 2-core performance. The cost, though, is that instead of letting rayon manage a threadpool, this keeps num_threads threads running from the start. I'm undecided whether that tradeoff is worth it yet, but it certainly seems much faster.
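The shape of it is roughly this sketch (assumed names, not the actual gzp code): workers are spawned once up front and live for the whole run, pulling blocks off a shared channel.

use std::thread;

// Fixed pool: num_threads dedicated workers instead of a rayon-managed
// threadpool. Each worker runs until the sender side is dropped.
fn spawn_workers(
    num_threads: usize,
    rx: flume::Receiver<Vec<u8>>,
) -> Vec<thread::JoinHandle<()>> {
    (0..num_threads)
        .map(|_| {
            let rx = rx.clone();
            thread::spawn(move || {
                // No pool manager and no work stealing: just a dedicated
                // thread draining the shared mpmc channel.
                for block in rx.iter() {
                    compress(&block);
                }
            })
        })
        .collect()
}

fn compress(_block: &[u8]) { /* deflate one block */ }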

Same branch if you are interested!

Also, thanks for bearing with me through this; your feedback has been extremely helpful :+1:

Shnatsel commented 3 years ago

I'll test it out!

If crossbeam-deque and flume provide identical performance, I'd stick with flume because it has dramatically less unsafe code in it. Crossbeam is really quite complex due to the custom lock-free algorithms, and it's all unsafe code, naturally. If that complexity can be avoided, I'm all for it.

Shnatsel commented 3 years ago

Also, speaking of dependencies, I've run cargo geiger on crabz and it turns out that the color-eyre crate pulls in a huge number of dependencies, many of them with large amounts of unsafe code. Is that dependency essential? I imagine this will make crabz quite difficult to package for a Linux distro in the future.

Shnatsel commented 3 years ago

On my quad-core, crabz without rayon is astonishingly fast. Like, 50% faster than both pigz and the previous crabz:

> hyperfine -w3 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' 'crabz-rayonless/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' 'pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      3.421 s ±  0.018 s    [User: 10.992 s, System: 0.161 s]
  Range (min … max):    3.381 s …  3.442 s    10 runs

Benchmark #2: crabz-rayonless/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      2.400 s ±  0.013 s    [User: 9.349 s, System: 0.133 s]
  Range (min … max):    2.373 s …  2.417 s    10 runs

Benchmark #3: pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt
  Time (mean ± σ):      3.511 s ±  0.015 s    [User: 13.757 s, System: 0.138 s]
  Range (min … max):    3.492 s …  3.534 s    10 runs

Summary
  'crabz-rayonless/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' ran
    1.43 ± 0.01 times faster than 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt'
    1.46 ± 0.01 times faster than 'pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt'

When I see numbers like these I usually assume I messed up correctness and the program actually does less work than it's supposed to. But no, the round-tripped file decompresses to the original data correctly! :tada: :rocket: :partying_face:
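(The sanity check amounts to something like this sketch using flate2 directly; illustrative, not exactly what I ran:)

use flate2::{read::GzDecoder, write::GzEncoder, Compression};
use std::io::{Read, Write};

// Compress and decompress in-process, then compare against the input.
fn roundtrip_ok(data: &[u8]) -> bool {
    let mut encoder = GzEncoder::new(Vec::new(), Compression::new(3));
    encoder.write_all(data).unwrap();
    let compressed = encoder.finish().unwrap();

    let mut decoder = GzDecoder::new(&compressed[..]);
    let mut out = Vec::new();
    decoder.read_to_end(&mut out).unwrap();
    out == data
}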

Shnatsel commented 3 years ago

Dual-core is back to the original numbers for crabz, and beating pigz:

> hyperfine -w3 'crabz/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' 'crabz-streaming/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' 'crabz-rayonless/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' 'pigz -p2 -3 < ~/shakespeare_50_times.txt'
Benchmark #1: crabz/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt
  Time (mean ± σ):      7.992 s ±  0.060 s    [User: 14.200 s, System: 0.419 s]
  Range (min … max):    7.870 s …  8.060 s    10 runs

Benchmark #2: crabz-streaming/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt
  Time (mean ± σ):      9.940 s ±  0.213 s    [User: 13.864 s, System: 0.328 s]
  Range (min … max):    9.789 s … 10.402 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark #3: crabz-rayonless/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt
  Time (mean ± σ):      8.317 s ±  0.040 s    [User: 15.505 s, System: 0.419 s]
  Range (min … max):    8.253 s …  8.398 s    10 runs

Benchmark #4: pigz -p2 -3 < ~/shakespeare_50_times.txt
  Time (mean ± σ):     11.405 s ±  0.051 s    [User: 21.529 s, System: 0.351 s]
  Range (min … max):   11.350 s … 11.505 s    10 runs

Summary
  'crabz/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' ran
    1.04 ± 0.01 times faster than 'crabz-rayonless/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt'
    1.24 ± 0.03 times faster than 'crabz-streaming/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt'
    1.43 ± 0.01 times faster than 'pigz -p2 -3 < ~/shakespeare_50_times.txt'
sstadick commented 3 years ago

That's awesome!

I've moved to flume only. I need to do more rigorous testing to decide on a default backend between zlib-ng, zlib, and the Rust backend.

Regarding crabz deps, I've removed tracing and color-eyre and made things more standard. There's still a good bit of work to do to add a single-threaded mode and a few more CLI options similar to pigz. But it's a start!

Thanks again for working on this! I'll be putting out new releases of both gzp and crabz with all the updates discussed here in the next day or two.

Shnatsel commented 3 years ago

Thanks to you for acting on this!

sstadick commented 3 years ago

See gzp v0.6.0 and crabz v0.2.0