Thanks for the detailed issue!
I think this comes down to accounting and the fact that crabz has not had threading tuned at all.
Looking at pigz, it oversubscribes threads: https://github.com/madler/pigz/blob/b6da942b9ca15eb9149837f07b2b3b6ff21d9845/pigz.c#L2206, in that it will spawn as many threads as there are cores, plus a writer thread, plus the main thread. gzp accounts for the writer thread, subtracting 1 from the number of threads it is allowed to spawn for compression. I'm not sure what the best way to represent that is, or whether just hiding that writer thread from the gzp cost is more helpful to end users, since you can oversubscribe like pigz does.
Do you have any thoughts on what would make the most sense for gzp as an end user?
What I'm thinking at the moment is that I'll change the documentation / function names so that num_threads explicitly sets the number of compression threads, and note that an additional thread is used for writing. I think that will better match users' expectations when setting the number of threads.
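To make the accounting concrete, here is a rough sketch (illustrative only; these are not gzp's actual functions, and the names are made up) of how the different schemes count threads:

```rust
// Illustrative only: how the accounting schemes discussed above translate a
// user-requested thread count into actual threads. Names are made up.

// pigz-style oversubscription: the requested count all goes to compression,
// and a writer thread plus the main (reader) thread run on top of that.
fn pigz_total_threads(requested: usize) -> usize {
    requested + 2
}

// gzp today: one of the requested threads is reserved for the writer, so
// only `requested - 1` threads actually compress.
fn gzp_current_compression_threads(requested: usize) -> usize {
    requested.saturating_sub(1)
}

// Proposed: `num_threads` means compression threads, and the writer thread
// is an extra, documented, implementation detail.
fn gzp_proposed_total_threads(num_threads: usize) -> usize {
    num_threads + 1
}

fn main() {
    let cores = 4;
    println!("pigz: {} threads total", pigz_total_threads(cores)); // 6
    println!("gzp now: {} compression threads", gzp_current_compression_threads(cores)); // 3
    println!("gzp proposed: {} threads total", gzp_proposed_total_threads(cores)); // 5
}
```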
This branch of crabz is using a branch of gzp where I've made the above changes and also added an option to specifically set the number of compression threads. If you still have your env set up and don't mind giving it a try, I'd be curious to see if that evens out the performance.
The threading_like_pigz branch is currently identical to main. Could you make sure you've pushed the changes to GitHub?
The profile linked above shows that the main thread and writer thread together do not occupy an entire core, so one core out of 4 ends up being mostly idle in my configuration. crc32 and writing are very fast, it seems. I believe it's best to match the number of compression threads to the number of CPUs, like pigz does already.
Ah! Sorry about that, changes have been pushed.
I don't have that exact setup anymore, but I've tried it on the same machine with a different Linux OS and the results are inconclusive: having more threads seems to help at higher compression levels but hinder at lower ones. Benches, with the same Shakespeare file repeated 100 times:
shnatsel@shnatsel-desktop ~/Code> hyperfine 'crabz/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt' 'crabz-new/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 18.458 s ± 0.240 s [User: 45.843 s, System: 0.236 s]
Range (min … max): 17.991 s … 18.732 s 10 runs
Benchmark #2: crabz-new/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 16.890 s ± 0.217 s [User: 48.440 s, System: 0.264 s]
Range (min … max): 16.672 s … 17.254 s 10 runs
Summary
'crabz-new/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt' ran
1.09 ± 0.02 times faster than 'crabz/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt'
shnatsel@shnatsel-desktop ~/Code> hyperfine 'crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt' 'crabz-new/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 4.289 s ± 0.060 s [User: 11.144 s, System: 0.191 s]
Range (min … max): 4.135 s … 4.353 s 10 runs
Benchmark #2: crabz-new/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 4.781 s ± 0.074 s [User: 12.626 s, System: 0.214 s]
Range (min … max): 4.674 s … 4.871 s 10 runs
Summary
'crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt' ran
1.11 ± 0.02 times faster than 'crabz-new/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt'
pigz 2.4 from Ubuntu 18.04 repos is still faster:
shnatsel@shnatsel-desktop ~/Code> hyperfine 'crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt' 'pigz -3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 4.342 s ± 0.102 s [User: 11.203 s, System: 0.179 s]
Range (min … max): 4.168 s … 4.466 s 10 runs
Benchmark #2: pigz -3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 3.491 s ± 0.041 s [User: 13.664 s, System: 0.130 s]
Range (min … max): 3.444 s … 3.569 s 10 runs
Summary
'pigz -3 < /media/shnatsel/ssd/large-file.txt' ran
1.24 ± 0.03 times faster than 'crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt'
Perhaps the difference in performance comes down to the differences in the underlying zlib implementation? Is there a flag for crabz that I could use to force using a single compression thread?
Weird. Here are my results, not in a thread-limited environment, but limiting with flags:
❯ hyperfine './target/release/crabz -c 3 -p 4 < ../gzp/bench-data/shakespeare.txt' 'pigz -3 -p 4 < ../gzp/bench-data/shakespeare.txt'
Benchmark #1: ./target/release/crabz -c 3 -p 4 < ../gzp/bench-data/shakespeare.txt
Time (mean ± σ): 2.043 s ± 0.020 s [User: 7.577 s, System: 0.156 s]
Range (min … max): 2.017 s … 2.069 s 10 runs
Benchmark #2: pigz -3 -p 4 < ../gzp/bench-data/shakespeare.txt
Time (mean ± σ): 2.981 s ± 0.014 s [User: 12.209 s, System: 0.226 s]
Range (min … max): 2.957 s … 3.007 s 10 runs
Summary
'./target/release/crabz -c 3 -p 4 < ../gzp/bench-data/shakespeare.txt' ran
1.46 ± 0.02 times faster than 'pigz -3 -p 4 < ../gzp/bench-data/shakespeare.txt'
That's crabz off the same branch I linked before; I pushed one new commit that allows gzp to go down to 1 thread. There is now a -p flag for crabz that specifies the number of compression threads it can use (run cargo update to clear the git cache for cargo).
Results for -p1:
Benchmark #1: ./target/release/crabz -c 3 -p 1 < ../gzp/bench-data/shakespeare.txt
Time (mean ± σ): 6.771 s ± 0.058 s [User: 7.207 s, System: 0.126 s]
Range (min … max): 6.681 s … 6.851 s 10 runs
Benchmark #2: pigz -3 -p 1 < ../gzp/bench-data/shakespeare.txt
Time (mean ± σ): 11.420 s ± 0.158 s [User: 11.336 s, System: 0.079 s]
Range (min … max): 11.079 s … 11.605 s 10 runs
Summary
'./target/release/crabz -c 3 -p 1 < ../gzp/bench-data/shakespeare.txt' ran
1.69 ± 0.03 times faster than 'pigz -3 -p 1 < ../gzp/bench-data/shakespeare.txt'
Even giving both crabz and pigz all my threads (32), on compression levels 3 and 9 crabz is 10-20% faster.
The default zlib library for crabz / gzp is zlib-ng; switching the feature flag to deflate_zlib or deflate_rust leads to no performance change with -p 4 for me.
So to get apples-to-apples (ish? does pigz link to system zlib?) zlib, change the gzp dep in crabz to:
gzp = { git = "https://github.com/sstadick/gzp", branch = "feature/allow_oversubscribed_writer", default-features = false, features = ["deflate_zlib"] }
I can't imagine that it matters that much, but what version of Rust are you running?
How did you install pigz? I am also running 2.4, installed via apt on Ubuntu 20.04.
Right now I'm comparing with pigz 2.4 installed via apt on Ubuntu 18.04.
For crabz I use a git checkout and then cargo build --release. I was using the Cargo.lock from the crabz repo up until now, but had to run cargo update to pull in the new version of gzp.
> rustc --version --verbose
rustc 1.52.1 (9bc8c42bb 2021-05-09)
binary: rustc
commit-hash: 9bc8c42bb2f19e745a63f3445f1ac248fb015e53
commit-date: 2021-05-09
host: x86_64-unknown-linux-gnu
release: 1.52.1
LLVM version: 12.0.0
Overcommit seems to help my 4-core system, but just barely:
> hyperfine 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' 'crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 4.706 s ± 0.069 s [User: 12.399 s, System: 0.201 s]
Range (min … max): 4.619 s … 4.796 s 10 runs
Benchmark #2: crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 4.894 s ± 0.072 s [User: 11.943 s, System: 0.198 s]
Range (min … max): 4.804 s … 5.025 s 10 runs
Summary
'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' ran
1.04 ± 0.02 times faster than 'crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt'
In single-threaded mode crabz seems to run much faster than pigz:
> hyperfine 'crabz-new/target/release/crabz -p1 -c3 < /media/shnatsel/ssd/large-file.txt' 'pigz -p1 -3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p1 -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 8.735 s ± 0.153 s [User: 9.703 s, System: 0.217 s]
Range (min … max): 8.609 s … 9.120 s 10 runs
Benchmark #2: pigz -p1 -3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 12.968 s ± 0.294 s [User: 12.896 s, System: 0.066 s]
Range (min … max): 12.757 s … 13.517 s 10 runs
Summary
'crabz-new/target/release/crabz -p1 -c3 < /media/shnatsel/ssd/large-file.txt' ran
1.48 ± 0.04 times faster than 'pigz -p1 -3 < /media/shnatsel/ssd/large-file.txt'
But pigz overtakes crabz when using 4 threads:
> hyperfine 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' 'pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 4.682 s ± 0.106 s [User: 12.331 s, System: 0.196 s]
Range (min … max): 4.512 s … 4.848 s 10 runs
Benchmark #2: pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 3.526 s ± 0.058 s [User: 13.691 s, System: 0.142 s]
Range (min … max): 3.457 s … 3.643 s 10 runs
Summary
'pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt' ran
1.33 ± 0.04 times faster than 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt'
Removing overcommit hurts performance slightly in the case of crabz and significantly in the case of pigz:
> hyperfine 'crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt' 'pigz -p3 -3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 4.835 s ± 0.107 s [User: 11.788 s, System: 0.193 s]
Range (min … max): 4.642 s … 4.987 s 10 runs
Benchmark #2: pigz -p3 -3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 4.743 s ± 0.117 s [User: 14.478 s, System: 0.173 s]
Range (min … max): 4.586 s … 4.955 s 10 runs
Summary
'pigz -p3 -3 < /media/shnatsel/ssd/large-file.txt' ran
1.02 ± 0.03 times faster than 'crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt'
I've tried the Rust backend instead of zlib-ng and saw the exact same performance in both single-threaded and multi-threaded mode.
So I guess the actionable takeaways are:
1. The parallelization overhead of crabz appears to be higher than that of pigz on my system, since the difference between 3 and 4 threads is so pronounced for pigz but barely exists for crabz. A profile of where crabz spends the time can be found here.
2. gzp should default to the 100% safe Rust backend for flate2, since performance is the same anyway.
3. I'll test a dual-core system next and see if 1 or 2 threads works best there.
I just pushed a new commit to gzp. I realized that when I "fixed" num_threads to mean just compression threads, I didn't re-adjust the queue sizes, which are all based on the number of threads, so the queues were allowing very little buffer to build up. Instead of just using num_threads as the queue size, I've made the queues 2 * num_threads, which is the same as pigz.
This gave an appreciable performance bump on my system.
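Roughly, the queue-sizing idea looks like this (just a sketch, not the actual gzp code; crossbeam-channel is used here purely for illustration):

```rust
use crossbeam_channel::{bounded, Receiver, Sender};

/// Sketch: size the bounded chunk queue at 2 * num_threads, as pigz does,
/// so the reader can stay up to two chunks ahead of every compression
/// thread while the queue still provides backpressure.
fn make_chunk_queue(num_threads: usize) -> (Sender<Vec<u8>>, Receiver<Vec<u8>>) {
    bounded(2 * num_threads)
}

fn main() {
    let (tx, rx) = make_chunk_queue(4); // capacity of 8 chunks
    tx.send(vec![0u8; 128 * 1024]).unwrap();
    assert_eq!(rx.recv().unwrap().len(), 128 * 1024);
}
```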
Building after cargo update should pull it in.
I agree on point 2, though. I want to re-test, now that work is getting to the compressors faster, whether the zlib library still makes a difference, but if the gap is narrow enough I'd rather have an all-Rust backend.
Thanks for sharing the profile info, looking at that now.
On my quad-core Ryzen overcommit is a toss-up. However, preliminary results indicate that having 2 compression threads on a dual-core system increases performance dramatically. I'll post the full dual-core results shortly.
Full timings from the quad-core Ryzen with the buffer size changes:
> hyperfine -w3 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' 'crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt' 'crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt' 'pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt'; hyperfine 'crabz-new/target/release/crabz -p4 -c9 < /media/shnatsel/ssd/large-file.txt' 'crabz-new/target/release/crabz -p3 -c9 < /media/shnatsel/ssd/large-file.txt' 'crabz/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt' 'pigz -p4 -9 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 4.396 s ± 0.061 s [User: 12.319 s, System: 0.168 s]
Range (min … max): 4.305 s … 4.483 s 10 runs
Benchmark #2: crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 4.291 s ± 0.158 s [User: 11.156 s, System: 0.192 s]
Range (min … max): 4.112 s … 4.616 s 10 runs
Benchmark #3: crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 4.386 s ± 0.096 s [User: 11.316 s, System: 0.174 s]
Range (min … max): 4.269 s … 4.572 s 10 runs
Benchmark #4: pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 3.527 s ± 0.016 s [User: 13.820 s, System: 0.136 s]
Range (min … max): 3.496 s … 3.552 s 10 runs
Summary
'pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt' ran
1.22 ± 0.05 times faster than 'crabz-new/target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt'
1.24 ± 0.03 times faster than 'crabz/target/release/crabz -c3 < /media/shnatsel/ssd/large-file.txt'
1.25 ± 0.02 times faster than 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p4 -c9 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 16.543 s ± 0.262 s [User: 48.818 s, System: 0.196 s]
Range (min … max): 16.094 s … 16.908 s 10 runs
Benchmark #2: crabz-new/target/release/crabz -p3 -c9 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 17.457 s ± 0.174 s [User: 44.624 s, System: 0.207 s]
Range (min … max): 17.154 s … 17.742 s 10 runs
Benchmark #3: crabz/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 17.493 s ± 0.129 s [User: 44.671 s, System: 0.197 s]
Range (min … max): 17.338 s … 17.697 s 10 runs
Benchmark #4: pigz -p4 -9 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 15.614 s ± 0.070 s [User: 61.647 s, System: 0.168 s]
Range (min … max): 15.521 s … 15.754 s 10 runs
Summary
'pigz -p4 -9 < /media/shnatsel/ssd/large-file.txt' ran
1.06 ± 0.02 times faster than 'crabz-new/target/release/crabz -p4 -c9 < /media/shnatsel/ssd/large-file.txt'
1.12 ± 0.01 times faster than 'crabz-new/target/release/crabz -p3 -c9 < /media/shnatsel/ssd/large-file.txt'
1.12 ± 0.01 times faster than 'crabz/target/release/crabz -c9 < /media/shnatsel/ssd/large-file.txt'
Having 2 compression threads instead of 1 seems to be greatly beneficial on a dual-core system.
On a dual-core AMD Stoney Ridge system, crabz with 2 compression threads beats pigz by a large margin:
hyperfine -w3 'target/release/crabz -p1 -c3 < ~/shakespeare_50_times.txt' 'target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' 'pigz -p2 -3 < ~/shakespeare_50_times.txt'; hyperfine 'target/release/crabz -p1 -c9 < ~/shakespeare_50_times.txt' 'target/release/crabz -p2 -c9 < ~/shakespeare_50_times.txt' 'pigz -p2 -9 < ~/shakespeare_50_times.txt'
Benchmark #1: target/release/crabz -p1 -c3 < ~/shakespeare_50_times.txt
Time (mean ± σ): 12.761 s ± 0.250 s [User: 13.368 s, System: 0.323 s]
Range (min … max): 12.391 s … 12.958 s 10 runs
Benchmark #2: target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt
Time (mean ± σ): 8.483 s ± 0.060 s [User: 15.180 s, System: 0.456 s]
Range (min … max): 8.415 s … 8.604 s 10 runs
Benchmark #3: pigz -p2 -3 < ~/shakespeare_50_times.txt
Time (mean ± σ): 10.901 s ± 0.010 s [User: 21.267 s, System: 0.363 s]
Range (min … max): 10.885 s … 10.914 s 10 runs
Summary
'target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' ran
1.29 ± 0.01 times faster than 'pigz -p2 -3 < ~/shakespeare_50_times.txt'
1.50 ± 0.03 times faster than 'target/release/crabz -p1 -c3 < ~/shakespeare_50_times.txt'
Benchmark #1: target/release/crabz -p1 -c9 < ~/shakespeare_50_times.txt
Time (mean ± σ): 55.027 s ± 0.527 s [User: 55.533 s, System: 0.335 s]
Range (min … max): 54.150 s … 56.303 s 10 runs
Benchmark #2: target/release/crabz -p2 -c9 < ~/shakespeare_50_times.txt
Time (mean ± σ): 35.056 s ± 0.305 s [User: 60.079 s, System: 0.564 s]
Range (min … max): 34.556 s … 35.766 s 10 runs
Benchmark #3: pigz -p2 -9 < ~/shakespeare_50_times.txt
Time (mean ± σ): 50.521 s ± 1.243 s [User: 98.581 s, System: 0.398 s]
Range (min … max): 49.524 s … 52.373 s 10 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
'target/release/crabz -p2 -c9 < ~/shakespeare_50_times.txt' ran
1.44 ± 0.04 times faster than 'pigz -p2 -9 < ~/shakespeare_50_times.txt'
1.57 ± 0.02 times faster than 'target/release/crabz -p1 -c9 < ~/shakespeare_50_times.txt'
Here's a profile of the latest code on my 4-core machine with 4 threads: https://share.firefox.dev/2WnspHl
I've also enabled debug info in release mode to make the profile more detailed.
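(Debug info in release builds can be enabled with `debug = true` under `[profile.release]` in Cargo.toml.)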
Weird. I'm not sure what else to try at the moment to figure out why -p3 is faster than -p4 on your quad-core. It's encouraging, though, that on the dual-core your numbers look more like what I've been seeing.
-p3 is only faster at low compression levels. For high compression, -p4 is faster.
As to why, I see that pigz reports more user time than crabz:
> time target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt > /dev/null
Aug 25 22:17:52.560 INFO crabz: Compressing with 4 threads at compression level 3.
11.87user 0.15system 0:04.40elapsed 273%CPU (0avgtext+0avgdata 10076maxresident)k
0inputs+0outputs (0major+3898minor)pagefaults 0swaps
> time target/release/crabz -p3 -c3 < /media/shnatsel/ssd/large-file.txt > /dev/null
Aug 25 22:18:03.095 INFO crabz: Compressing with 3 threads at compression level 3.
10.11user 0.19system 0:03.71elapsed 277%CPU (0avgtext+0avgdata 8580maxresident)k
0inputs+0outputs (0major+7666minor)pagefaults 0swaps
> time pigz -p3 -3 < /media/shnatsel/ssd/large-file.txt > /dev/null
13.81user 0.23system 0:04.49elapsed 312%CPU (0avgtext+0avgdata 4632maxresident)k
0inputs+0outputs (0major+712minor)pagefaults 0swaps
> time pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt > /dev/null
13.60user 0.10system 0:03.45elapsed 397%CPU (0avgtext+0avgdata 5376maxresident)k
0inputs+0outputs (0major+878minor)pagefaults 0swaps
This indicates that crabz spends some of its time idling and cannot achieve 100% CPU utilization. This is typically caused by parallel tasks being bottlenecked on something single-threaded, e.g. I/O, checksumming, or straight-up lock contention.
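(For reference, Amdahl's law: if a fraction s of the work is serial, the speedup on n threads is capped at 1 / (s + (1 - s) / n), so even a small single-threaded stage keeps total CPU utilization well below n full cores.)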
That makes sense... and is a flaw. gzp waits until num_threads buffers are available and then processes them all at once. So if I/O is slow, and it can process the buffers faster than the reader can provide another num_threads of them, it will spin, unlike pigz, which processes items as they come. There may be a middle ground yet.
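Roughly, the difference between the two approaches looks like this (a minimal sketch, not gzp's real internals; the chunk size and channel types are placeholders):

```rust
use crossbeam_channel::Sender;
use std::io::Read;

const CHUNK_SIZE: usize = 128 * 1024;

/// Batching style: wait until `num_threads` chunks have been read, then hand
/// the whole batch to the workers at once. If the reader is the slow side,
/// the compression threads idle while the batch fills up.
fn feed_batched<R: Read>(mut input: R, num_threads: usize, tx: Sender<Vec<Vec<u8>>>) {
    loop {
        let mut batch = Vec::with_capacity(num_threads);
        for _ in 0..num_threads {
            let mut chunk = vec![0u8; CHUNK_SIZE];
            // (a real implementation would keep reading until the chunk is full)
            let n = input.read(&mut chunk).unwrap();
            if n == 0 {
                break;
            }
            chunk.truncate(n);
            batch.push(chunk);
        }
        if batch.is_empty() {
            return;
        }
        tx.send(batch).unwrap(); // workers only start once the whole batch arrives
    }
}

/// Streaming style (what pigz does): dispatch every chunk as soon as it has
/// been read, so a compression thread can start on it immediately.
fn feed_streaming<R: Read>(mut input: R, tx: Sender<Vec<u8>>) {
    loop {
        let mut chunk = vec![0u8; CHUNK_SIZE];
        let n = input.read(&mut chunk).unwrap();
        if n == 0 {
            return;
        }
        chunk.truncate(n);
        tx.send(chunk).unwrap();
    }
}

fn main() {
    let data = std::io::Cursor::new(vec![1u8; 300 * 1024]);
    let (tx, rx) = crossbeam_channel::unbounded();
    feed_streaming(data, tx);
    assert_eq!(rx.iter().count(), 3); // 300 KiB split into three chunks
}
```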
I just pushed another set of changes to gzp to make it process values as they come instead of trying to buffer them into a queue. I expect it to be a small performance hit on very fast IO systems, but it should greatly improve things on the 4-core system (I hope!).
Oh yeah, that did the trick for the quad-core! All 4 compression threads are utilized now, and performance is either on par with pigz (for -c3) or 17% better (for -c9). Detailed timings: https://gist.github.com/Shnatsel/3128036e67fd8647787df281422cc73a
The dual-core system took a noticeable hit, but still beats pigz:
> hyperfine -w3 'crabz/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' 'crabz-streaming/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' 'pigz -p2 -3 < ~/shakespeare_50_times.txt';
Benchmark #1: crabz/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt
Time (mean ± σ): 7.991 s ± 0.067 s [User: 14.270 s, System: 0.429 s]
Range (min … max): 7.870 s … 8.088 s 10 runs
Benchmark #2: crabz-streaming/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt
Time (mean ± σ): 10.066 s ± 0.229 s [User: 14.074 s, System: 0.348 s]
Range (min … max): 9.793 s … 10.354 s 10 runs
Benchmark #3: pigz -p2 -3 < ~/shakespeare_50_times.txt
Time (mean ± σ): 11.388 s ± 0.028 s [User: 21.496 s, System: 0.381 s]
Range (min … max): 11.354 s … 11.430 s 10 runs
Summary
'crabz/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' ran
1.26 ± 0.03 times faster than 'crabz-streaming/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt'
1.43 ± 0.01 times faster than 'pigz -p2 -3 < ~/shakespeare_50_times.txt'
Also, why do you use flume as your MPMC channel when you already have crossbeam-channel pulled in as a dependency by rayon? That way you end up with two channel implementations in your binary.
Dual-core profiles:
Before: https://share.firefox.dev/3jiJtHk
After: https://share.firefox.dev/3kmteIM
The "after" profile shows the time spent in Flume in the checksumming thread go up from 1s to 1.8s, so I wonder if crossbeam-deque would perform better under the high contention? Since it's already in the binary because of rayon, it's probably worth a shot.
flume was a holdover from the initial versions of gzp that were built on tokio. I removed it and went to just crossbeam and had no real performance change.
I did try one more thing though, which strips out rayon entirely. I think it should bring back that 2-core performance. The cost is that instead of letting rayon manage a thread pool, this keeps num_threads threads running from the start. I'm undecided whether that tradeoff is worth it yet, but it certainly seems much faster.
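The shape of it is roughly this (just a sketch, not the real gzp code; crossbeam-channel is used purely for illustration):

```rust
use crossbeam_channel::{unbounded, Sender};
use std::thread;
use std::thread::JoinHandle;

/// Sketch: spawn `num_threads` long-lived compression workers up front
/// instead of borrowing threads from a rayon pool. The workers live for the
/// lifetime of the compressor.
fn spawn_workers(num_threads: usize) -> (Sender<Vec<u8>>, Vec<JoinHandle<()>>) {
    let (tx, rx) = unbounded::<Vec<u8>>();
    let handles = (0..num_threads)
        .map(|_| {
            let rx = rx.clone();
            thread::spawn(move || {
                // Each worker pulls chunks until every sender has been dropped.
                while let Ok(chunk) = rx.recv() {
                    let _compressed = chunk; // ...compress the chunk here...
                }
            })
        })
        .collect();
    (tx, handles)
}

fn main() {
    let (tx, handles) = spawn_workers(4);
    tx.send(vec![0u8; 1024]).unwrap();
    drop(tx); // dropping the last sender lets the workers exit
    for h in handles {
        h.join().unwrap();
    }
}
```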
Same branch if you are interested!
Also, thanks for bearing with me through this, your feedback has been extremely helpful :+1:
I'll test it out!
If crossbeam-deque and flume provide identical performance, I'd stick with flume because it has dramatically less unsafe code in it. Crossbeam is really quite complex due to the custom lock-free algorithms, and it's all unsafe code, naturally. If that complexity can be avoided, I'm all for it.
Also, speaking of dependencies, I've run cargo geiger on crabz, and it turns out that the color-eyre crate pulls in a huge number of dependencies, many of them with large amounts of unsafe code. Is that dependency essential? I imagine this will make crabz quite difficult to package for a Linux distro in the future.
On my quad-core, crabz without rayon is astonishingly fast. Like, 50% faster than both pigz and the previous crabz:
> hyperfine -w3 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' 'crabz-rayonless/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' 'pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt'
Benchmark #1: crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 3.421 s ± 0.018 s [User: 10.992 s, System: 0.161 s]
Range (min … max): 3.381 s … 3.442 s 10 runs
Benchmark #2: crabz-rayonless/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 2.400 s ± 0.013 s [User: 9.349 s, System: 0.133 s]
Range (min … max): 2.373 s … 2.417 s 10 runs
Benchmark #3: pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt
Time (mean ± σ): 3.511 s ± 0.015 s [User: 13.757 s, System: 0.138 s]
Range (min … max): 3.492 s … 3.534 s 10 runs
Summary
'crabz-rayonless/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt' ran
1.43 ± 0.01 times faster than 'crabz-new/target/release/crabz -p4 -c3 < /media/shnatsel/ssd/large-file.txt'
1.46 ± 0.01 times faster than 'pigz -p4 -3 < /media/shnatsel/ssd/large-file.txt'
When I see numbers like these I usually assume I messed up correctness and the program actually does less work than it's supposed to. But no, the round-tripped file decompresses to the original data correctly! :tada: :rocket: :partying_face:
Dual-core is back to the original numbers for crabz, and beating pigz:
> hyperfine -w3 'crabz/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' 'crabz-streaming/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' 'crabz-rayonless/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' 'pigz -p2 -3 < ~/shakespeare_50_times.txt'
Benchmark #1: crabz/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt
Time (mean ± σ): 7.992 s ± 0.060 s [User: 14.200 s, System: 0.419 s]
Range (min … max): 7.870 s … 8.060 s 10 runs
Benchmark #2: crabz-streaming/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt
Time (mean ± σ): 9.940 s ± 0.213 s [User: 13.864 s, System: 0.328 s]
Range (min … max): 9.789 s … 10.402 s 10 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark #3: crabz-rayonless/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt
Time (mean ± σ): 8.317 s ± 0.040 s [User: 15.505 s, System: 0.419 s]
Range (min … max): 8.253 s … 8.398 s 10 runs
Benchmark #4: pigz -p2 -3 < ~/shakespeare_50_times.txt
Time (mean ± σ): 11.405 s ± 0.051 s [User: 21.529 s, System: 0.351 s]
Range (min … max): 11.350 s … 11.505 s 10 runs
Summary
'crabz/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt' ran
1.04 ± 0.01 times faster than 'crabz-rayonless/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt'
1.24 ± 0.03 times faster than 'crabz-streaming/target/release/crabz -p2 -c3 < ~/shakespeare_50_times.txt'
1.43 ± 0.01 times faster than 'pigz -p2 -3 < ~/shakespeare_50_times.txt'
That's awesome!
I've moved to flume only. I need to do more rigorous testing to decide on a default backend between zlib-ng, zlib, and the Rust backend.
Regarding crabz deps, I've removed tracing and color-eyre and made things more standard. There's still a good bit of work to do to add a single-threaded mode and a few more CLI options similar to pigz, but it's a start!
Thanks again for working on this! I'll be putting out new releases of both gzp and crabz with all the updates discussed here in the next day or two.
Thanks to you for acting on this!
See gzp v0.6.0 and crabz v0.2.0.
I've run some tests comparing crabz to pigz using the benchmarking setup described in the crabz readme. On a 4-core system with no hyperthreading, crabz was measurably slower.
I've profiled both using perf, and it turned out that crabz spends the vast majority of the time in zlib compression, so parallelization overhead is not an issue. However, crabz only spawned 3 threads performing compression, while pigz spawned 4 compression threads. After passing -p3 to pigz so that it would only spawn 3 compression threads, the compression time became identical to crabz.
I suspect this is also why you're not seeing any parallelization gains on dual-core systems.
Technical details
crabz profile: https://share.firefox.dev/3zeVRxN
pigz profile: https://share.firefox.dev/2WeYe4V
crabz installed via cargo install crabz on a clean Ubuntu 20.04 installation, pigz installed via apt.