Closed nalabrie closed 1 month ago
Whoops I think I put the wrong pull number in the CHANGELOG.md
This needs more testing, in my opinion. It works, but the speedup isn't as significant as it probably could be. Here is a result of a quick test using a testdir
folder of 57 files totaling ~200MB:
ouch
stable version, ouch c --level 20 testdir output.tar.zst
takes about 1 minute, 46 seconds (106910 ms
).67829 ms
).zstd
version 1.5.5
, zstd -T0 --ultra -20 testdir.tar
takes about 35 seconds (35106 ms
).
testdir
was already a tar
file, which I timed separately to be about 400 ms
.--ultra
flag simply lets the level go above 19.the speedup isn't as significant as it probably could be
I see, to be honest I'm not 100% sure why our version is slower.
I'll let you make the call tho, if you want to investigate further or you think we can merge this as it is.
If you decide to investigate, I highly highly recommend looking into flamegraph
which can be installed from the crate cargo-flamegraph
(you don't need to use it through cargo, it's a binary), with it, you can run both zstd
and Ouch
to see what's taking most of the time.
But it's a valuable improvement regardless! So feel incentivized and not pressured to do so :sweat_smile: .
Thank you, I am happy that this change is appreciated and I'm glad to contribute. I will make the changes that you requested in the coming hours.
As for investigating why the speedup isn't on par with the zstd tool, I do want to take a closer look at it but no promises on finding a solution. I enjoy Rust as a hobby and consider myself to be at an intermediate level, but I'm not super confident in my abilities if I'm being honest. I've heard of flamegraph and have wanted to try it out so this gives me a good excuse to learn it. I will work on this some today and tomorrow but if I cannot find anything then I'm fine with merging it as is (after the changes mentioned above, of course).
I'm not at my Linux machine right now, so I can't run flamegraph
. Apparently it doesn't work on Windows, or at least I can't get it to work. I should have time to do that tomorrow, but for now I wanted to at least provide some proper benchmarks. I have also made the requested changes to the PR @marcospb19.
I did some research and found this issue on the repo for the zstd
crate that ouch
is using. The owner mentions that there is overhead involved in using the "streaming API." Perhaps that is a cause of the slowdown compared to the zstd
binary?
zstd
benchmarks..tar.zst
archive using level 20.st
= single-threaded, mt
= multi-threaded (using all available physical cores).zstd
1.5.6ouch
0.5.1 (for single-threaded benchmarks)ouch
this commit from my fork (for multi-threaded benchmarks) (release
profile)ouch (st) | ouch (mt) | zstd (mt) | |
---|---|---|---|
Run 1 | 111.9764401 | 69.7551597 | 47.8371113 |
Run 2 | 112.2669172 | 69.3189215 | 47.6084749 |
Run 3 | 112.0817931 | 69.1581166 | 47.5799035 |
Run 4 | 111.8815577 | 69.2234262 | 47.9138429 |
Run 5 | 111.7506892 | 69.5047793 | 47.7037464 |
Run 6 | 112.0945928 | 69.2046794 | 47.5450366 |
Run 7 | 112.1228552 | 69.196134 | 47.8802425 |
Run 8 | 111.7783502 | 69.2298791 | 47.5509878 |
Run 9 | 112.0262394 | 69.2156705 | 48.0269953 |
Run 10 | 111.7842633 | 69.3845321 | 47.8702028 |
Average | 111.97636982 | 69.31912984 | 47.7516544 |
In short, multi-threaded ouch
takes about 38% less time than single-threaded ouch
and the zstd
binary takes about 31% less time than multi-threaded ouch
.
Here is the output of flamegraph --root -- ./ouch c --level 20 silesia silesia.tar.zst
. Ouch was compiled with the dev
profile.
Judging by the comment preview, it looks like the svg
isn't interactive like its supposed to be. If that is the case for anyone reading this, I have also uploaded the file here.
I can somewhat understand how to parse that information but I do not know where to begin when it comes to using that to optimize further. At least not at the moment, but when I have more free time I will possibly investigate further. I'm fine with merging it as is.
As a side note, here is more terminal output that flamegraph
spit out. Not sure how important it is.
[ perf record: Woken up 5997 times to write data ]
Warning:
Processed 101962 events and lost 6 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 1504.683 MB perf.data (95002 samples) ]
writing flamegraph to "flamegraph.svg"
Another side note, it seems to run better on Linux compared to Windows (same hardware), but that isn't too surprising to me. The gap is still there between ouch
and zstd
, however. I only timed one test, so no averages.
./ouch c --level 20 silesia silesia.tar.zst
took 54.9 seconds (release
profile).tar cf silesia.tar silesia && zstdmt --ultra -20 silesia.tar
took 43.888 seconds.Oh thank you so much for all the investigation and effort you put into this! I highly appreciate it.
I did some research and found https://github.com/gyscos/zstd-rs/issues/266 on the repo for the zstd crate that ouch is using. The owner mentions that there is overhead involved in using the "streaming API." Perhaps that is a cause of the slowdown compared to the zstd binary?
That'd make sense! I didn't know that, and yeah we use the streaming API to chain encoders and decoders.
It also makes sense that it adds overhead since we make lots of small calls instead of just one.
I can somewhat understand how to parse that information but I do not know where to begin when it comes to using that to optimize further.
Yeah if you run flamegraph
just against Ouch
, you'll be a little lost.
To compare both tools you'd need to run flamegraph
against ouch
and zstd
, and if we use the same implementation (and functions, I'm not entirely sure), then you'd see the same bars but with different proportions.
With that, by comparing the size of the bars you could get a rough comparison of where the extra time is spent.
However, on second thought, I don't advise you to investigate further, flamegraph sampling with multithreading can be unpredictable, and we don't know if we are calling the exact same functions, and you'd likely find out at the end that it's indeed a problem in the Streaming API.
Well, you can do it if you're still curious :smile: but you've done a lot for this PR, ty!
Hello, this is my first PR here and my Rust is a bit rusty, so forgive me if this is a foolish change or if PRs are unwelcome. I created issue #688 and decided to take a stab at implementing it.