ouch-org / ouch

Painless compression and decompression in the terminal
https://crates.io/crates/ouch
Other
2.18k stars 75 forks source link

Added multithreading to zstd compression #689

Closed nalabrie closed 1 month ago

nalabrie commented 1 month ago

Hello, this is my first PR here and my Rust is a bit rusty, so forgive me if this is a foolish change or if PRs are unwelcome. I created issue #688 and decided to take a stab at implementing it.

nalabrie commented 1 month ago

Whoops I think I put the wrong pull number in the CHANGELOG.md

nalabrie commented 1 month ago

This needs more testing, in my opinion. It works, but the speedup isn't as significant as it probably could be. Here is a result of a quick test using a testdir folder of 57 files totaling ~200MB:

marcospb19 commented 1 month ago

the speedup isn't as significant as it probably could be

I see, to be honest I'm not 100% sure why our version is slower.

I'll let you make the call tho, if you want to investigate further or you think we can merge this as it is.

If you decide to investigate, I highly highly recommend looking into flamegraph which can be installed from the crate cargo-flamegraph (you don't need to use it through cargo, it's a binary), with it, you can run both zstd and Ouch to see what's taking most of the time.

But it's a valuable improvement regardless! So feel incentivized and not pressured to do so :sweat_smile: .

nalabrie commented 1 month ago

Thank you, I am happy that this change is appreciated and I'm glad to contribute. I will make the changes that you requested in the coming hours.

As for investigating why the speedup isn't on par with the zstd tool, I do want to take a closer look at it but no promises on finding a solution. I enjoy Rust as a hobby and consider myself to be at an intermediate level, but I'm not super confident in my abilities if I'm being honest. I've heard of flamegraph and have wanted to try it out so this gives me a good excuse to learn it. I will work on this some today and tomorrow but if I cannot find anything then I'm fine with merging it as is (after the changes mentioned above, of course).

nalabrie commented 1 month ago

I'm not at my Linux machine right now, so I can't run flamegraph. Apparently it doesn't work on Windows, or at least I can't get it to work. I should have time to do that tomorrow, but for now I wanted to at least provide some proper benchmarks. I have also made the requested changes to the PR @marcospb19.

I did some research and found this issue on the repo for the zstd crate that ouch is using. The owner mentions that there is overhead involved in using the "streaming API." Perhaps that is a cause of the slowdown compared to the zstd binary?

Benchmark Notes

Results

ouch (st) ouch (mt) zstd (mt)
Run 1 111.9764401 69.7551597 47.8371113
Run 2 112.2669172 69.3189215 47.6084749
Run 3 112.0817931 69.1581166 47.5799035
Run 4 111.8815577 69.2234262 47.9138429
Run 5 111.7506892 69.5047793 47.7037464
Run 6 112.0945928 69.2046794 47.5450366
Run 7 112.1228552 69.196134 47.8802425
Run 8 111.7783502 69.2298791 47.5509878
Run 9 112.0262394 69.2156705 48.0269953
Run 10 111.7842633 69.3845321 47.8702028
Average 111.97636982 69.31912984 47.7516544

In short, multi-threaded ouch takes about 38% less time than single-threaded ouch and the zstd binary takes about 31% less time than multi-threaded ouch.

nalabrie commented 1 month ago

Here is the output of flamegraph --root -- ./ouch c --level 20 silesia silesia.tar.zst. Ouch was compiled with the dev profile. flamegraph Judging by the comment preview, it looks like the svg isn't interactive like its supposed to be. If that is the case for anyone reading this, I have also uploaded the file here.

I can somewhat understand how to parse that information but I do not know where to begin when it comes to using that to optimize further. At least not at the moment, but when I have more free time I will possibly investigate further. I'm fine with merging it as is.

As a side note, here is more terminal output that flamegraph spit out. Not sure how important it is.

[ perf record: Woken up 5997 times to write data ]
Warning:
Processed 101962 events and lost 6 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 1504.683 MB perf.data (95002 samples) ]
writing flamegraph to "flamegraph.svg"

Another side note, it seems to run better on Linux compared to Windows (same hardware), but that isn't too surprising to me. The gap is still there between ouch and zstd, however. I only timed one test, so no averages.

marcospb19 commented 1 month ago

Oh thank you so much for all the investigation and effort you put into this! I highly appreciate it.

I did some research and found https://github.com/gyscos/zstd-rs/issues/266 on the repo for the zstd crate that ouch is using. The owner mentions that there is overhead involved in using the "streaming API." Perhaps that is a cause of the slowdown compared to the zstd binary?

That'd make sense! I didn't know that, and yeah we use the streaming API to chain encoders and decoders.

It also makes sense that it adds overhead since we make lots of small calls instead of just one.

I can somewhat understand how to parse that information but I do not know where to begin when it comes to using that to optimize further.

Yeah if you run flamegraph just against Ouch, you'll be a little lost.

To compare both tools you'd need to run flamegraph against ouch and zstd, and if we use the same implementation (and functions, I'm not entirely sure), then you'd see the same bars but with different proportions.

With that, by comparing the size of the bars you could get a rough comparison of where the extra time is spent.

However, on second thought, I don't advise you to investigate further, flamegraph sampling with multithreading can be unpredictable, and we don't know if we are calling the exact same functions, and you'd likely find out at the end that it's indeed a problem in the Streaming API.

Well, you can do it if you're still curious :smile: but you've done a lot for this PR, ty!