nicolas-comerci / precomp-cpp

Precomp, C++ version - further compress already compressed files
http://schnaader.info/precomp.php
Apache License 2.0

Question: Multithreading/Speed #7

Open Merculous opened 1 year ago

Merculous commented 1 year ago

I'm aware that you've left precomp's verify method enabled by default, which is great, and that you've said it will be slower. We can all agree that's a good trade-off, and I already expect the slowdown.

I'm asking whether precomp is always multithreaded or not. I'd like to get the best output from precomp using -intense and -brute, and I'm fully aware that using those options will greatly slow down the whole process. Anyway, while testing precomp's speed I was also watching htop, and I noticed (guessing here, since I don't remember exactly) that only towards the end, maybe around 80% through, CPU usage finally went above 100%.

I know some programs are deliberately written to keep CPU usage low, but in my case I don't care about CPU or memory usage at all. So, does precomp always use multithreading for everything, or does it only take effect when applicable? I noticed with bzip3, at least, that the number of running threads isn't constant; it increases or decreases, presumably depending on the data. So basically, is precomp slow because of the kind of data being processed, or is it just a threshold thing?

Hopefully this makes sense. I don't know of any other program that does this, and I'm actually happy you've added the verify routine. I'm not picky about using precomp to compress data further, but when I tested a file that was roughly 2 GB, it took about 11-12 minutes just to process it; my script reported something like 2-5 MB/s (I don't remember exactly, but it was around there). I'm just not sure whether multithreading is already being used, and if it is, it seems like precomp is doing redundant computation?

I have no idea what I'm talking about in terms of the actual implementation, but I know just enough to figure out what's going on at a high level. I would love to help, but I only know Python, and every time I've tried working with C I've given up :/

P.S. When I say multithreading, I just mean CPU usage going above 100%.

nicolas-comerci commented 1 year ago

Precomp itself is essentially single-threaded. Every byte is processed one by one, and every enabled format handler is tried one by one on those bytes. All formats except brute deflate have a very quick check, so that's not really a problem; brute mode, however, does slow down detection severely.
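Roughly, the scanning phase looks something like this (a simplified, hypothetical C++ sketch; the names and interfaces are made up for illustration and are not Precomp's actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical handler interface: a cheap header check followed by the
// expensive precompression attempt. Names are illustrative only.
struct FormatHandler {
    virtual ~FormatHandler() = default;
    virtual bool quick_check(const uint8_t* data, size_t len) const = 0;
    virtual void try_precompress(const uint8_t* data, size_t len) = 0;
};

// Single-threaded scan: walk the input byte by byte and offer the current
// position to every enabled handler. Brute deflate has no cheap check, so
// enabling it means the expensive attempt runs at (almost) every position.
void scan(const std::vector<uint8_t>& input,
          const std::vector<FormatHandler*>& handlers) {
    for (size_t pos = 0; pos < input.size(); ++pos) {
        for (FormatHandler* h : handlers) {
            if (h->quick_check(input.data() + pos, input.size() - pos)) {
                // Execution is handed to the handler; any multithreading
                // happens inside this call, not in the scan loop itself.
                h->try_precompress(input.data() + pos, input.size() - pos);
            }
        }
    }
}
```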

Once Precomp determines that it appears to have actually found a stream, it hands execution to the appropriate format handler. It's up to those handlers to spawn more threads if they can use them. In particular, all handlers based on Preflate (Brute/Deflate, Intense/Zlib, ZIP, GZIP, PNG, PDF and SWF) use multithreading, as Preflate itself handles it. I don't think any of the other format handlers supports MT. For example, you mentioned BZip2, which is single-threaded for both compression and decompression. There is an LBZip2 variant that supports MT, but I would have to research whether it produces identical (or close enough) compressed streams for it to be usable. Essentially, there is no meaningful way for Precomp to use MT regardless of the stream format being precompressed/recompressed; it's highly dependent on the format and thus on its handler.

It gets worse: even for format handlers that support MT, like those that ultimately use Preflate, most formats need to split the data into blocks or chunks for MT to work properly, which means a minimum amount of data is needed for each thread. In Preflate's case, each of these blocks is around 2 MB by default. The block size can be made smaller, but then you can actually make things slower, because there is a constant base cost to prepare each block. In essence, this means that while Preflate's MT works beautifully for large single-stream files (like a large .tar.gz; you should be able to see high CPU usage while processing one of those, especially with -d0), for files with many small streams (or when recursing inside the now-decompressed .tar.gz) it essentially devolves into single-threaded performance, and there is not much that can be done about that, at least not easily.
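To make the block-size point concrete, here is a minimal sketch of chunked multithreading, assuming ~2 MB blocks and made-up function names (this is not Preflate's actual code): a multi-gigabyte stream produces hundreds of blocks and keeps every core busy, while a stream smaller than one block gets a single task and runs on one thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Stand-in for the real per-block work (e.g. Preflate processing one chunk).
std::vector<uint8_t> process_block(std::vector<uint8_t> block) {
    return block; // placeholder
}

// Split the stream into ~2 MB blocks and process each block as its own task.
// A 1 GB stream yields ~512 blocks; a 500 KB stream yields exactly one, so
// the parallelism degenerates to a single thread for small streams.
std::vector<std::vector<uint8_t>> process_stream(
        const std::vector<uint8_t>& data,
        size_t block_size = 2 * 1024 * 1024) {
    std::vector<std::future<std::vector<uint8_t>>> tasks;
    for (size_t off = 0; off < data.size(); off += block_size) {
        size_t len = std::min(block_size, data.size() - off);
        std::vector<uint8_t> block(data.begin() + off, data.begin() + off + len);
        tasks.push_back(std::async(std::launch::async, process_block, std::move(block)));
    }
    std::vector<std::vector<uint8_t>> results;
    for (auto& t : tasks) results.push_back(t.get());
    return results;
}
```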

For precompression, at least, there are things that could possibly be done to improve performance, but there is no silver bullet. For recompression, things may be a little easier to improve: for small streams it's probably feasible to read many of them and recompress them at the same time (because Precomp writes block headers that tell us the size of each stream), but that's not how it works right now, and I have some other things I'm working on before I'd like to attempt it.
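To be clear, this is how such batched recompression could look, not how Precomp works today; the header layout and names below are assumptions for illustration only:

```cpp
#include <cstddef>
#include <cstdint>
#include <future>
#include <utility>
#include <vector>

// Hypothetical block header: the precompressed output records how large each
// embedded stream is, which is what would make batching possible.
struct StreamHeader {
    uint64_t size;   // size of the payload that follows
    uint8_t  format; // which handler produced it
};

// Stand-in for the real per-format recompression routine.
std::vector<uint8_t> recompress_one(std::vector<uint8_t> payload, uint8_t /*format*/) {
    return payload; // placeholder
}

// Speculative batching: read several small streams sequentially (their sizes
// are known from the headers), then recompress them concurrently.
std::vector<std::vector<uint8_t>> recompress_batch(
        const std::vector<std::pair<StreamHeader, std::vector<uint8_t>>>& streams) {
    std::vector<std::future<std::vector<uint8_t>>> tasks;
    for (const auto& [header, payload] : streams)
        tasks.push_back(std::async(std::launch::async, recompress_one,
                                   payload, header.format));
    std::vector<std::vector<uint8_t>> results;
    for (auto& t : tasks) results.push_back(t.get());
    return results;
}
```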

In summary:

- Precomp's detection/scanning phase is single-threaded; each byte is offered to each enabled handler in turn.
- Preflate-based handlers (Brute/Deflate, Intense/Zlib, ZIP, GZIP, PNG, PDF, SWF) can use multiple threads; the other handlers don't.
- Even with Preflate, MT only helps for large streams (blocks of roughly 2 MB), so files with many small streams run essentially single-threaded.

Maybe a bit too detailed an answer, but I will probably copy and paste it into a FAQ page later, so there you go.

Merculous commented 1 year ago

Thanks for the reply. That's actually a good explanation, and it doesn't hurt to add some context, which is indeed helpful (for me at least; I'm not working on this). I know that for certain applications MT won't do much, basically like how LZ4 already is; I'm assuming that's already about the fastest type of compression you can get.

Well, you've already made things make a lot more sense for a dummy like me, so thanks for the help.

Just curious, what's this bounded context branch? After that, feel free to close. I would close now, but I might as well ask what that branch is for, since I have no idea what it applies to, because, you know, I have no idea how precomp works.

nicolas-comerci commented 1 year ago

Don't beat yourself up. I remember wanting to contribute to Precomp back in 2018; I pulled the repo but was unable to properly comprehend the code at the time. If you like programming, just keep learning; it will come with time.

The bounded context branch is for the same feature I also called Partial Streams (though I guess you could consider those two different features), and it is essentially what will become v0.5.0.

The whole point of v0.5.0 is to allow piped and non-seekable input for precompression. Right now Precomp has to be able to do seek operations on the input stream, to "rewind" after trying a format handler in case precompression fails. That branch, and thus v0.5.0, will limit the amount of rewinding that might need to be done to a buffer small enough to fit in memory; anything from 100-1000 MB seems reasonable. That is what I mean by "bounded context": the amount of data Precomp needs to care about (the context) will be limited (that is, bounded).
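As a rough illustration of the idea (a hypothetical sketch, not code from the branch), a bounded context is basically a wrapper around a non-seekable stream that keeps only the last N bytes in memory, so handlers can rewind within that window but no further:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <istream>
#include <stdexcept>
#include <vector>

// Wraps a non-seekable stream and keeps only the last `capacity` bytes in
// memory, so a failed handler attempt can "rewind" by at most that much.
class BoundedContextReader {
public:
    BoundedContextReader(std::istream& src, size_t capacity)
        : src_(src), capacity_(capacity) {}

    // Read up to `n` bytes, replaying rewound bytes from the window first.
    std::vector<uint8_t> read(size_t n) {
        std::vector<uint8_t> out;
        while (out.size() < n) {
            if (rewound_ > 0) {                         // replay buffered bytes
                out.push_back(window_[window_.size() - rewound_]);
                --rewound_;
            } else {
                int c = src_.get();
                if (c == std::istream::traits_type::eof()) break;
                out.push_back(static_cast<uint8_t>(c));
                window_.push_back(static_cast<uint8_t>(c));
                if (window_.size() > capacity_) window_.pop_front();
            }
        }
        return out;
    }

    // Rewind by `n` bytes; only possible within the bounded window.
    void rewind(size_t n) {
        if (rewound_ + n > window_.size())
            throw std::runtime_error("rewind exceeds the bounded context");
        rewound_ += n;
    }

private:
    std::istream& src_;
    size_t capacity_;
    size_t rewound_ = 0;
    std::deque<uint8_t> window_; // last `capacity_` bytes handed out
};
```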

This requires quite a bit of reengineering of the code, as format handlers need to be able to output partial precompressed data (of at most the size of the small buffer previously mentioned) and recompress that partial data from portions of the original data stream. But once a block of partial precompressed data is verified and written out, the original data can be safely discarded, which is what allows us to keep reading from the original input stream without ever exceeding the buffer.
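The invariant could look something like this (again, a hypothetical sketch with made-up names, not the branch's actual code): precompress one bounded-size chunk, verify it, write it out, and only then drop the original bytes it covered, so the in-memory window never grows.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct PartialResult {
    std::vector<uint8_t> precompressed;  // at most one buffer's worth of output
    size_t consumed_original_bytes;      // how much original input it covers
    bool verified;                       // recompression round-trip matched
};

// Placeholders standing in for the real handler and output logic.
PartialResult precompress_next_chunk(const std::vector<uint8_t>& pending) {
    return {pending, pending.size(), true};
}
void write_output(const std::vector<uint8_t>&) {}

// Driver loop: once a partial block is verified and written, the original
// bytes it covered are discarded, freeing room to keep reading from the
// (possibly non-seekable) input without exceeding the bounded buffer.
void drive(std::vector<uint8_t>& pending) {
    while (!pending.empty()) {
        PartialResult r = precompress_next_chunk(pending);
        if (!r.verified) break; // fall back: those bytes get emitted as-is
        write_output(r.precompressed);
        pending.erase(pending.begin(),
                      pending.begin() + static_cast<std::ptrdiff_t>(r.consumed_original_bytes));
    }
}
```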

Even better, with these changes I can support partial streams: if a stream is incomplete (imagine an interrupted download inside a hard drive image) or fails partway, we will be able to precompress as much of the stream as we can, up until it is truncated or the format handler fails. This will improve compression ratios and also speed things up, especially brute mode (the more data we actually manage to process, the less data we have to scan for streams).

I already have the changes needed for most of this for the deflate-based handlers and BZip2. I will probably only do it for Base64 as well; JPG/GIF/MP3 won't have this support, at least for v0.5.0, but hopefully those files shouldn't ever be big enough to be much of a problem.

So consider this an early changelog for v0.5.0:

- Support for piped/non-seekable input during precompression, using a bounded in-memory context instead of seeking.
- Partial stream support: precompress as much of a truncated or failing stream as possible.
- Implemented for the deflate-based handlers and BZip2, probably Base64 as well; not for JPG/GIF/MP3 yet.