Closed by Shnatsel 1 year ago
Could you try switching to 0.2.4? It provides a small but noticeable speed bump in decoding.
From my tests:
0.2.3: enwiki 154 Mb/s, tokio 238 Mb/s
0.2.4: enwiki 161 Mb/s, tokio 258 Mb/s
0.2.4 actually hurt this benchmark, going from 128ms to 136ms.
How do the 0.2.4 changes affect the JSON benchmark?
In retrospect, the use of Zlib in exr is probably not a great target for optimizing Zlib decoders, since its outputs are so small. This is not the typical way Zlib data is used. I'd argue the use in PNG or JSON is a more important use case.
I didn't test 0.2.3 on JSON, but speeds should be roughly the same, since all 0.2.4 adds is some branchless paths; it also removes redundant code and stays slightly longer in the hot loop.
the size of the chunks can be adjusted in an exr image, meaning you can crank it up while writing the image file. of course, any existing image might still be inefficient. the problem might rather be the default chunk size for zlib images. we could change that default size in exrs
No effect on the parallel version - apparently the main thread is the bottleneck.
can you explain how this happens? as the whole call to zlib is executed within a thread pool, I would expect no work to be done on the main thread - are there any global mutexes or some other single-thread operations in zune-inflate?
There are no mutexes or single-threaded optimizations in zune-inflate.
Multi-threaded implementations usually have a lot more things to worry about, which is why they rarely scale perfectly.
I believe that's what he means but I'll let him speak for himself.
Indeed, no zlib decompression happens on the main thread. Zune-inflate also has no mutexes or anything to do with multi-threading.
What I meant to say is that other things, done on the main thread, are probably what dominates the execution time. Let me explain.
In a single-threaded program, where all operations happen one after another, a change to any part of the program should result in a measurable change in the execution time. By contrast, if you measure the execution time of a multi-threaded program, you are only measuring the slowest thread; all others do not affect the total time at all. It doesn't matter if all the worker threads performing RLE or Zlib or whatever decompression exit in 10ms or 50ms if the main thread takes 100ms regardless.
So I thought that the main thread, which isn't doing any zlib work, is the slowest thread, and so the changes to Zlib would not be measurable in the parallel benchmark.
But since then I've run a benchmark (with no pixel format conversion) on a 16-core ARM system and saw parallel Zlib decompression go as low as 13ms, same as parallel RLE, while parallel Zlib and RLE have different timings on my 4-core x86 system. So my new hypothesis is that the change to Zlib decompression time, divided by 3 (the number of worker threads actually running on a 4-core system), is below the noise threshold; it would only amount to about 3ms given the single-threaded numbers.
thanks for your detailed response.
The process of reading a file in parallel is roughly as follows:
If we have any pixel conversion in the benchmark, or a slow byte source, or a slow insertion into the buffer, then parallel decompression might not change the benchmark. If none of those are in the benchmark, things will get interesting
as I'm not an expert in multi-threading, there should be room for improvement in the code. it's located in src/blocks/reader.rs, if you are interested.
maybe we can solve some issues by using async io? assuming we use an async executor with multiple threads instead of an explicit thread pool, we might not get multi-threaded reading, but we might reduce the time the threads spend waiting for each other?
No, async I/O should make things strictly worse. The existing interfaces for async I/O from the filesystem are orders of magnitude slower than the synchronous versions, nobody has made a good wrapper for io_uring yet (much less IOCP on Windows), and macOS straight up has no modern async I/O abstraction.
Async is only really relevant for web servers, and possibly clients that have to maintain a lot of idle connections. It doesn't make sense otherwise.
So I understand that once the metadata is read, you could read the file entirely in parallel? There's nothing stopping you from just assigning ranges of offsets into the file to worker threads and have those individual threads perform the actual reads as well as format conversions and decompression, is that correct?
Although they'd need to open a file descriptor for each thread, or perhaps use memory mapping so that you can simply pass slices to the worker threads and the actual reads would be handled implicitly by the OS. Actually, please don't use memory mapping: it's very spooky and will easily cause UB.
> So I understand that once the metadata is read, you could read the file entirely in parallel? There's nothing stopping you from just assigning ranges of offsets into the file to worker threads and have those individual threads perform the actual reads as well as format conversions and decompression, is that correct?
Yes, the openexr format allows it. the metadata knows the seek offset (the file byte index) of every chunk, so every chunk could be read in parallel, in theory.
But as you noticed, there are other technical limitations, unfortunately. As I understand it, multiple handles to the same file will never be faster than having one handle and seeking within it, because the bottleneck of reading a file is the bandwidth, not the cpu time.
It's not that straightforward. There is actually a lot to gain from parallel reads on modern machines.
If the file is already cached in memory (e.g. it was just written, or it was recently read), like in our benchmarks, then reading the file is just a memcpy(). Running memcpy() on a single thread is not enough to saturate memory bandwidth on a typical workstation or even a decent desktop; I doubt it's possible even on cheap laptops with single-channel memory shared with the GPU. Parallel reads will definitely help here.
If the file is not cached in memory, it will need to be fetched from disk. If the disk is an SSD, there will be big speedups from reading the file in parallel. HDDs benefit nowhere near as much, but parallel reads don't make things worse for them either - slightly better, in fact.
Finally, reading the entire file into memory is something like 3x slower than reading it into a small buffer chunk by chunk. You can test this by making two cat clones - one using std::fs::read and the other using a small fixed-size buffer. This is because reading the entire file up front before it is processed involves the following steps:
If you use a small buffer that fits into the CPU cache instead, you can skip steps 2 and 3. Memory has high read/write latency, so skipping these steps translates to significant improvements.
that's great to hear. I'm thrilled to find out whether we can make this happen: multi-threaded all the way, without being absolutely gigantic, given these insights
I suggest fixing #178, cutting a release (I can help write the release announcement), and then trying it. It shouldn't be too hard. I won't attempt such a refactor on my own since it requires understanding of the current code, which I lack, but I'm happy to advise on performance, etc.
Sounds good. Tomorrow is the day of the weekend where I hoped to have time, maybe I'll get to it
Hi, for multi-threading speedups you need to be quite careful.
I haven't looked at the code yet (but will soon); here are some tips that work for me:
- If the decompressed data has variable size, use split_at_mut to split the data into variable-sized chunks.
- Prevent as much sharing as possible: global sharing, mutexes and atomics are not your friends.
Have fun hacking around.
I too can help with some review.
thanks! we'll try that. shnatsel also already had the idea of opening multiple handles to the same file.
zune-inflate doesn't handle compression yet, only decompression. So we still need miniz_oxide for compression.
oh right, I misread the code, alright
I'll be working on compression in ~1-2 months, will ping this then
Thanks for taking the time to contribute, and for the patience :)
Another stab at #179 now that zune-inflate is fuzzed and roundtrip-fuzzed on CI.
Before:
After:
No effect on the parallel version - apparently the main thread is the bottleneck.