johannesvollmer / exrs

100% Safe Rust OpenEXR file library

Use zune-inflate for decompression, take 2 #183

Closed Shnatsel closed 1 year ago

Shnatsel commented 1 year ago

Another stab at #179 now that zune-inflate is fuzzed and roundtrip-fuzzed on CI.

Before:

test read_single_image_zips_non_parallel_rgba         ... bench: 121,828,074 ns/iter (+/- 130,167)

After:

test read_single_image_zips_non_parallel_rgba         ... bench: 113,364,196 ns/iter (+/- 93,985)

No effect on the parallel version - apparently the main thread is the bottleneck.

etemesi254 commented 1 year ago

Could you try switching to 0.2.4? It provides a small but noticeable speed bump in decoding.

From my tests:

enwiki: 0.2.3: 154 Mb/s, 0.2.4: 161 Mb/s

tokio: 0.2.3: 238 Mb/s, 0.2.4: 258 Mb/s

Shnatsel commented 1 year ago

0.2.4 actually hurt this benchmark, going from 128ms to 136ms.

Shnatsel commented 1 year ago

How do the 0.2.4 changes affect the JSON benchmark?

In retrospect, the use of Zlib in exr is probably not a great optimization target for Zlib decoders, since its outputs are so small. This is not the typical way Zlib data is used. I'd argue the use in PNG or JSON is a more important use case.

etemesi254 commented 1 year ago

I didn't test 0.2.3 on JSON, but speeds should be relatively the same, since all 0.2.4 adds is some branchless paths, removal of redundant code, and staying slightly longer in the hot loop.


johannesvollmer commented 1 year ago

the size of the chunks can be adjusted in an exr image, meaning you can crank it up while writing the image file. of course, any existing image might still be inefficient. the problem might rather be the default chunk size for zlib images. we could change that default size in exrs.

johannesvollmer commented 1 year ago

No effect on the parallel version - apparently the main thread is the bottleneck.

can you explain how this happens? as the whole call to zlib is executed within a thread pool, I would expect no work to be done on the main thread - are there any global mutexes or some other single-thread operations in zune-inflate?

etemesi254 commented 1 year ago

There are no mutexes or single-threaded operations in zune-inflate.

Multi threaded implementations usually have a lot more things to worry about and that's why they really never scale perfectly

I believe that's what he means but I'll let him speak for himself.


Shnatsel commented 1 year ago

Indeed, no zlib decompression happens on the main thread. Zune-inflate also has no mutexes or anything to do with multi-threading.

What I meant to say is that other things, done on the main thread, are probably what dominates the execution time. Let me explain.

In a single-threaded program, where all operations happen one after another, a change to any part of the program should result in a measurable change in the execution time. By contrast, if you measure the execution time of a multi-threaded program, you are only measuring the slowest thread; all others do not affect the total time at all. It doesn't matter if all the worker threads performing RLE or Zlib or whatever decompression exit in 10ms or 50ms if the main thread takes 100ms regardless.

So I thought that the main thread, that's not doing any zlib, is the slowest thread, and so the changes to Zlib will not be measurable on the parallel benchmark.

But since then I've run a benchmark (with no pixel format conversion) on a 16-core ARM system and saw parallel Zlib decompression go as low as 13ms, same as parallel RLE, while the parallel Zlib and RLE have different timings on my 4-core x86 system. So my new hypothesis is that the change to Zlib decompression time divided by 3 (number of worker threads actually running on a 4-core system) is below the noise threshold; it would only amount to 3ms given the single-threaded numbers.

johannesvollmer commented 1 year ago

thanks for your detailed response.

The process of reading a file in parallel is roughly as follows:

  1. The meta data is read strictly sequentially from the beginning of the byte source. This happens on the main thread.
  2. The rest of the byte source contains self-contained chunks of compressed pixels without meta data. The chunks contain pixels for different sections of the image. They are read either one after another, or possibly the file might be seeked if some chunks are not desired (e.g. mip maps). While the blocks are read from the file, they are already decompressed in a pool. Therefore, the operations happening on the main thread are only the reading of the compressed byte chunks from the source, and then, after a block has been decompressed on another thread, the pixel format conversion and the insertion of the pixels into the image buffer. Reading and conversion/insertion happens for each block, so it will happen interleaved on the main thread.

If we have any pixel conversion in the benchmark, or a slow byte source, or a slow insertion into the buffer, then parallel decompression might not change the benchmark. If there are none of those in the benchmark, things will get interesting.
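Roughly, that flow could be sketched like this. The names (`read_next_chunk`, `decompress`, `convert_and_insert`) are just placeholders, not the actual exrs code, and the real reader interleaves the reading with the insertion:

```rust
use std::sync::mpsc;
use std::thread;

struct CompressedChunk(Vec<u8>);
struct DecompressedBlock(Vec<u8>);

// stand-ins for the real reading, decompression and conversion code
fn read_next_chunk(index: usize) -> CompressedChunk { CompressedChunk(vec![0; index + 1]) }
fn decompress(chunk: CompressedChunk) -> DecompressedBlock { DecompressedBlock(chunk.0) }
fn convert_and_insert(block: DecompressedBlock, image: &mut Vec<u8>) { image.extend(block.0) }

fn main() {
    let chunk_count = 16;
    let (to_worker, worker_input) = mpsc::channel::<CompressedChunk>();
    let (to_main, main_input) = mpsc::channel::<DecompressedBlock>();

    // the "pool": decompression happens off the main thread
    // (the real reader uses several workers, one is enough to show the shape)
    let worker = thread::spawn(move || {
        for chunk in worker_input {
            to_main.send(decompress(chunk)).unwrap();
        }
    });

    // main thread: read the compressed chunks from the byte source...
    for index in 0..chunk_count {
        to_worker.send(read_next_chunk(index)).unwrap();
    }
    drop(to_worker); // no more chunks; lets the worker finish

    // ...and, interleaved with the reads in the real code, convert the decompressed
    // blocks to the target pixel format and insert them into the image buffer
    let mut image = Vec::new();
    for block in main_input {
        convert_and_insert(block, &mut image);
    }
    worker.join().unwrap();
}
```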

johannesvollmer commented 1 year ago

as I'm not an expert in multi threading, there should be room for improvement in the code. it's located in src/blocks/reader.rs if you are interested.

maybe we can solve some issues by using async io? assuming we use an async executor with multiple threads instead of using an explicit thread pool. we might not have multi threaded reading, but we might reduce the time that the threads wait for each other?

Shnatsel commented 1 year ago

No, async I/O should make things strictly worse. The existing interfaces for async I/O from the filesystem are orders of magnitude slower than the synchronous versions, nobody has made a good wrapper for io_uring yet (much less IOCP on Windows), and macOS straight up has no modern async I/O abstraction.

Async is only really relevant for web servers, and possibly clients that have to maintain a lot of idle connections. It doesn't make sense otherwise.

Shnatsel commented 1 year ago

So I understand that once the metadata is read, you could read the file entirely in parallel? There's nothing stopping you from just assigning ranges of offsets into the file to worker threads and have those individual threads perform the actual reads as well as format conversions and decompression, is that correct?

Although they'd need to open a file descriptor for each thread, or perhaps use memory mapping so that you can simply pass slices to the worker threads and the actual reads would be handled implicitly by the OS. Actually, please don't use memory mapping; it's very spooky and will easily cause UB.
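For illustration, a minimal sketch of the per-thread-handle approach could look like this. The file name and the offsets are made up; in exrs they would come from the chunk offset table in the meta data:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::thread;

fn main() -> std::io::Result<()> {
    // (offset, length) pairs of the chunks each worker should read (made-up values)
    let chunk_ranges: Vec<Vec<(u64, usize)>> = vec![
        vec![(0, 1024), (4096, 1024)],    // worker 0
        vec![(1024, 1024), (2048, 2048)], // worker 1
    ];

    let handles: Vec<_> = chunk_ranges.into_iter().map(|ranges| {
        thread::spawn(move || -> std::io::Result<Vec<Vec<u8>>> {
            // each worker opens its own handle, so the seeks don't interfere
            let mut file = File::open("image.exr")?;
            let mut chunks = Vec::new();
            for (offset, length) in ranges {
                let mut buffer = vec![0u8; length];
                file.seek(SeekFrom::Start(offset))?;
                file.read_exact(&mut buffer)?;
                // decompression and pixel format conversion would also happen here,
                // entirely on the worker thread
                chunks.push(buffer);
            }
            Ok(chunks)
        })
    }).collect();

    for handle in handles {
        let _chunks = handle.join().unwrap()?;
    }
    Ok(())
}
```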

johannesvollmer commented 1 year ago

So I understand that once the metadata is read, you could read the file entirely in parallel? There's nothing stopping you from just assigning ranges of offsets into the file to worker threads and have those individual threads perform the actual reads as well as format conversions and decompression, is that correct?

Yes, the openexr format allows it. the meta data knows the seek offset (the file byte index) of every chunk. Therefore, every chunk could be read in parallel, in theory.

But as you noticed, there are other technical limitations, unfortunately. As I understand it, multiple handles to the same file will never be faster than having one handle and seeking within that handle. The reason is that the bottleneck of reading a file is the bandwidth and not the CPU time.

Shnatsel commented 1 year ago

It's not that straightforward. There is actually a lot to gain from parallel reads on modern machines.

If the file is already cached in memory (e.g. it was just written, or it was recently read), like in our benchmarks, then reading the file is just a memcpy(). Running memcpy() on a single thread is not enough to saturate memory bandwidth on a typical workstation or even a decent desktop; I doubt it's possible even on cheap laptops with single-channel memory shared with the GPU. Parallel reads will definitely help here.

If the file is not cached in memory, it will need to be fetched from disk. If the disk is an SSD, there will be big speedups from reading the file in parallel. HDDs benefit nowhere near as much, but parallel reads also don't make things worse for them; slightly better, in fact.

Finally, reading the entire file into memory is something like 3x slower than reading it into a small buffer chunk by chunk. You can test this by making a cat clone - one version using std::fs::read and the other using a small fixed-size buffer. This is because reading the entire file up front, before it is processed, involves the following steps:

  1. Copy the data from the page cache (in RAM) to the CPU cache
  2. Write the CPU cache contents to RAM to make room for more data
  3. Once the chunk is needed, load it from memory again

If you use a small buffer that fits into the CPU cache instead, you can skip steps 2 and 3. Memory has high read/write latency, so skipping these steps translates to significant improvements.
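For reference, the two cat variants could be sketched roughly like this (the file name is just an example):

```rust
use std::fs::File;
use std::io::{self, Read, Write};

// variant 1: read the entire file into memory before writing it out
fn cat_whole_file(path: &str) -> io::Result<()> {
    let contents = std::fs::read(path)?;
    io::stdout().write_all(&contents)
}

// variant 2: stream through a small fixed-size buffer that fits in the CPU cache
fn cat_buffered(path: &str) -> io::Result<()> {
    let mut file = File::open(path)?;
    let mut buffer = [0u8; 64 * 1024];
    let mut stdout = io::stdout();
    loop {
        let bytes_read = file.read(&mut buffer)?;
        if bytes_read == 0 {
            return Ok(());
        }
        stdout.write_all(&buffer[..bytes_read])?;
    }
}

fn main() -> io::Result<()> {
    cat_whole_file("large_file.bin")?;
    cat_buffered("large_file.bin")
}
```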

johannesvollmer commented 1 year ago

that's great to hear. I'm thrilled to find out whether we can make this happen: multi-threaded all the way, without the memory usage being absolutely gigantic, given these insights.

Shnatsel commented 1 year ago

I suggest fixing #178, cutting a release (I can help write the release announcement), and then trying it. It shouldn't be too hard. I won't attempt such a refactor on my own since it requires understanding of the current code, which I lack, but I'm happy to advise on performance, etc.

johannesvollmer commented 1 year ago

Sounds good. Tomorrow is the day of the weekend where I hoped to have time, maybe I'll get to it.

etemesi254 commented 1 year ago

Hi, for multi-threading speedups you need to be quite careful.

Haven't looked at the code (but will soon), but here are some tips that work for me:

  1. If the decompressed data is going to end up in one big vector, send chunks to worker threads, use chunks_exact_mut, and absolutely no mutexes. If the decompressed data has a variable size, use split_at_mut for splitting the data into variable chunks (see the sketch below).

  2. Read the metadata, then get the chunks, and then send everything to the worker threads, i.e. let each worker thread reopen the file, re-read it from a given offset and do the processing.

Prevent as much sharing as possible: global sharing, mutexes, and atomics are not your friends.
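For example, a minimal sketch of tip 1, where each worker writes into its own disjoint slice of the output vector so no synchronization is needed (the "decompression" here is just a placeholder):

```rust
use std::thread;

fn main() {
    let worker_count = 4;
    let chunk_size = 1024;
    let mut output = vec![0u8; worker_count * chunk_size];

    thread::scope(|scope| {
        // chunks_exact_mut hands out non-overlapping &mut slices,
        // so each thread can write without any mutexes or atomics
        for (index, chunk) in output.chunks_exact_mut(chunk_size).enumerate() {
            scope.spawn(move || {
                // stand-in for decompressing chunk `index` into its slice
                chunk.fill(index as u8);
            });
        }
    });

    assert_eq!(output[chunk_size], 1);
}
```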

Have fun hacking around.

I too can help with some review.

johannesvollmer commented 1 year ago

thanks! we'll try that. shnatsel also already had the idea of opening multiple handles to the same file

Shnatsel commented 1 year ago

zune-inflate doesn't handle compression yet, only decompression. So we still need miniz_oxide for compression.

johannesvollmer commented 1 year ago

oh right, I misread the code, alright

etemesi254 commented 1 year ago

I'll be working on compression in ~1-2 months, will ping this then


johannesvollmer commented 1 year ago

Thanks for taking the time to contribute, and for the patience :)