Fully Parallel Block Reader/Decompressor/Interleaver

johannesvollmer commented 1 year ago

sketch some ideas

johannesvollmer commented 1 year ago

current approach is to spawn ALL the threads at once. each thread opens the file individually, and the thread pool size controls how many file handles are open simultaneously. however, this requires the closure which opens the file to be Sync, which might be a problem
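To make the `Sync` requirement concrete, here is a minimal std-only sketch of the idea, not the code from this PR. The function name `read_blocks_parallel`, the fixed `block_len`, and the `(offset, bytes)` result type are assumptions made for illustration. Each worker calls the shared `open` closure once, so the worker count bounds the number of open handles, and the closure has to be `Sync` because all workers borrow it.

```rust
use std::io::{self, Read, Seek, SeekFrom};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::thread;

/// Hypothetical helper: reads `block_len` bytes at each offset using `workers` threads.
/// Every worker calls `open` exactly once, so at most `workers` handles are open at a time.
/// The returned blocks arrive in arbitrary order.
pub fn read_blocks_parallel<R, F>(
    open: F,
    offsets: &[u64],
    block_len: usize,
    workers: usize,
) -> io::Result<Vec<(u64, Vec<u8>)>>
where
    R: Read + Seek,
    // `open` is borrowed by every worker thread, which is why it must be `Sync`
    F: Fn() -> io::Result<R> + Sync,
{
    let next = AtomicUsize::new(0); // index of the next unclaimed offset
    let (sender, receiver) = mpsc::channel::<io::Result<(u64, Vec<u8>)>>();

    thread::scope(|scope| {
        for _ in 0..workers.max(1) {
            let sender = sender.clone();
            let (open, next) = (&open, &next);

            scope.spawn(move || {
                let result = (|| -> io::Result<()> {
                    // one handle per worker, reused for all blocks this worker claims
                    let mut handle = open()?;
                    loop {
                        let index = next.fetch_add(1, Ordering::Relaxed);
                        let offset = match offsets.get(index) {
                            Some(&offset) => offset,
                            None => return Ok(()), // no work left
                        };

                        let mut block = vec![0u8; block_len];
                        handle.seek(SeekFrom::Start(offset))?;
                        handle.read_exact(&mut block)?;

                        if sender.send(Ok((offset, block))).is_err() {
                            return Ok(()); // receiver is gone, stop quietly
                        }
                    }
                })();

                // forward the first I/O error of this worker to the caller
                if let Err(error) = result {
                    let _ = sender.send(Err(error));
                }
            });
        }
    });

    drop(sender); // close the channel so that collecting below terminates
    receiver.into_iter().collect()
}
```

With a real file this could be called as `read_blocks_parallel(|| std::fs::File::open(&path), &offsets, block_len, 4)`. A closure that only borrows a path is `Sync`, but one that captures a non-`Sync` reader would be rejected by the compiler, which is the problem hinted at above.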

johannesvollmer commented 1 year ago

@Shnatsel does this look like a good approach to you? (pseudo code)

johannesvollmer commented 1 year ago

made the code compile somehow. required a lot of changes that i did not see coming. here's the benchmark with the current code (opening a new file handle per block, which means hundreds per thread, which is slow, as expected)

test read_single_image_uncompressed_rgba                ... bench:  16,190,040 ns/iter (+/- 1,368,617)
test read_single_image_uncompressed_rgba_fully_parallel ... bench:  41,465,800 ns/iter (+/- 8,681,578)

Shnatsel commented 1 year ago

Benchmarking reading to f16 instead of to f32 might be more interesting. That way we'll see the effects of the format conversion being parallelized.

If you send me the file or include the file in the repo I can profile the code to see why it's slow. Or you can profile it yourself as I described in the bounds checks article.

johannesvollmer commented 1 year ago

the file is not special. it's from the repo. I just needed a quick way to read from SSD, so I copied it to another drive (the repository happened to be on my HDD)

currently no conversion is done at all (and adding it will require some more code, which is why it's not there yet). but shouldn't it already be faster? right now, the same file is opened multiple times, and then a buffer is read from it once, but no work is actually done, it's sent to a black box immediately

the current state of the code represents the use case of reading an uncompressed file without conversion and without interleaving of pixels, which is a valid use case and can happen everywhere all_channels() is used

johannesvollmer commented 1 year ago

great to see your article online! nice read, and quite some surprising insights :o

johannesvollmer commented 1 year ago

Got some better performance by not cloning the file handle and using map_init from rayon instead. See this pseudo code:

use rayon::prelude::*;
use std::io::{Seek, SeekFrom};

chunk_offsets.into_par_iter()
    .map_init(
        // called approximately once for each thread, sometimes slightly more
        || open_new_file_handle(),

        // reuse the open file handle
        |file_handle, chunk_offset| {
            file_handle.seek(SeekFrom::Start(chunk_offset))?;
            process_chunk(file_handle)
        }
    )
    .try_for_each(|block| sender.send(block))?;

test read_single_image_uncompressed_rgba                      ... bench:  15,653,920 ns/iter (+/- 3,058,044)
test read_single_image_uncompressed_rgba_fully_parallel       ... bench:  40,487,920 ns/iter (+/- 5,594,998)
test read_single_image_uncompressed_rgba_fully_parallel_rayon ... bench:  28,722,050 ns/iter (+/- 6,439,878)

Note that the first benchmark does not use any parallelism at all, and does linear reads from file start through file end. These tests show that with the current code, opening multiple file handles is never faster than opening one file handle and reading from it linearly. Will it ever be possible to outperform that?

Update: When simulating some basic placeholder work, these are more realistic numbers:

test read_single_image_uncompressed_rgba                      ... bench:  17,717,370 ns/iter (+/- 4,937,336)
test read_single_image_uncompressed_rgba_fully_parallel       ... bench:  40,814,200 ns/iter (+/- 8,148,082)
test read_single_image_uncompressed_rgba_fully_parallel_rayon ... bench:  30,777,840 ns/iter (+/- 6,103,551)

Note that we cannot add any conversion to these benchmarks, as it would spoil the comparison: the old architecture performs conversion on the main thread, not on multiple threads. If we want to find out whether multiple file handles speed things up, conversion has to stay out, because otherwise we would not know whether a speedup is due to multiple file handles or due to multithreaded conversion.

Nevertheless, here are the numbers with compressed files:

rle
test decompress_parallel               ... bench:  21,020,970 ns/iter (+/- 3,801,018)
test fully_parallel_many_file_handles  ... bench:  46,940,340 ns/iter (+/- 5,820,639)
test fully_parallel_fewer_file_handles ... bench:  41,590,830 ns/iter (+/- 4,775,950)

zip1
test decompress_parallel               ... bench:  22,217,420 ns/iter (+/- 2,736,300)
test fully_parallel_many_file_handles  ... bench:  52,737,040 ns/iter (+/- 11,848,624)
test fully_parallel_fewer_file_handles ... bench:  35,923,730 ns/iter (+/- 5,911,908)

johannesvollmer commented 1 year ago

by the way, I'm pretty sure we can do parallel sample conversion with the current system right now. even if we can't find out how to implement the multiple file handles yet
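As a rough illustration of that idea, the sketch below assumes the raw bytes of a block were already read by a single reader and only spreads the f16 to f32 conversion over threads. Both function names are hypothetical, and the hand-written `f16_bits_to_f32` is a stand-in for a proper f16 library, not what exrs actually uses.

```rust
use std::thread;

/// Hypothetical stand-in for a real f16 library: decodes IEEE 754 half-precision bits.
pub fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = (bits >> 15) & 1;
    let exponent = ((bits >> 10) & 0x1f) as u32;
    let mantissa = (bits & 0x3ff) as u32;

    let magnitude = match exponent {
        // zero and subnormals: mantissa * 2^-24
        0 => mantissa as f32 * 2f32.powi(-24),
        // infinity and NaN
        31 => if mantissa == 0 { f32::INFINITY } else { f32::NAN },
        // normal numbers: rebias the exponent (127 - 15 = 112), widen the mantissa
        _ => f32::from_bits(((exponent + 112) << 23) | (mantissa << 13)),
    };

    if sign == 1 { -magnitude } else { magnitude }
}

/// Converts little-endian f16 samples to f32, splitting the work across `workers` threads.
/// The bytes are assumed to come from a single reader; a trailing odd byte is ignored.
pub fn convert_f16_to_f32_parallel(bytes: &[u8], workers: usize) -> Vec<f32> {
    let sample_count = bytes.len() / 2;
    let mut samples = vec![0.0f32; sample_count];
    if sample_count == 0 {
        return samples;
    }

    // samples per worker, rounded up so that every sample is covered
    let workers = workers.max(1);
    let per_worker = (sample_count + workers - 1) / workers;

    thread::scope(|scope| {
        let outputs = samples.chunks_mut(per_worker);
        let inputs = bytes.chunks(per_worker * 2);

        // each thread owns a disjoint output slice, so no locking is needed
        for (output, input) in outputs.zip(inputs) {
            scope.spawn(move || {
                for (sample, pair) in output.iter_mut().zip(input.chunks_exact(2)) {
                    *sample = f16_bits_to_f32(u16::from_le_bytes([pair[0], pair[1]]));
                }
            });
        }
    });

    samples
}
```

Because reading stays on one thread here, this keeps the linear file access of the existing reader and only parallelizes the conversion step.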

johannesvollmer commented 12 months ago

with f16 conversion speed that high, and no major speedup by using multiple file handles, closing this for now

johannesvollmer / exrs

Fully Parallel Block Reader/Decompressor/Interleaver #194