Closed: ripytide closed this 5 months ago
I've just realized this was attempted in #446 but was closed due to being slower than the original. After reading the PR, it looks like it used par_bridge, as image may not have had raw parallel iterators 4 years ago, which it now has. I'll do some benchmarks to test how my implementation compares to the non-parallel version.
Here are the results:
running 4 tests
test map::benches::bench_map_subpixels_100 ... bench: 9,856 ns/iter (+/- 1,715)
test map::benches::bench_map_subpixels_1000 ... bench: 1,047,116 ns/iter (+/- 42,256)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 26,220 ns/iter (+/- 14,109)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 375,886 ns/iter (+/- 170,319)
As can be seen, which function is faster depends on the size of the image: 100x100 images are faster with the non-parallel version, whereas 1000x1000 images are faster with the parallel version.
My PC is slower than yours (probably all the other stuff I'm running), but I might have something to help with the inefficiencies at low sizes. Lots of time is lost on dispatching, so we need to dispatch less, by parallelizing on lines instead of pixels. Of course, there is no support for this from the image crate, so I had to improvise. Here are my results. Your impl on my PC:
test map::benches::bench_map_subpixels_100 ... bench: 16,226 ns/iter (+/- 189)
test map::benches::bench_map_subpixels_1000 ... bench: 2,021,782 ns/iter (+/- 138,602)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 31,478 ns/iter (+/- 3,454)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 685,178 ns/iter (+/- 105,519)
My impl on my PC:
test map::benches::bench_map_subpixels_100 ... bench: 16,320 ns/iter (+/- 214)
test map::benches::bench_map_subpixels_1000 ... bench: 2,001,935 ns/iter (+/- 114,785)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 15,008 ns/iter (+/- 2,176)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 634,393 ns/iter (+/- 158,002)
It is still much slower for very small images, so one could just default to sequential for images under 10k pixels:
test map::benches::bench_map_subpixels_10 ... bench: 200 ns/iter (+/- 12)
test map::benches::bench_map_subpixels_100 ... bench: 16,187 ns/iter (+/- 278)
test map::benches::bench_map_subpixels_1000 ... bench: 2,009,675 ns/iter (+/- 115,502)
test map::benches::bench_map_subpixels_parallel_10 ... bench: 5,757 ns/iter (+/- 1,303)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 15,283 ns/iter (+/- 1,232)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 699,168 ns/iter (+/- 148,572)
Here is my take on the function body:
{
    use itertools::Itertools;
    use rayon::iter::IndexedParallelIterator;
    use rayon::iter::ParallelIterator;
    use rayon::slice::{ParallelSlice, ParallelSliceMut};

    let (width, height) = image.dimensions();
    let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);

    // Parallelize over whole lines: one rayon task per row keeps
    // dispatch overhead low and memory access contiguous.
    image
        .par_chunks(width as usize * <P as Pixel>::CHANNEL_COUNT as usize)
        .zip_eq(
            out.par_chunks_mut(
                width as usize * <ChannelMap<P, S> as Pixel>::CHANNEL_COUNT as usize,
            ),
        )
        .for_each(|(input_line, output_line)| {
            // Within a line, walk the pixels sequentially.
            input_line
                .chunks_exact(<P as Pixel>::CHANNEL_COUNT as usize)
                .map(|v| <P as Pixel>::from_slice(v))
                .zip_eq(
                    output_line
                        .chunks_exact_mut(<ChannelMap<P, S> as Pixel>::CHANNEL_COUNT as usize)
                        .map(|v| <ChannelMap<P, S> as Pixel>::from_slice_mut(v)),
                )
                .for_each(|(input_pixel, output_pixel)| {
                    input_pixel
                        .channels()
                        .iter()
                        .zip(output_pixel.channels_mut())
                        .for_each(|(input_subpixel, output_subpixel)| {
                            *output_subpixel = f(*input_subpixel)
                        })
                });
        });
    out
}
This is more efficient the faster the mapped function is, which is great when it's just a sum or multiplication.
Great work, line based parallelism does seem like the best option for CPU cache since all the memory would be right next to each other. I wonder how these scale at 10000x10000.
I have a better iteration that is both faster for small values and more readable:
{
    use rayon::iter::IndexedParallelIterator;
    use rayon::iter::ParallelIterator;
    use rayon::slice::{ParallelSlice, ParallelSliceMut};

    let (width, height) = image.dimensions();
    let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);

    // Chunk size picked experimentally; see below.
    const SUBPIXEL_PER_CHUNK: usize = 512;

    image
        .par_chunks(SUBPIXEL_PER_CHUNK)
        .zip_eq(out.par_chunks_mut(SUBPIXEL_PER_CHUNK))
        .for_each(|(input_chunk, output_chunk)| {
            input_chunk.iter().zip(output_chunk.iter_mut()).for_each(
                |(input_subpixel, output_subpixel)| *output_subpixel = f(*input_subpixel),
            );
        });
    out
}
SUBPIXEL_PER_CHUNK is arbitrary, but 512 seems pretty good experimentally.
I've added you to my fork in case you want to update the implementation as we find faster versions.
> Great work, line based parallelism does seem like the best option for CPU cache since all the memory would be right next to each other. I wonder how these scale at 10000x10000.
Sadly, for me it doesn't seem to improve much beyond 3x. Might be my PC's fault:
test map::benches::bench_map_subpixels_10 ... bench: 204 ns/iter (+/- 2)
test map::benches::bench_map_subpixels_100 ... bench: 16,439 ns/iter (+/- 195)
test map::benches::bench_map_subpixels_1000 ... bench: 2,079,892 ns/iter (+/- 395,519)
test map::benches::bench_map_subpixels_10_000 ... bench: 222,443,410 ns/iter (+/- 20,580,424)
test map::benches::bench_map_subpixels_parallel_10 ... bench: 66 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 12,425 ns/iter (+/- 3,846)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 700,052 ns/iter (+/- 121,973)
test map::benches::bench_map_subpixels_parallel_10_000 ... bench: 61,087,340 ns/iter (+/- 5,226,359)
The values I get from my tests are really weird; I think there might be something strange going on. It'd be nice if you could test it on your machine. It says it's 10x faster for a 30 by 30 but only about 3x faster for a 1000 by 1000:
test map::benches::bench_map_subpixels_10 ... bench: 192 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_100 ... bench: 16,343 ns/iter (+/- 233)
test map::benches::bench_map_subpixels_1000 ... bench: 2,019,145 ns/iter (+/- 997,715)
test map::benches::bench_map_subpixels_30 ... bench: 1,425 ns/iter (+/- 47)
test map::benches::bench_map_subpixels_60 ... bench: 5,580 ns/iter (+/- 86)
test map::benches::bench_map_subpixels_parallel_10 ... bench: 65 ns/iter (+/- 0)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 13,283 ns/iter (+/- 1,725)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 728,398 ns/iter (+/- 118,186)
test map::benches::bench_map_subpixels_parallel_30 ... bench: 156 ns/iter (+/- 4)
test map::benches::bench_map_subpixels_parallel_60 ... bench: 2,558 ns/iter (+/- 1,018)
Which is extra weird, because for a 30 by 30 there should only be one thread running, so there shouldn't be any significant improvement.
Small question: why does map_subpixels_parallel only support ImageBuffer while all the other functions support GenericImage?
GenericImage doesn't require the image to be stored contiguously, whereas ImageBuffer does, since it implements Deref<Target = [P::Subpixel]>, which is needed for rayon methods like par_chunks(). Actually, on reading the source it looks like it implements Deref<[]> only when the Container also implements Deref<>, which is the case for Vec. Maybe we could expand it to GenericImage with some bound like Deref<[]>?
Well, I have bad news: all the speed increase is solely due to the fact that we're working with ImageBuffer and contiguous memory, and removing the parallel bits makes it all faster, including 1000 by 1000.
The benches on my machine (6-core Ryzen 5 4500U):
test map::benches::bench_map_subpixels_10 ... bench: 207 ns/iter (+/- 2)
test map::benches::bench_map_subpixels_100 ... bench: 16,271 ns/iter (+/- 145)
test map::benches::bench_map_subpixels_1000 ... bench: 1,725,553 ns/iter (+/- 107,609)
test map::benches::bench_map_subpixels_30 ... bench: 1,467 ns/iter (+/- 12)
test map::benches::bench_map_subpixels_60 ... bench: 5,660 ns/iter (+/- 26)
test map::benches::bench_map_subpixels_parallel_10 ... bench: 81 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 9,735 ns/iter (+/- 1,870)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 243,560 ns/iter (+/- 43,263)
test map::benches::bench_map_subpixels_parallel_30 ... bench: 168 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_60 ... bench: 7,178 ns/iter (+/- 1,279)
What happens if you replace the body with this?
{
    use itertools::Itertools;

    let (width, height) = image.dimensions();
    let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);

    // A single sequential pass over the zipped subpixels.
    image.iter().zip_eq(out.iter_mut()).for_each(
        |(input_subpixel, output_subpixel)| *output_subpixel = f(*input_subpixel),
    );
    out
}
test map::benches::bench_map_subpixels_10 ... bench: 208 ns/iter (+/- 2)
test map::benches::bench_map_subpixels_100 ... bench: 16,292 ns/iter (+/- 337)
test map::benches::bench_map_subpixels_1000 ... bench: 1,698,793 ns/iter (+/- 202,548)
test map::benches::bench_map_subpixels_30 ... bench: 1,464 ns/iter (+/- 19)
test map::benches::bench_map_subpixels_60 ... bench: 6,032 ns/iter (+/- 54)
test map::benches::bench_map_subpixels_parallel_10 ... bench: 58 ns/iter (+/- 0)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 1,035 ns/iter (+/- 9)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 384,083 ns/iter (+/- 22,183)
test map::benches::bench_map_subpixels_parallel_30 ... bench: 137 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_60 ... bench: 393 ns/iter (+/- 2)
Well, it seems the parallel version is slightly faster for big images.
Well, at least we've learnt that there is massive room for optimization via ImageBuffer over GenericImage, hehe.
Still a bit perplexed at how inefficient the parallelization is; there might be something else to it I'm missing.
From the cargo docs:
The --jobs argument affects the building of the benchmark executable but does not affect how many threads are used when running the benchmarks. The Rust test harness runs benchmarks serially in a single thread.
I think libtest may only be running the benchmarks single-threaded? Oh wait, I think that just means they are run sequentially, not that they are limited to one thread when executing.
The parallelization is happening for sure. This is all very fishy.
Here is what happens if I remove the with_min_len (a dramatic slowdown for small images) but use a more expensive computation as the input function (|x| f64::from(x).sqrt() as u8 + 1):
test map::benches::bench_map_subpixels_parallel_10 ... bench: 15,389 ns/iter (+/- 2,866)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 40,917 ns/iter (+/- 5,783)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 911,286 ns/iter (+/- 100,335)
test map::benches::bench_map_subpixels_parallel_10_000 ... bench: 79,796,480 ns/iter (+/- 8,970,528)
test map::benches::bench_map_subpixels_parallel_30 ... bench: 26,007 ns/iter (+/- 7,937)
test map::benches::bench_map_subpixels_parallel_60 ... bench: 33,506 ns/iter (+/- 7,006)
test map::benches::bench_map_subpixels_sequential_10 ... bench: 401 ns/iter (+/- 6)
test map::benches::bench_map_subpixels_sequential_100 ... bench: 37,682 ns/iter (+/- 605)
test map::benches::bench_map_subpixels_sequential_1000 ... bench: 4,110,240 ns/iter (+/- 190,140)
test map::benches::bench_map_subpixels_sequential_10_000 ... bench: 427,149,130 ns/iter (+/- 326,256,736)
test map::benches::bench_map_subpixels_sequential_30 ... bench: 3,537 ns/iter (+/- 572)
test map::benches::bench_map_subpixels_sequential_60 ... bench: 13,524 ns/iter (+/- 326)
The good value of MIN_SUBPIXEL_PER_CHUNK completely depends on the input function; if we had a function that took 1ms to compute, then the parallel version would be much faster even for a 10 by 10 image. Therefore, IMO, we should ship the parallel version without using with_min_len, explaining that it is slower than sequential for small images and fast functions, but can be significantly faster for large images (1 million pixels+) and slow functions.
I have also come to realize that chunking manually is actually pointless, because that's already what rayon does.
T_T
I thought I had run clippy on my machine.
I've de-duplicated the benchmark mapping function into the macro and switched map_subpixels_parallel() to be more similar to the rest of the mapping functions, using GenericImage but with a Deref<[]> bound. I tested benchmarks before and after the commits and it had no effect, as expected.
The benchmarks take forever (200s) on my machine, mainly due to the 10_000 functions (especially the sequential one); we'll probably have to remove those for the final PR.
makes sense
Thanks for the PR. I’ll wait until after the next release to merge this. Which will probably be next weekend as I don’t have much time on weekdays.
pub fn map_subpixels_parallel<P, Q, F>(image: &Image<P>, f: F) -> Image<Q>
where
    P: Pixel + Sync,
    P::Subpixel: Sync + Send,
    Q: Pixel,
    Q::Subpixel: Send,
    F: Fn(P::Subpixel) -> Q::Subpixel + Sync,
{
    use rayon::prelude::*;

    // Map every subpixel in parallel, collecting into a flat buffer.
    let buf = image
        .par_pixels()
        .flat_map(|p| p.channels())
        .copied()
        .map(|x| f(x))
        .collect();
    let (width, height) = image.dimensions();
    Image::from_vec(width, height, buf).unwrap()
}
I think all such functions should be in the image crate.
> I think all such functions should be in the image crate.
Interesting point. My opinion would be that anything that mutates or transforms images counts as image processing and so belongs in the imageproc crate, whereas the image crate is for image representation, encoding and decoding. Under this rule these functions fall more under imageproc than image.
That said, the line here is quite blurry. I recently learned that image has an imageops module with several image processing operations such as contrast() and brighten(), which is quite unsettling as I would think they belong in the imageproc crate.
The image crate existed before imageproc. Sorting out what goes where was one of the first issues on this repo… and we’ve pretty much just ignored it for the last nine years! https://github.com/image-rs/imageproc/issues/7
@ripytide this looks reasonable to me - let me know when it's in a reviewable state.
I think we should talk through #622 first since the outcome of that will decide how many of the function variations to implement here.
Okay, now that all the blockers have been resolved I've fleshed out the rest of this PR. Some other changes I made:
- Renamed red_channel() to into_red_channel() since that is clearer about the operation being done.
- Renamed as_red_channel() to from_red_channel() for the same reason as above; also, as_ functions are supposed to be cheap operations according to the Rust API guidelines.
- Changed map_pixels<I, P>(image: &I) where I: GenericImage<P> to map_pixels<P>(image: &Image<P>) since it provides substantial performance improvements, as it enforces contiguous storage. This is more consistent with other imageproc functions, which use Image<P>, and is more in line with the project's principle of performance over maximum genericness.

@theotherphil this is now ready for review.
The new code looks good and the signature change makes sense. I'm not convinced by the name changes though. You're right that as_ should be cheap, but there's also a standard convention that into_ and from_ consume an owned value, which isn't the case here.
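For reference, the conventions being weighed, sketched on a toy type (hypothetical names, not the imageproc API): as_ borrows cheaply, to_ may copy, and into_/from_ consume or construct an owned value.

```rust
struct Channel(Vec<u8>);

impl Channel {
    // as_*: cheap borrowing view; no allocation.
    fn as_bytes(&self) -> &[u8] {
        &self.0
    }

    // to_*: possibly expensive; returns an owned copy.
    fn to_vec(&self) -> Vec<u8> {
        self.0.clone()
    }

    // into_*: consumes self, converting it by value.
    fn into_vec(self) -> Vec<u8> {
        self.0
    }

    // from_*: constructs an owned value from its argument.
    fn from_vec(v: Vec<u8>) -> Self {
        Channel(v)
    }
}

fn main() {
    let c = Channel::from_vec(vec![1, 2, 3]);
    assert_eq!(c.as_bytes(), &[1, 2, 3]);
    assert_eq!(c.to_vec(), vec![1, 2, 3]);
    assert_eq!(c.into_vec(), vec![1, 2, 3]); // consumes c
}
```

The tension in the thread is exactly that the imageproc functions take &Image (borrowed) but the proposed names use the consuming into_/from_ prefixes.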
That's true, but it is the lesser evil in my opinion as users can see their signatures fairly easily from the docs. Unless you have an alternative suggestion?
One alternative: grayscale_from_red_channel() and rgb_from_red_grayscale()?
I’ll merge as is and think about what best to name these before the next release.
Thanks for the new functions, and it’s good to see your new doc macros already paying off.
This is just a draft so I can get some feedback on an initial implementation of map_subpixels_parallel() before I add the remaining variations of the function: map_subpixels_parallel_mut(), map_pixels_parallel(), and map_pixels_parallel_mut(). This would address #429.