Closed: ripytide closed this 5 months ago
I've just realized this was attempted in #446 but was closed due to being slower than the original. After reading the PR, it looks like it used par_bridge, as image may not have had raw parallel iterators 4 years ago, which it now has. I'll do some benchmarks to test how my implementation compares to the non-parallel version.
Here are the results:
running 4 tests
test map::benches::bench_map_subpixels_100 ... bench: 9,856 ns/iter (+/- 1,715)
test map::benches::bench_map_subpixels_1000 ... bench: 1,047,116 ns/iter (+/- 42,256)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 26,220 ns/iter (+/- 14,109)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 375,886 ns/iter (+/- 170,319)
As can be seen, which function is faster depends on the size of the image: 100x100 images are faster with the non-parallel version, whereas 1000x1000 images are faster with the parallel version.
My PC is slower than yours (probably all the other stuff I'm running), but I might have something to help with the inefficiencies at low sizes. Lots of time is lost on dispatching, so we need to dispatch less, by parallelizing on lines instead of pixels. Of course, there is no support for this from the image crate, so I had to improvise. Here are my results. Your impl on my PC:
test map::benches::bench_map_subpixels_100 ... bench: 16,226 ns/iter (+/- 189)
test map::benches::bench_map_subpixels_1000 ... bench: 2,021,782 ns/iter (+/- 138,602)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 31,478 ns/iter (+/- 3,454)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 685,178 ns/iter (+/- 105,519)
My impl on my PC:
test map::benches::bench_map_subpixels_100 ... bench: 16,320 ns/iter (+/- 214)
test map::benches::bench_map_subpixels_1000 ... bench: 2,001,935 ns/iter (+/- 114,785)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 15,008 ns/iter (+/- 2,176)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 634,393 ns/iter (+/- 158,002)
It is still much slower for very small images, so one could just default to sequential for images under 10k pixels:
test map::benches::bench_map_subpixels_10 ... bench: 200 ns/iter (+/- 12)
test map::benches::bench_map_subpixels_100 ... bench: 16,187 ns/iter (+/- 278)
test map::benches::bench_map_subpixels_1000 ... bench: 2,009,675 ns/iter (+/- 115,502)
test map::benches::bench_map_subpixels_parallel_10 ... bench: 5,757 ns/iter (+/- 1,303)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 15,283 ns/iter (+/- 1,232)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 699,168 ns/iter (+/- 148,572)
Here is my take on the function body:
{
    use itertools::Itertools;
    use rayon::iter::IndexedParallelIterator;
    use rayon::iter::ParallelIterator;
    use rayon::slice::{ParallelSlice, ParallelSliceMut};

    let (width, height) = image.dimensions();
    let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);

    // Parallelize over whole lines: one rayon task per row keeps
    // dispatch overhead low and memory access contiguous.
    image
        .par_chunks(width as usize * <P as Pixel>::CHANNEL_COUNT as usize)
        .zip_eq(
            out.par_chunks_mut(
                width as usize * <ChannelMap<P, S> as Pixel>::CHANNEL_COUNT as usize,
            ),
        )
        .for_each(|(input_line, output_line)| {
            // Within a line, walk the pixels sequentially.
            input_line
                .chunks_exact(<P as Pixel>::CHANNEL_COUNT as usize)
                .map(|v| <P as Pixel>::from_slice(v))
                .zip_eq(
                    output_line
                        .chunks_exact_mut(<ChannelMap<P, S> as Pixel>::CHANNEL_COUNT as usize)
                        .map(|v| <ChannelMap<P, S> as Pixel>::from_slice_mut(v)),
                )
                .for_each(|(input_pixel, output_pixel)| {
                    input_pixel
                        .channels()
                        .iter()
                        .zip(output_pixel.channels_mut())
                        .for_each(|(input_subpixel, output_subpixel)| {
                            *output_subpixel = f(*input_subpixel)
                        })
                });
        });
    out
}
This is more efficient the faster the mapped function is, which is great when it's just a sum or multiplication.
Great work, line based parallelism does seem like the best option for CPU cache since all the memory would be right next to each other. I wonder how these scale at 10000x10000.
I have a better iteration that is both faster for small values and more readable:
{
    use rayon::iter::IndexedParallelIterator;
    use rayon::iter::ParallelIterator;
    use rayon::slice::{ParallelSlice, ParallelSliceMut};

    let (width, height) = image.dimensions();
    let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);

    // Chunk size picked experimentally; see below.
    const SUBPIXEL_PER_CHUNK: usize = 512;

    image
        .par_chunks(SUBPIXEL_PER_CHUNK)
        .zip_eq(out.par_chunks_mut(SUBPIXEL_PER_CHUNK))
        .for_each(|(input_chunk, output_chunk)| {
            input_chunk.iter().zip(output_chunk.iter_mut()).for_each(
                |(input_subpixel, output_subpixel)| *output_subpixel = f(*input_subpixel),
            );
        });
    out
}
SUBPIXEL_PER_CHUNK is arbitrary, but 512 seems pretty good experimentally.
I've added you to my fork in case you want to update the implementation as we find faster versions.
> Great work, line based parallelism does seem like the best option for CPU cache since all the memory would be right next to each other. I wonder how these scale at 10000x10000.
Sadly, for me it doesn't seem to improve much beyond 3x. Might be my PC's fault:
test map::benches::bench_map_subpixels_10 ... bench: 204 ns/iter (+/- 2)
test map::benches::bench_map_subpixels_100 ... bench: 16,439 ns/iter (+/- 195)
test map::benches::bench_map_subpixels_1000 ... bench: 2,079,892 ns/iter (+/- 395,519)
test map::benches::bench_map_subpixels_10_000 ... bench: 222,443,410 ns/iter (+/- 20,580,424)
test map::benches::bench_map_subpixels_parallel_10 ... bench: 66 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 12,425 ns/iter (+/- 3,846)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 700,052 ns/iter (+/- 121,973)
test map::benches::bench_map_subpixels_parallel_10_000 ... bench: 61,087,340 ns/iter (+/- 5,226,359)
The values I get from my tests are really weird; I think there might be something strange going on. It'd be nice if you could test it on your machine. It says it's 10x faster for a 30 by 30 but only about 3x faster for a 1000 by 1000:
test map::benches::bench_map_subpixels_10 ... bench: 192 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_100 ... bench: 16,343 ns/iter (+/- 233)
test map::benches::bench_map_subpixels_1000 ... bench: 2,019,145 ns/iter (+/- 997,715)
test map::benches::bench_map_subpixels_30 ... bench: 1,425 ns/iter (+/- 47)
test map::benches::bench_map_subpixels_60 ... bench: 5,580 ns/iter (+/- 86)
test map::benches::bench_map_subpixels_parallel_10 ... bench: 65 ns/iter (+/- 0)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 13,283 ns/iter (+/- 1,725)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 728,398 ns/iter (+/- 118,186)
test map::benches::bench_map_subpixels_parallel_30 ... bench: 156 ns/iter (+/- 4)
test map::benches::bench_map_subpixels_parallel_60 ... bench: 2,558 ns/iter (+/- 1,018)
Which is extra weird, because for a 30 by 30 there should only be one thread running, so there shouldn't be any significant improvement.
Small question: why does map_subpixels_parallel only support ImageBuffer while all the other functions support GenericImage?
GenericImage doesn't require the image to be stored contiguously, whereas ImageBuffer does, since it implements Deref<Target = [P::Subpixel]>, which is needed for rayon methods like par_chunks(). Actually, on reading the source it looks like it implements Deref<[]> only when the Container also implements Deref<>, which is the case for Vec. Maybe we could expand it to GenericImage with some bound like Deref<[]>?
Well, I have bad news: all the speed increase is solely due to the fact that we're working with ImageBuffer and contiguous memory, and removing the parallel bits makes it all faster, including 1000 by 1000.
The benches on my machine (6-core Ryzen 5 4500U):
test map::benches::bench_map_subpixels_10 ... bench: 207 ns/iter (+/- 2)
test map::benches::bench_map_subpixels_100 ... bench: 16,271 ns/iter (+/- 145)
test map::benches::bench_map_subpixels_1000 ... bench: 1,725,553 ns/iter (+/- 107,609)
test map::benches::bench_map_subpixels_30 ... bench: 1,467 ns/iter (+/- 12)
test map::benches::bench_map_subpixels_60 ... bench: 5,660 ns/iter (+/- 26)
test map::benches::bench_map_subpixels_parallel_10 ... bench: 81 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 9,735 ns/iter (+/- 1,870)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 243,560 ns/iter (+/- 43,263)
test map::benches::bench_map_subpixels_parallel_30 ... bench: 168 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_60 ... bench: 7,178 ns/iter (+/- 1,279)
What happens if you replace the body with this?
{
    use itertools::Itertools;

    let (width, height) = image.dimensions();
    let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);

    // A single sequential pass over the zipped subpixels.
    image.iter().zip_eq(out.iter_mut()).for_each(
        |(input_subpixel, output_subpixel)| *output_subpixel = f(*input_subpixel),
    );
    out
}
test map::benches::bench_map_subpixels_10 ... bench: 208 ns/iter (+/- 2)
test map::benches::bench_map_subpixels_100 ... bench: 16,292 ns/iter (+/- 337)
test map::benches::bench_map_subpixels_1000 ... bench: 1,698,793 ns/iter (+/- 202,548)
test map::benches::bench_map_subpixels_30 ... bench: 1,464 ns/iter (+/- 19)
test map::benches::bench_map_subpixels_60 ... bench: 6,032 ns/iter (+/- 54)
test map::benches::bench_map_subpixels_parallel_10 ... bench: 58 ns/iter (+/- 0)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 1,035 ns/iter (+/- 9)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 384,083 ns/iter (+/- 22,183)
test map::benches::bench_map_subpixels_parallel_30 ... bench: 137 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_60 ... bench: 393 ns/iter (+/- 2)
Well, it seems the parallel version is slightly faster for big images.
Well, at least we've learnt that there is massive room for optimization via ImageBuffer over GenericImage, hehe.
Still a bit perplexed at how inefficient the parallelization is; there might be something else to it I'm missing.
From the cargo docs:
The --jobs argument affects the building of the benchmark executable but does not affect how many threads are used when running the benchmarks. The Rust test harness runs benchmarks serially in a single thread.
I think libtest may only be running the benchmarks single-threaded? Oh wait, I think that just means they are run sequentially, not that they are limited to one thread when executing.
The parallelization is happening for sure. This is all very fishy.
Here is what happens if I remove the with_min_len (a dramatic slowdown for small images) but use a more expensive computation as the input function (|x| f64::from(x).sqrt() as u8 + 1):
test map::benches::bench_map_subpixels_parallel_10 ... bench: 15,389 ns/iter (+/- 2,866)
test map::benches::bench_map_subpixels_parallel_100 ... bench: 40,917 ns/iter (+/- 5,783)
test map::benches::bench_map_subpixels_parallel_1000 ... bench: 911,286 ns/iter (+/- 100,335)
test map::benches::bench_map_subpixels_parallel_10_000 ... bench: 79,796,480 ns/iter (+/- 8,970,528)
test map::benches::bench_map_subpixels_parallel_30 ... bench: 26,007 ns/iter (+/- 7,937)
test map::benches::bench_map_subpixels_parallel_60 ... bench: 33,506 ns/iter (+/- 7,006)
test map::benches::bench_map_subpixels_sequential_10 ... bench: 401 ns/iter (+/- 6)
test map::benches::bench_map_subpixels_sequential_100 ... bench: 37,682 ns/iter (+/- 605)
test map::benches::bench_map_subpixels_sequential_1000 ... bench: 4,110,240 ns/iter (+/- 190,140)
test map::benches::bench_map_subpixels_sequential_10_000 ... bench: 427,149,130 ns/iter (+/- 326,256,736)
test map::benches::bench_map_subpixels_sequential_30 ... bench: 3,537 ns/iter (+/- 572)
test map::benches::bench_map_subpixels_sequential_60 ... bench: 13,524 ns/iter (+/- 326)
The good value of MIN_SUBPIXEL_PER_CHUNK completely depends on the input function; if we had a function that took 1ms to compute, then the parallel version would be much faster even for a 10 by 10 image. Therefore, IMO, we should ship the parallel version without using with_min_len, explaining that it is slower than sequential for small images and fast functions, but can be significantly faster for large images (1 million pixels+) and slow functions.
I have also come to realize that chunking manually is actually pointless, because that's already what rayon does.
T_T
I thought I had run clippy on my machine.
I've de-duplicated the benchmark mapping function into the macro and switched map_subpixels_parallel() to be more similar to the rest of the mapping functions, using GenericImage but with a Deref<[]> bound. I tested benchmarks before and after the commits and it had no effect, as expected.
The benchmarks take forever (200s) on my machine, mainly due to the 10_000 functions (especially the sequential one); we'll probably have to remove those for the final PR.
makes sense
Thanks for the PR. I’ll wait until after the next release to merge this. Which will probably be next weekend as I don’t have much time on weekdays.
pub fn map_subpixels_parallel<P, Q, F>(image: &Image<P>, f: F) -> Image<Q>
where
    P: Pixel + Sync,
    P::Subpixel: Sync + Send,
    Q: Pixel,
    Q::Subpixel: Send,
    F: Fn(P::Subpixel) -> Q::Subpixel + Sync,
{
    use rayon::prelude::*;

    // Map every subpixel in parallel, collecting into a flat buffer.
    let buf = image
        .par_pixels()
        .flat_map(|p| p.channels())
        .copied()
        .map(|x| f(x))
        .collect();
    let (width, height) = image.dimensions();
    Image::from_vec(width, height, buf).unwrap()
}
I think all such functions should be in the image crate.
> I think all such functions should be in the image crate.
Interesting point. My opinion would be that anything that mutates or transforms images counts as image processing and so belongs in the imageproc crate, whereas the image crate is for image representation, encoding and decoding. Under this rule these functions fall more under imageproc than image.
That said, the line here is quite blurry. I recently learned that image has an imageops module with several image processing operations such as contrast() and brighten(), which is quite unsettling as I would think they belong in the imageproc crate.
The image crate existed before imageproc. Sorting out what goes where was one of the first issues on this repo… and we’ve pretty much just ignored it for the last nine years! https://github.com/image-rs/imageproc/issues/7
@ripytide this looks reasonable to me - let me know when it's in a reviewable state.
I think we should talk through #622 first since the outcome of that will decide how many of the function variations to implement here.
Okay, now that all the blockers have been resolved I've fleshed out the rest of this PR. Some other changes I made:
- Renamed red_channel() to into_red_channel() since that is clearer about the operation being done.
- Renamed as_red_channel() to from_red_channel() for the same reason as above; also, as_ functions are supposed to be cheap operations according to the Rust API guidelines.
- Changed map_pixels<I, P>(image: &I) where I: GenericImage<P> to map_pixels<P>(image: &Image<P>) since it provides substantial performance improvements, as it enforces contiguous storage. This is more consistent with other imageproc functions, which use Image<P>, and is more in line with the project's principle of performance over maximum genericness.

@theotherphil this is now ready for review.
The new code looks good and the signature change makes sense. I'm not convinced by the name changes though. You're right that as_ should be cheap, but there's also a standard convention that into_ and from_ consume an owned value, which isn't the case here.
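For reference, the conventions being weighed, sketched on a toy type (hypothetical names, not the imageproc API): as_ borrows cheaply, to_ may copy, and into_/from_ consume or construct an owned value.

```rust
struct Channel(Vec<u8>);

impl Channel {
    // as_*: cheap borrowing view; no allocation.
    fn as_bytes(&self) -> &[u8] {
        &self.0
    }

    // to_*: possibly expensive; returns an owned copy.
    fn to_vec(&self) -> Vec<u8> {
        self.0.clone()
    }

    // into_*: consumes self, converting it by value.
    fn into_vec(self) -> Vec<u8> {
        self.0
    }

    // from_*: constructs an owned value from its argument.
    fn from_vec(v: Vec<u8>) -> Self {
        Channel(v)
    }
}

fn main() {
    let c = Channel::from_vec(vec![1, 2, 3]);
    assert_eq!(c.as_bytes(), &[1, 2, 3]);
    assert_eq!(c.to_vec(), vec![1, 2, 3]);
    assert_eq!(c.into_vec(), vec![1, 2, 3]); // consumes c
}
```

The tension in the thread is exactly that the imageproc functions take &Image (borrowed) but the proposed names use the consuming into_/from_ prefixes.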
That's true, but it is the lesser evil in my opinion as users can see their signatures fairly easily from the docs. Unless you have an alternative suggestion?
One alternative: grayscale_from_red_channel() and rgb_from_red_grayscale()?
I’ll merge as is and think about what best to name these before the next release.
Thanks for the new functions, and it’s good to see your new doc macros already paying off.
This is just a draft so I can get some feedback on an initial implementation of map_subpixels_parallel() before I add the remaining variations of the function: map_subpixels_parallel_mut(), map_pixels_parallel(), and map_pixels_parallel_mut(). This would address #429.