image-rs / imageproc

Image processing operations
MIT License

Add Rayon implementation `filter_clamped_parallel` for Kernel filter function (formerly `filter3x3`, ...) #608

Closed sunsided closed 3 months ago

sunsided commented 3 months ago

This adds a parallelized implementation of the filter_clamped function (formerly filter3x3).

ripytide commented 3 months ago

Normally we add parallel functions with a _parallel suffix rather than switching the behavior of the same function via a feature flag. Otherwise a user would never be able to use the single-threaded and multi-threaded versions in the same application. Also, cargo features are supposed to be additive, which this approach breaks.
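To illustrate the convention being described, here is a minimal, dependency-free sketch: the sequential and _parallel variants coexist as separate functions sharing one per-element kernel. The names (blur3, blur3_at) and the 1-D f32 buffer are hypothetical stand-ins, not imageproc's API, and std::thread::scope stands in for rayon so the sketch runs without external crates.

```rust
// Sequential variant: plain iteration over the shared per-element kernel.
fn blur3(input: &[f32]) -> Vec<f32> {
    (0..input.len()).map(|i| blur3_at(input, i)).collect()
}

// Parallel variant coexisting under a `_parallel` suffix. With rayon this
// body would just be a `par_iter` over the output; here two scoped threads
// split the buffer so the example stays std-only.
fn blur3_parallel(input: &[f32]) -> Vec<f32> {
    let mut out = vec![0.0; input.len()];
    let mid = out.len() / 2;
    let (lo, hi) = out.split_at_mut(mid);
    std::thread::scope(|s| {
        s.spawn(|| {
            for (i, o) in lo.iter_mut().enumerate() {
                *o = blur3_at(input, i);
            }
        });
        for (i, o) in hi.iter_mut().enumerate() {
            *o = blur3_at(input, mid + i);
        }
    });
    out
}

// Shared per-element kernel, clamping out-of-range indices to the edge.
// Assumes a non-empty input slice.
fn blur3_at(input: &[f32], i: usize) -> f32 {
    let l = input[i.saturating_sub(1)];
    let r = input[(i + 1).min(input.len() - 1)];
    (l + input[i] + r) / 3.0
}
```

Under this scheme a rayon cargo feature would only gate the existence of the _parallel functions, which keeps the feature additive: enabling it never changes the behavior of existing code.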

sunsided commented 3 months ago

That sounds like a solution to my problem. Will refactor it.

sunsided commented 3 months ago

@ripytide Rewrote it. There is now quite a bit of code duplication in the bigger driver functions filter and gradients, but I figured it's easier to judge this way, and there might be more optimization potential there as well.

I also added the _parallel functions to the test and benchmarks now. Some output from my machine to give an idea:

test filter::benches::bench_filter3x3_i32_filter                                      ... bench:   1,685,176.00 ns/iter (+/- 8,628.49)
test filter::benches::bench_filter3x3_parallel_i32_filter                             ... bench:     429,208.09 ns/iter (+/- 44,841.39)
test seam_carving::benches::bench_shrink_width_s100_r1                                ... bench:     339,531.20 ns/iter (+/- 3,855.30)
test seam_carving::benches::bench_shrink_width_s100_r4                                ... bench:   1,332,217.10 ns/iter (+/- 3,669.44)
test seam_carving::benches::bench_shrink_width_s100_r8                                ... bench:   2,611,614.40 ns/iter (+/- 14,979.44)
test seam_carving::benches::bench_shrink_width_parallel_s100_r1                       ... bench:     123,885.59 ns/iter (+/- 16,124.79)
test seam_carving::benches::bench_shrink_width_parallel_s100_r4                       ... bench:     502,462.13 ns/iter (+/- 54,026.64)
test seam_carving::benches::bench_shrink_width_parallel_s100_r8                       ... bench:     983,152.75 ns/iter (+/- 141,549.54)
ripytide commented 3 months ago

Currently this parallelizes each row of the image. In our discussions and experimentation in #602 we found that it may be worth simply parallelizing per pixel rather than per row: rayon does its own chunking of work to prevent excess dispatching, so there was little performance difference, and per-pixel parallelism scales more flexibly across images with different row sizes.

Also regarding the code duplication, would it be possible to extract the kernel code from both functions and simply call it in normal for loops for the single-threaded versions, and call it in the par_pixels() from the multi-threaded version?

sunsided commented 3 months ago

Good point! I hadn't even considered pixel-wise operations, due to a somewhat baseless concern about hypothetical cache-coherency issues; in other words I was essentially on the same page as that thought, with the known gotcha that row-wise parallelism degrades for certain image dimensions.

Would you want me to try anyway, or leave it as it is for the time being and improve on it later? My thought is that it is already much faster than the non-parallelized version, and with the sequential and parallel versions now both available, caller code can make an informed decision about which one to prefer. I'm fine with trying more (and will), but I'm not exactly a Rayon expert. 😅

Regardless of the parallelization, I'll definitely have a look at refactoring the code duplication away. ✌🏼

ripytide commented 3 months ago

As you say, having a parallel implementation at all is much more important than making it as fast as possible, since that can always be improved in the future. So I'd say just make the code as readable and de-duplicated as you can and don't worry too much about micro-benchmarking.

sunsided commented 3 months ago

Circling back to the earlier comments: I did try pixel-wise parallelization via a naive out_data.par_chunks_mut(num_channels) instead of out_data.par_chunks_mut(width as usize * num_channels), but it degraded performance compared to the row-wise version. I'll give it another thought, but that's probably for another time.
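For illustration, here is a sketch of the two chunking granularities being compared, written with the sequential chunks_mut; with rayon, each call would become par_chunks_mut with the same chunk size. The row-major, interleaved-channel buffer layout follows the discussion, but the "filter" is a hypothetical placeholder that just doubles each subpixel.

```rust
// One chunk per image row: coarse work items, fewer dispatches.
// This mirrors `out_data.par_chunks_mut(width * num_channels)`.
fn apply_rowwise(data: &mut [u8], width: usize, channels: usize) {
    for row in data.chunks_mut(width * channels) {
        for subpixel in row {
            *subpixel = subpixel.wrapping_mul(2);
        }
    }
}

// One chunk per pixel: finer work items, mirroring
// `out_data.par_chunks_mut(num_channels)`. Rayon still batches these
// internally, but the benchmark above found this variant slower here.
fn apply_pixelwise(data: &mut [u8], channels: usize) {
    for pixel in data.chunks_mut(channels) {
        for subpixel in pixel {
            *subpixel = subpixel.wrapping_mul(2);
        }
    }
}
```

Both traversals visit every subpixel exactly once; only the size of the work item handed to the scheduler differs.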

ripytide commented 3 months ago

Okay, I've finished the rewrite-from-scratch commit here, using an inner per-pixel filter function. According to my testing the new version is 1.02x faster sequentially and 1.2x faster in parallel, which is within the margin of error. It's also de-duplicated as discussed.

Do you want to pull that commit onto this PR or me to do a separate PR after this one is merged?

sunsided commented 3 months ago

@ripytide Cool! I merged it in here.

theotherphil commented 3 months ago

Thanks for this. The new functionality looks good. However, I'm not keen on reducing the flexibility of the existing filter function for the sake of saving a bit of code duplication with the new function. And I'm not sure what impact this refactoring has on the runtime performance of the non-parallel filter function. Please could you check the benchmarks before and after this change.

ripytide commented 3 months ago

> However, I'm not keen on reducing the flexibility of the existing filter

What do you mean by reducing the flexibility of filter(), it has an identical function signature?

Well except for f changing from FnMut(&mut Q::Subpixel, K) to Fn(K) -> Q::Subpixel which I would argue is a clearer function signature for what is essentially a mapping operation. And technically the bounds also changed from Q::Subpixel: Into<K> to K: From<Q::Subpixel> but I'm not sure if there's a difference between them.
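To make the signature difference concrete, here is a hypothetical sketch (not imageproc's actual code) of the two closure shapes: an in-place FnMut(&mut S, K) writer versus a mapping Fn(K) -> S, plus the relationship between the two trait bounds. Types are simplified to u8 subpixels and an i32 accumulator.

```rust
// Old shape: the caller's closure mutates the output subpixel in place.
fn finish_in_place<S, K, F>(out: &mut [S], acc: &[K], mut f: F)
where
    K: Copy,
    F: FnMut(&mut S, K),
{
    for (o, &a) in out.iter_mut().zip(acc) {
        f(o, a);
    }
}

// New shape: a pure mapping from accumulator to subpixel, which reads
// more clearly as "convert each accumulated value into an output value".
fn finish_by_mapping<S, K, F>(acc: &[K], f: F) -> Vec<S>
where
    K: Copy,
    F: Fn(K) -> S,
{
    acc.iter().map(|&a| f(a)).collect()
}

// On the bounds question: `K: From<S>` implies `S: Into<K>` via the
// standard blanket impl `impl<T, U: From<T>> Into<U> for T`, but not the
// other way round, so `K: From<S>` is the (slightly) stronger bound.
fn widen<S, K: From<S>>(s: S) -> K {
    K::from(s)
}
```

In practice both bounds are satisfied by the same concrete types whenever From is implemented, since the standard library only recommends implementing From and never Into directly.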

ripytide commented 3 months ago

According to my benchmarking the non-parallel version is within noise-margin of the speed of the previous implementation. Here is the raw benchmark comparison output:

 name                                                                          before  ns/iter  after  ns/iter  diff ns/iter   diff %  speedup 
 filter::benches::bench_box_filter                                             2,068,955        2,068,628               -327   -0.02%   x 1.00 
 filter::benches::bench_filter_clamped_i32_filter                              3,304,223        3,224,476            -79,747   -2.41%   x 1.02 
 filter::benches::bench_gaussian_f32_stdev_1                                   286,579          286,494                  -85   -0.03%   x 1.00 
 filter::benches::bench_gaussian_f32_stdev_10                                  1,384,830        1,386,037              1,207    0.09%   x 1.00 
 filter::benches::bench_gaussian_f32_stdev_3                                   527,508          526,958                 -550   -0.10%   x 1.00 
 filter::benches::bench_horizontal_filter                                      906,407          904,490               -1,917   -0.21%   x 1.00 
 filter::benches::bench_separable_filter                                       667,202          667,295                   93    0.01%   x 1.00 
 filter::benches::bench_vertical_filter                                        940,168          940,354                  186    0.02%   x 1.00 
theotherphil commented 3 months ago

> > However, I'm not keen on reducing the flexibility of the existing filter
>
> What do you mean by reducing the flexibility of filter(), it has an identical function signature?
>
> Well except for f changing from FnMut(&mut Q::Subpixel, K) to Fn(K) -> Q::Subpixel which I would argue is a clearer function signature for what is essentially a mapping operation. And technically the bounds also changed from Q::Subpixel: Into<K> to K: From<Q::Subpixel> but I'm not sure if there's a difference between them.

Yes, I meant this change to the function signature. I don’t feel too strongly - it's probably a pretty niche use case and if we do break anyone’s code I guess they’ll let us know. I’ll merge this after the merge conflict is resolved.

ripytide commented 3 months ago

I've rebased this in #642

sunsided commented 3 months ago

> I've rebased this in #642

Thank you! Should we close this PR here then?

ripytide commented 3 months ago

Yeah can do

sunsided commented 3 months ago

Closing in favor of #642