Closed theotherphil closed 5 months ago
This issue is to add an implementation of filter3x3 along the lines of what we had when this library was first written - a raw loop that averages the pixel values at 9 fixed offsets from the current coordinate - to check that we don’t accidentally end up with an implementation that’s slower than that for this use case.
The filter has become 4 times slower
malloc
for each pixel 😮💨
Not if the compiler optimizes it out, which I'm curious about seeing as it was the same speed as the previous implementation according my benchmarking. I'm hoping the Pixel
trait will get compile-time length info at some point which would allow us to remove the heap allocations.
One method would be if image
switches to rgb
's 0.9
release which I'm working on atm.
@cospectrum which versions are you comparing?
@cospectrum which versions are you comparing?
master
and 0.25
Ouch. I’ll take a look today. @ripytide has benchmarks showing no impact at https://github.com/image-rs/imageproc/pull/608#issuecomment-2122314598 so something is odd here.
Comparing v0.25.0 and current master (5e00bd1a6df330f7be76e556bce89c1c1080a40a) on a 2021 MacBook Pro with an M1 Max chip.
The function signatures and set of benchmarks have both changed across this range so I've
// For v0.25
#[bench]
fn bench_filter3x3_i32_filter(b: &mut Bencher) {
let image = gray_bench_image(500, 500);
#[rustfmt::skip]
let kernel: Vec<i32> = vec![
-1, 0, 1,
-2, 0, 2,
-1, 0, 1
];
b.iter(|| {
let filtered: ImageBuffer<Luma<i16>, Vec<i16>> =
filter3x3::<_, _, i16>(&image, &kernel);
black_box(filtered);
});
}
#[bench]
fn bench_filter5x5_i32_filter(b: &mut Bencher) {
let image = gray_bench_image(500, 500);
#[rustfmt::skip]
let kernel: Vec<i32> = vec![
-1, 0, 1, 2, 3,
-2, 0, 2, 3, 5,
-1, 0, 1, 6, 9,
-1, 0, 1, 6, 9,
-1, 0, 1, 6, 9
];
b.iter(|| {
let filtered: image::ImageBuffer<Luma<i16>, Vec<i16>> =
filter5x5::<_, _, i16>(&image, &kernel);
black_box(filtered);
});
}
#[bench]
fn bench_filter7x7_i32_filter(b: &mut Bencher) {
let image = gray_bench_image(500, 500);
#[rustfmt::skip]
let kernel: Vec<i32> = vec![
-1, 0, 1, 2, 3, -2, 1,
-2, 0, 2, 3, 5, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1
];
b.iter(|| {
let filtered: image::ImageBuffer<Luma<i16>, Vec<i16>> =
filter7x7::<_, _, i16>(&image, &kernel);
black_box(filtered);
});
}
// For master
#[bench]
fn bench_filter3x3_i32_filter(b: &mut Bencher) {
let image = gray_bench_image(500, 500);
#[rustfmt::skip]
let kernel: Vec<i32> = vec![
-1, 0, 1,
-2, 0, 2,
-1, 0, 1
];
let kernel = Kernel::new(&kernel, 3, 3);
b.iter(|| {
let filtered: image::ImageBuffer<Luma<i16>, Vec<i16>> =
filter_clamped::<_, _, i16>(&image, kernel);
black_box(filtered);
});
}
#[bench]
fn bench_filter5x5_i32_filter(b: &mut Bencher) {
let image = gray_bench_image(500, 500);
#[rustfmt::skip]
let kernel: Vec<i32> = vec![
-1, 0, 1, 2, 3,
-2, 0, 2, 3, 5,
-1, 0, 1, 6, 9,
-1, 0, 1, 6, 9,
-1, 0, 1, 6, 9
];
let kernel = Kernel::new(&kernel, 5, 5);
b.iter(|| {
let filtered: image::ImageBuffer<Luma<i16>, Vec<i16>> =
filter_clamped::<_, _, i16>(&image, kernel);
black_box(filtered);
});
}
#[bench]
fn bench_filter7x7_i32_filter(b: &mut Bencher) {
let image = gray_bench_image(500, 500);
#[rustfmt::skip]
let kernel: Vec<i32> = vec![
-1, 0, 1, 2, 3, -2, 1,
-2, 0, 2, 3, 5, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1
];
let kernel = Kernel::new(&kernel, 7, 7);
b.iter(|| {
let filtered: image::ImageBuffer<Luma<i16>, Vec<i16>> =
filter_clamped::<_, _, i16>(&image, kernel);
black_box(filtered);
});
}
Results:
// Master
filter::benches::bench_filter3x3_i32_filter ... bench: 2,113,838.52 ns/iter (+/- 146,344.38)
filter::benches::bench_filter5x5_i32_filter ... bench: 5,102,450.00 ns/iter (+/- 1,087,624.48)
filter::benches::bench_filter7x7_i32_filter ... bench: 9,306,050.00 ns/iter (+/- 1,542,055.76)
// v0.25
filter::benches::bench_filter3x3_i32_filter ... bench: 1,800,742.73 ns/iter (+/- 97,078.87)
filter::benches::bench_filter5x5_i32_filter ... bench: 2,032,997.93 ns/iter (+/- 444,828.73)
filter::benches::bench_filter7x7_i32_filter ... bench: 4,564,652.15 ns/iter (+/- 1,259,277.64)
So that's 1.17x, 2.55x and 2.04x slowdowns for 3x3, 5x5 and 7x7 kernels on 500x500 grayscale images.
Master now has benchmarks for 3x3, 5x5 and 7x7 images as standard, although with the image size reduced to 300x300, so these comparisons should be easier in future.
@theotherphil could you push your modified v0.25
branch so I can use that for my benchmark?
This branch has the new benchmarks plus filter5x5 and filter7x7: https://github.com/image-rs/imageproc/tree/v0.25.0-branch
I'll add some for RGB images too, in case the regressions are larger than than for grayscale.
RGB:
// v0.25
#[bench]
fn bench_filter3x3_i32_filter_rgb(b: &mut Bencher) {
let image = rgb_bench_image(500, 500);
#[rustfmt::skip]
let kernel: Vec<i32> = vec![
-1, 0, 1,
-2, 0, 2,
-1, 0, 1
];
b.iter(|| {
let filtered: ImageBuffer<Rgb<i16>, Vec<i16>> =
filter3x3::<_, _, i16>(&image, &kernel);
black_box(filtered);
});
}
#[bench]
fn bench_filter5x5_i32_filter_rgb(b: &mut Bencher) {
let image = rgb_bench_image(500, 500);
#[rustfmt::skip]
let kernel: Vec<i32> = vec![
-1, 0, 1, 2, 3,
-2, 0, 2, 3, 5,
-1, 0, 1, 6, 9,
-1, 0, 1, 6, 9,
-1, 0, 1, 6, 9
];
b.iter(|| {
let filtered: image::ImageBuffer<Rgb<i16>, Vec<i16>> =
filter5x5::<_, _, i16>(&image, &kernel);
black_box(filtered);
});
}
#[bench]
fn bench_filter7x7_i32_filter_rgb(b: &mut Bencher) {
let image = rgb_bench_image(500, 500);
#[rustfmt::skip]
let kernel: Vec<i32> = vec![
-1, 0, 1, 2, 3, -2, 1,
-2, 0, 2, 3, 5, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1
];
b.iter(|| {
let filtered: image::ImageBuffer<Rgb<i16>, Vec<i16>> =
filter7x7::<_, _, i16>(&image, &kernel);
black_box(filtered);
});
}
// Master
#[bench]
fn bench_filter3x3_i32_filter_rgb(b: &mut Bencher) {
let image = rgb_bench_image(500, 500);
#[rustfmt::skip]
let kernel: Vec<i32> = vec![
-1, 0, 1,
-2, 0, 2,
-1, 0, 1
];
let kernel = Kernel::new(&kernel, 3, 3);
b.iter(|| {
let filtered: image::ImageBuffer<Rgb<i16>, Vec<i16>> =
filter_clamped::<_, _, i16>(&image, kernel);
black_box(filtered);
});
}
#[bench]
fn bench_filter5x5_i32_filter_rgb(b: &mut Bencher) {
let image = rgb_bench_image(500, 500);
#[rustfmt::skip]
let kernel: Vec<i32> = vec![
-1, 0, 1, 2, 3,
-2, 0, 2, 3, 5,
-1, 0, 1, 6, 9,
-1, 0, 1, 6, 9,
-1, 0, 1, 6, 9
];
let kernel = Kernel::new(&kernel, 5, 5);
b.iter(|| {
let filtered: image::ImageBuffer<Rgb<i16>, Vec<i16>> =
filter_clamped::<_, _, i16>(&image, kernel);
black_box(filtered);
});
}
#[bench]
fn bench_filter7x7_i32_filter_rgb(b: &mut Bencher) {
let image = rgb_bench_image(500, 500);
#[rustfmt::skip]
let kernel: Vec<i32> = vec![
-1, 0, 1, 2, 3, -2, 1,
-2, 0, 2, 3, 5, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1,
-1, 0, 1, 6, 9, -2, 1
];
let kernel = Kernel::new(&kernel, 7, 7);
b.iter(|| {
let filtered: image::ImageBuffer<Rgb<i16>, Vec<i16>> =
filter_clamped::<_, _, i16>(&image, kernel);
black_box(filtered);
});
}
Results:
// Master
filter::benches::bench_filter3x3_i32_filter_rgb ... bench: 19,248,650.10 ns/iter (+/- 2,161,878.78)
filter::benches::bench_filter5x5_i32_filter_rgb ... bench: 26,113,808.30 ns/iter (+/- 1,566,443.61)
filter::benches::bench_filter7x7_i32_filter_rgb ... bench: 33,759,604.10 ns/iter (+/- 819,419.97)
// v0.25
filter::benches::bench_filter3x3_i32_filter_rgb ... bench: 2,414,054.20 ns/iter (+/- 51,805.40)
filter::benches::bench_filter5x5_i32_filter_rgb ... bench: 5,215,018.75 ns/iter (+/- 1,126,537.41)
filter::benches::bench_filter7x7_i32_filter_rgb ... bench: 10,263,666.60 ns/iter (+/- 364,188.05)
The regressions are much larger for RGB images.
test filter::benches::bench_bilateral_filter ... bench: 75,295,576 ns/iter (+/- 211,489)
test filter::benches::bench_box_filter ... bench: 2,070,778 ns/iter (+/- 12,442)
test filter::benches::bench_filter3x3_i32_filter ... bench: 3,099,127 ns/iter (+/- 24,920)
test filter::benches::bench_filter3x3_i32_filter_rgb ... bench: 3,375,630 ns/iter (+/- 131,017)
test filter::benches::bench_filter5x5_i32_filter ... bench: 2,390,178 ns/iter (+/- 92,977)
test filter::benches::bench_filter5x5_i32_filter_rgb ... bench: 9,743,942 ns/iter (+/- 164,547)
test filter::benches::bench_filter7x7_i32_filter ... bench: 4,318,360 ns/iter (+/- 12,310)
test filter::benches::bench_filter7x7_i32_filter_rgb ... bench: 16,803,183 ns/iter (+/- 749,961)
test filter::benches::bench_gaussian_f32_stdev_1 ... bench: 287,123 ns/iter (+/- 3,185)
test filter::benches::bench_gaussian_f32_stdev_10 ... bench: 1,384,883 ns/iter (+/- 10,495)
test filter::benches::bench_gaussian_f32_stdev_3 ... bench: 527,646 ns/iter (+/- 3,742)
test filter::benches::bench_horizontal_filter ... bench: 904,917 ns/iter (+/- 8,832)
test filter::benches::bench_separable_filter ... bench: 666,254 ns/iter (+/- 2,835)
test filter::benches::bench_vertical_filter ... bench: 940,404 ns/iter (+/- 4,227)
test filter::benches::bench_box_filter ... bench: 2,064,973 ns/iter (+/- 3,673)
test filter::benches::bench_filter_clamped_gray_3x3 ... bench: 1,160,258 ns/iter (+/- 42,470)
test filter::benches::bench_filter_clamped_gray_5x5 ... bench: 2,610,301 ns/iter (+/- 5,058)
test filter::benches::bench_filter_clamped_gray_7x7 ... bench: 4,679,867 ns/iter (+/- 16,102)
test filter::benches::bench_filter_clamped_parallel_gray_3x3 ... bench: 295,250 ns/iter (+/- 138,320)
test filter::benches::bench_filter_clamped_parallel_gray_5x5 ... bench: 601,325 ns/iter (+/- 53,330)
test filter::benches::bench_filter_clamped_parallel_gray_7x7 ... bench: 1,075,725 ns/iter (+/- 44,041)
test filter::benches::bench_gaussian_f32_stdev_1 ... bench: 295,152 ns/iter (+/- 1,174)
test filter::benches::bench_gaussian_f32_stdev_10 ... bench: 1,383,488 ns/iter (+/- 5,924)
test filter::benches::bench_gaussian_f32_stdev_3 ... bench: 527,800 ns/iter (+/- 2,275)
test filter::benches::bench_horizontal_filter ... bench: 908,493 ns/iter (+/- 9,717)
test filter::benches::bench_separable_filter ... bench: 670,450 ns/iter (+/- 17,449)
test filter::benches::bench_vertical_filter ... bench: 942,569 ns/iter (+/- 6,466)
@ripytide are your Master benchmarks using a 300x300 image or 500x500?
Whatever the master
branch is currently using, which looks like 300x300
Ah, so your master and 0.25 benchmarks aren’t comparing like for like as I confusingly switched from 500x500 to 300x300 images when I added the 5x5 and 7x7 filter benchmarks.
After reverting the use of filter_pixel
, the current implementation of filter_clamped
is still 3x slower than a simple hardcoded implementation for grayscale images and 3x3 kernels: https://github.com/image-rs/imageproc/pull/658
Benchmarks added. Created https://github.com/image-rs/imageproc/issues/664 to fix the performance issues with the current filter implementation.
It seems #646 adds an implementation for 3x3, 5x5, 7x7, this might be closable then. Or is there something I'm missing with this issue as compared to #643?