ripytide closed this pull request 5 months ago.
Still 4 times slower for me
Can you push a branch with the code you are running to get that result so I can investigate your use-case?
Okay, now that #654 has been merged this comparison becomes simpler. In the generic case (not exactly 3x3), this version reduces the code duplication and is simply faster.
Can you push a branch with the code you are running to get that result so I can investigate your use-case?
The same bench from https://github.com/image-rs/imageproc/pull/654#issuecomment-2133320142 using your branch
@cospectrum can you share your implementation of `filter3x3()` as well so I can try to reproduce your results, or are you using the `filter3x3_reference()`?
It's not "my" implementation. It's from 0.25.
Okay, I think I've tracked down the discrepancy between the imageproc benchmarks and your external benchmarks via a separate crate. I get the same results as you by default, but if you add the same build config as imageproc to your external crate, I get the same results as imageproc. I think that is what is causing your 4x regression.
```toml
[profile.release]
opt-level = 3
debug = true

[profile.bench]
opt-level = 3
debug = true
rpath = false
lto = false
debug-assertions = false
codegen-units = 1
```
```rust
#![feature(test)]
extern crate test;

use std::hint::black_box;
use test::Bencher;

const SIZE: u32 = 300;

fn main() {}

#[bench]
fn bench_025(b: &mut Bencher) {
    let image = black_box(imageproc_025::image::RgbImage::new(SIZE, SIZE));
    let kernel: Vec<i32> = (0..3 * 3).collect();
    b.iter(|| {
        let filtered = imageproc_025::filter::filter3x3::<_, _, i16>(&image, &kernel);
        black_box(filtered);
    });
}

#[bench]
fn bench_master(b: &mut Bencher) {
    let image = black_box(imageproc_master::image::RgbImage::new(SIZE, SIZE));
    let kernel: Vec<i32> = (0..3 * 3).collect();
    let kernel = imageproc_master::kernel::Kernel::new(&kernel, 3, 3);
    b.iter(|| {
        let filtered = imageproc_master::filter::filter_clamped::<_, _, i16>(&image, kernel);
        black_box(filtered);
    });
}
```
```
test bench_025    ... bench:   3,831,394 ns/iter (+/- 10,242)
test bench_master ... bench:   3,902,486 ns/iter (+/- 13,441)

test bench_025    ... bench:   2,179,298 ns/iter (+/- 16,869)
test bench_master ... bench:   7,558,955 ns/iter (+/- 30,916)

test bench_025    ... bench:   2,003,191 ns/iter (+/- 19,839)
test bench_master ... bench:   3,912,089 ns/iter (+/- 8,554)

test bench_025    ... bench:   1,990,703 ns/iter (+/- 10,605)
test bench_master ... bench:   3,253,459 ns/iter (+/- 11,167)
```
Now I suppose I just need to track down which specific build setting is making such a big difference, and why there is still a difference between `bench_025` and `bench_master` in the test with no settings on master.
When using `debug = false` the two implementations are both around 3,300,000 ns/iter, but `master_025` is still 2,000,000 ns/iter, which doesn't make any sense at all?
If I `cargo run --release` my bench (https://github.com/image-rs/imageproc/pull/654#issuecomment-2133320142) with 0.25 against master, master will have the same settings as your branch, and master will have the same speed as 0.25, while yours will be slower (by 4 times).
I'll try this PR vs the current master on my MacBook when I have time within the next few days.
@cospectrum what machine are you benchmarking on? I appear to be seeing much larger regressions on an Apple ARM chip than @ripytide is on their Ryzen x86.
M1
I think the setting which changes which implementation is faster (on my machine) is `codegen-units`:
Here I've benchmarked the default compile setting:

```toml
[profile.release]
codegen-units = 16
```

vs `codegen-units = 1`, which is used by the benchmarks:

```toml
[profile.release]
codegen-units = 1
```
```
test master_codegen=1  ... bench:   2,375,138 ns/iter (+/- 3,180)
test pr_codegen=1      ... bench:   1,974,065 ns/iter (+/- 6,905)
test master_codegen=16 ... bench:   2,489,427 ns/iter (+/- 10,247)
test pr_codegen=16     ... bench:   4,585,778 ns/iter (+/- 27,787)
```
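As a side note, if my reading of the Cargo docs is right, a single profile setting can also be overridden from the command line via Cargo's `CARGO_PROFILE_<name>_<setting>` environment variables, which makes bisecting which setting matters easier than editing `Cargo.toml` between runs (a sketch, assuming the benchmark name from this thread):

```shell
# Run the same benchmark under both codegen-units settings without
# touching Cargo.toml. Cargo reads profile overrides from environment
# variables of the form CARGO_PROFILE_<PROFILE>_<KEY>.
CARGO_PROFILE_BENCH_CODEGEN_UNITS=1 cargo +nightly bench filter_clamped_rgb
CARGO_PROFILE_BENCH_CODEGEN_UNITS=16 cargo +nightly bench filter_clamped_rgb
```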
I’ve not had much time in the last couple of weeks but I still intend to benchmark this branch against master on my MacBook.
```shell
git checkout master
git pull
cargo +nightly bench filter_clamped_rgb >> bench_master.txt

git fetch origin pull/656/head:pr656
git checkout pr656
cargo +nightly bench filter_clamped_rgb >> bench_pr656.txt
```
Results on master:
```
test filter::benches::bench_filter_clamped_rgb_3x3 ... bench:   1,190,189.60 ns/iter (+/- 33,991.58)
test filter::benches::bench_filter_clamped_rgb_5x5 ... bench:   2,903,974.95 ns/iter (+/- 10,175.80)
test filter::benches::bench_filter_clamped_rgb_7x7 ... bench:   5,348,075.00 ns/iter (+/- 38,132.12)
```
Results for this PR:
```
test filter::benches::bench_filter_clamped_rgb_3x3 ... bench:   1,762,945.85 ns/iter (+/- 57,415.80)
test filter::benches::bench_filter_clamped_rgb_5x5 ... bench:   3,940,433.40 ns/iter (+/- 10,009.45)
test filter::benches::bench_filter_clamped_rgb_7x7 ... bench:   7,041,208.30 ns/iter (+/- 22,404.16)
```
This still has large regressions, at least on an Apple ARM chip, so I'll close this PR.
I'll need to look at filter performance a bit anyway before the next release of this crate, to resolve https://github.com/image-rs/imageproc/issues/664.
Achieved by always computing with 4 channels, using an array of length 4, since const generic expressions are unstable.
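For context, the "always 4 channels" trick mentioned above can be sketched roughly as follows (a minimal illustration, not the actual imageproc code; the function and names here are hypothetical). The accumulator is a fixed `[i32; 4]`, so the inner loop has a compile-time constant bound regardless of the image's real channel count, sidestepping the need for the unstable `generic_const_exprs` feature:

```rust
// Hypothetical sketch: accumulate a weighted sum of pixels into a
// fixed-length-4 array. An RGB pixel is padded to 4 lanes; the unused
// lane simply stays zero, and the loop bound `4` is a constant.
fn weighted_sum_px(pixels: &[[u8; 4]], weights: &[i32]) -> [i32; 4] {
    let mut acc = [0i32; 4]; // always length 4, whatever the channel count
    for (px, &w) in pixels.iter().zip(weights) {
        for c in 0..4 {
            acc[c] += i32::from(px[c]) * w;
        }
    }
    acc
}

fn main() {
    // Two "RGB" pixels padded to 4 lanes, with weights 1 and 2.
    let pixels = [[1, 2, 3, 0], [4, 5, 6, 0]];
    let weights = [1, 2];
    let acc = weighted_sum_px(&pixels, &weights);
    assert_eq!(acc, [9, 12, 15, 0]);
}
```

The constant bound gives the compiler a fixed-size inner loop to unroll or vectorize, at the cost of some wasted work on the padding lane for 1- and 3-channel images.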