ripytide closed this pull request 5 months ago.
Still 4 times slower for me
Can you push a branch with the code you are running to get that result so I can investigate your use-case?
Okay, now that #654 has been merged this comparison becomes simpler. In the generic case (not exactly 3x3), this version reduces the code duplication and is simply faster.
Can you push a branch with the code you are running to get that result so I can investigate your use-case?
The same bench from https://github.com/image-rs/imageproc/pull/654#issuecomment-2133320142 using your branch
@cospectrum can you share your implementation of `filter3x3()` as well so I can try to reproduce your results, or are you using the `filter3x3_reference()`?
It's not "my" implementation. It's from 0.25.
Okay, I think I've tracked down the discrepancy between the imageproc benchmarks and your external benchmarks via a separate crate. I get the same results as you by default, but if you add the same build config as imageproc to your external crate, I get the same results as imageproc. I think that is what is causing your 4x regression.
```toml
[profile.release]
opt-level = 3
debug = true

[profile.bench]
opt-level = 3
debug = true
rpath = false
lto = false
debug-assertions = false
codegen-units = 1
```
```rust
#![feature(test)]
extern crate test;

use std::hint::black_box;
use test::Bencher;

const SIZE: u32 = 300;

fn main() {}

#[bench]
fn bench_025(b: &mut Bencher) {
    let image = black_box(imageproc_025::image::RgbImage::new(SIZE, SIZE));
    let kernel: Vec<i32> = (0..3 * 3).collect();
    b.iter(|| {
        let filtered = imageproc_025::filter::filter3x3::<_, _, i16>(&image, &kernel);
        black_box(filtered);
    });
}

#[bench]
fn bench_master(b: &mut Bencher) {
    let image = black_box(imageproc_master::image::RgbImage::new(SIZE, SIZE));
    let kernel: Vec<i32> = (0..3 * 3).collect();
    let kernel = imageproc_master::kernel::Kernel::new(&kernel, 3, 3);
    b.iter(|| {
        let filtered = imageproc_master::filter::filter_clamped::<_, _, i16>(&image, kernel);
        black_box(filtered);
    });
}
```
```
test bench_025    ... bench:   3,831,394 ns/iter (+/- 10,242)
test bench_master ... bench:   3,902,486 ns/iter (+/- 13,441)

test bench_025    ... bench:   2,179,298 ns/iter (+/- 16,869)
test bench_master ... bench:   7,558,955 ns/iter (+/- 30,916)

test bench_025    ... bench:   2,003,191 ns/iter (+/- 19,839)
test bench_master ... bench:   3,912,089 ns/iter (+/- 8,554)

test bench_025    ... bench:   1,990,703 ns/iter (+/- 10,605)
test bench_master ... bench:   3,253,459 ns/iter (+/- 11,167)
```
Now I suppose I just need to track down which specific build setting is making such a big difference, and why there is still a difference between `bench_025` and `bench_master` in the test with no settings on master.
When using `debug = false` the two implementations are both around 3,300,000 ns/iter, but `master_025` is still 2,000,000 ns/iter, which doesn't make any sense at all?
If I `cargo run --release` my bench (https://github.com/image-rs/imageproc/pull/654#issuecomment-2133320142) with 0.25 against master, master will have the same settings as your branch, and master will have the same speed as 0.25, while yours will be slower (by 4 times).
I'll try this PR vs the current master on my MacBook when I have time within the next few days.
@cospectrum what machine are you benchmarking on? I appear to be seeing much larger regressions on an Apple ARM chip than @ripytide is on their Ryzen x86.
M1
I think the setting which changes which implementation is faster (on my machine) is `codegen-units`:
Here I've benchmarked the default compile setting:

```toml
[profile.release]
codegen-units = 16
```

vs `codegen-units = 1`, which is used by the benchmarks:

```toml
[profile.release]
codegen-units = 1
```
```
test master_codegen=1  ... bench:   2,375,138 ns/iter (+/- 3,180)
test pr_codegen=1      ... bench:   1,974,065 ns/iter (+/- 6,905)
test master_codegen=16 ... bench:   2,489,427 ns/iter (+/- 10,247)
test pr_codegen=16     ... bench:   4,585,778 ns/iter (+/- 27,787)
```
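As a side note, if my reading of the Cargo docs is right, a single profile setting can also be overridden from the command line via Cargo's `CARGO_PROFILE_<name>_<setting>` environment variables, which makes bisecting which setting matters easier than editing `Cargo.toml` between runs (a sketch, assuming the benchmark name from this thread):

```shell
# Run the same benchmark under both codegen-units settings without
# touching Cargo.toml. Cargo reads profile overrides from environment
# variables of the form CARGO_PROFILE_<PROFILE>_<KEY>.
CARGO_PROFILE_BENCH_CODEGEN_UNITS=1 cargo +nightly bench filter_clamped_rgb
CARGO_PROFILE_BENCH_CODEGEN_UNITS=16 cargo +nightly bench filter_clamped_rgb
```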
I’ve not had much time in the last couple of weeks but I still intend to benchmark this branch against master on my MacBook.
```shell
git checkout master
git pull
cargo +nightly bench filter_clamped_rgb >> bench_master.txt

git fetch origin pull/656/head:pr656
git checkout pr656
cargo +nightly bench filter_clamped_rgb >> bench_pr656.txt
```
Results on master:
```
test filter::benches::bench_filter_clamped_rgb_3x3 ... bench:   1,190,189.60 ns/iter (+/- 33,991.58)
test filter::benches::bench_filter_clamped_rgb_5x5 ... bench:   2,903,974.95 ns/iter (+/- 10,175.80)
test filter::benches::bench_filter_clamped_rgb_7x7 ... bench:   5,348,075.00 ns/iter (+/- 38,132.12)
```
Results for this PR:
```
test filter::benches::bench_filter_clamped_rgb_3x3 ... bench:   1,762,945.85 ns/iter (+/- 57,415.80)
test filter::benches::bench_filter_clamped_rgb_5x5 ... bench:   3,940,433.40 ns/iter (+/- 10,009.45)
test filter::benches::bench_filter_clamped_rgb_7x7 ... bench:   7,041,208.30 ns/iter (+/- 22,404.16)
```
This still has large regressions, at least on an Apple ARM chip, so I'll close this PR.
I'll need to look at filter performance a bit anyway before the next release of this crate, to resolve https://github.com/image-rs/imageproc/issues/664.
Achieved by always computing with 4 channels, using an array of length 4, since const generic expressions are unstable.
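For context, the "always 4 channels" trick mentioned above can be sketched roughly as follows (a minimal illustration, not the actual imageproc code; the function and names here are hypothetical). The accumulator is a fixed `[i32; 4]`, so the inner loop has a compile-time constant bound regardless of the image's real channel count, sidestepping the need for the unstable `generic_const_exprs` feature:

```rust
// Hypothetical sketch: accumulate a weighted sum of pixels into a
// fixed-length-4 array. An RGB pixel is padded to 4 lanes; the unused
// lane simply stays zero, and the loop bound `4` is a constant.
fn weighted_sum_px(pixels: &[[u8; 4]], weights: &[i32]) -> [i32; 4] {
    let mut acc = [0i32; 4]; // always length 4, whatever the channel count
    for (px, &w) in pixels.iter().zip(weights) {
        for c in 0..4 {
            acc[c] += i32::from(px[c]) * w;
        }
    }
    acc
}

fn main() {
    // Two "RGB" pixels padded to 4 lanes, with weights 1 and 2.
    let pixels = [[1, 2, 3, 0], [4, 5, 6, 0]];
    let weights = [1, 2];
    let acc = weighted_sum_px(&pixels, &weights);
    assert_eq!(acc, [9, 12, 15, 0]);
}
```

The constant bound gives the compiler a fixed-size inner loop to unroll or vectorize, at the cost of some wasted work on the padding lane for 1- and 3-channel images.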