image-rs / imageproc

Image processing operations

optimisation of grayscale morphology operators #597

Closed Morgane55440 closed 6 months ago

Morgane55440 commented 6 months ago

i'd like to make a new PR focused on optimisation for the grayscale morphology operators now that #581 has been merged if that's ok.

theotherphil commented 6 months ago

Sounds good!

cospectrum commented 6 months ago

What kind of optimizations?

cospectrum commented 6 months ago

> i'd like to make a new PR focused on optimisation for the grayscale morphology operators now that #581 has been merged if that's ok.

I think you have a lot of cache misses here because you are accessing image pixels in column-major order.

cospectrum commented 6 months ago

So you can safely change the order of the elements in the Vec<(i16, i16)> to row-major. And I would change (i16, i16) to Point<i16> for readability.
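
For illustration, a minimal sketch of that row-major construction, assuming the offsets come from a rectangular mask (row_major_offsets is a hypothetical helper, not the existing constructor; Point is imageproc's point type):

use imageproc::point::Point;

// hypothetical helper: emit the mask offsets row by row (y outer, x inner),
// so that a later pass over the image reads pixels sequentially in memory
fn row_major_offsets(width: i16, height: i16) -> Vec<Point<i16>> {
    let mut offsets = Vec::with_capacity((width as usize) * (height as usize));
    for y in 0..height {
        for x in 0..width {
            offsets.push(Point::new(x, y));
        }
    }
    offsets
}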

Morgane55440 commented 6 months ago

i'm mostly aiming to improve the dilation and erosion (edit: but i'll look into the constructor too). my current plan is :

Morgane55440 commented 6 months ago

example of the potential use of SIMD instructions (this might not work at all, but i thought it might be worth testing):

00.00.00.00_00.00.00.00_11.11.11.11_11.11.11.11_11.11.11.11_11.11.11.11_00.00.00.00_00.00.00.00 -> mask

10.01.11.00_01.11.00.01_11.01.11.11_11.00.01.00_01.11.00.00_11.11.01.11_11.00.01.11_00.00.00.00 -> data

bitwise and -> 1 operation

00.00.00.00_00.00.00.00_11.01.11.11_11.00.01.00_01.11.00.00_11.11.01.11_00.00.00.00_00.00.00.00

reduce_max -> 1 operation? the nightly rust SIMD api and other sources mention reduce_max on packed bytes being a single cpu operation.

11.11.01.11 <- the output of the dilation

of course this is idealized greatly; there are many issues i haven't mentioned, and as SIMD is still nightly i can only nudge the compiler toward using these tools. but i know that in the right circumstances it is able to use SIMD instructions, so it might be worth looking into at least.
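
a minimal nightly-only sketch of the masked-max idea above, using the unstable portable SIMD API (trait paths are as of recent nightlies and may still move):

#![feature(portable_simd)] // nightly only
use std::simd::{num::SimdUint, u8x8};

// zero out pixels outside the structuring element with a bitwise AND,
// then take the horizontal max of the lanes
fn dilate_window(data: u8x8, mask: u8x8) -> u8 {
    let masked = data & mask; // mask lanes are 0x00 or 0xff
    masked.reduce_max() // horizontal max over all 8 lanes
}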

theotherphil commented 6 months ago

@Morgane55440 is this issue complete now or are you still considering the other optimisation ideas you mentioned above?

Morgane55440 commented 6 months ago

i think i still would like to try the bit_mask technique. honestly i don't expect it to actually be great, because the SIMD operation that is likely being used atm (probably something like _mm512_min_epu8), which does byte-by-byte min, is from what i understand faster than computing the minimum byte from a group of bytes, which was my original idea. but there still is a good chance it might actually be faster. i'll try to do some tests on my machine; if they're inconclusive i'll close.

cospectrum commented 6 months ago

Simd is architecture dependent; currently imageproc is targeting cross-platform CPUs, I think.

theotherphil commented 6 months ago

We can remain cross-platform using runtime detection of SIMD support. I've no idea how practical this will end up being, but there's potentially a lot of speedup available for some functions.
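
For reference, a rough sketch of that pattern on x86_64, using std's runtime feature-detection macro (the dilate_row_* functions are hypothetical placeholders, not imageproc APIs):

#[cfg(target_arch = "x86_64")]
fn dilate_row(input: &[u8], output: &mut [u8]) {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: the runtime check above guarantees avx2 is available
        unsafe { dilate_row_avx2(input, output) }
    } else {
        dilate_row_scalar(input, output)
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dilate_row_avx2(input: &[u8], output: &mut [u8]) {
    // the compiler may now use avx2 freely in this function; the body is
    // just the scalar logic to keep the sketch short
    dilate_row_scalar(input, output)
}

fn dilate_row_scalar(input: &[u8], output: &mut [u8]) {
    for (i, o) in input.iter().zip(output.iter_mut()) {
        *o = (*o).max(*i);
    }
}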

Morgane55440 commented 6 months ago

> Simd is architecture dependent; currently imageproc is targeting cross-platform CPUs, I think.

i think so too, and i'm not actually telling the compiler to use any SIMD at the moment (the standard SIMD module is still unstable). but aside from completely platform-independent things like removing unnecessary instructions, i have to make assumptions to guide optimizations, like how the cache works and what SIMD instructions are available. my understanding is that things like packed_min, packed_add, and packed_xor are very common across architectures, but reduce_min less so. of course these are complete assumptions; i don't consider them proof that one method is faster than another, only potential guides.

ultimately, i can only bench things on my machine. i'd love more targets but i'm not sure how to do that easily.

btw i just checked (that was completely unnecessary), and all modules which mention SIMD in core::arch provide a version of packed_min for u8 on their architecture (umin8 on risc, vec_min on powerpc, u8x16_min on wasm, vmin_u8 on arm, _mm256_min_epu8 on x86, vmin_u8 on aarch64), but reduce_min is basically unfindable; even though i heard some mention of it being an instruction, i can't find it.

Morgane55440 commented 6 months ago

the simd module aims to be

> a portable abstraction for SIMD operations that is not bound to any particular hardware architecture.

this would be great in the future when it is stable, but for now i'm relying on the compiler. the nightly API provides both packed_min (as simd_min) and reduce_min (as reduce_min), but my guess is it would fall back to non-SIMD operations on architectures that don't support them, which i would expect to happen more often for reduce_min than for simd_min.
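
to illustrate the difference, another nightly-only sketch (again, trait paths as of recent nightlies):

#![feature(portable_simd)] // nightly only
use std::simd::{cmp::SimdOrd, num::SimdUint, u8x16};

// packed vs horizontal: simd_min is one lane-parallel operation on most
// targets, while reduce_min has to combine all 16 lanes into one value
fn minima(a: u8x16, b: u8x16) -> (u8x16, u8) {
    (a.simd_min(b), a.reduce_min())
}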

of course my knowledge of hardware instructions is very limited, so i could be completely wrong

cospectrum commented 6 months ago

There is no portable simd, it's not ready.

Morgane55440 commented 6 months ago

if you reread my comments, you might notice that that is exactly what i've spent the last two comments saying.

i'm really sorry if i appear aggressive, that is not my intention, but i spent a long time trying to be as clear as possible in my messages and it feels like you just didn't read them.

cospectrum commented 6 months ago

> We can remain cross-platform using runtime detection of SIMD support. I've no idea how practical this will end up being, but there's potentially a lot of speedup available for some functions.

I don't believe there is a potential benefit. Also, runtime architecture detection is not cross-platform. In addition, testing will become difficult: CI will no longer cover the entire code. It will also affect readability and security. And just for example, there are processors with sse3 but without avx, processors with mmx but without sse, processors with sse2 but without sse3, and avx without avx512. That is, the code would have to be about 10 times larger than it is now to be fair to everyone.

Morgane55440 commented 6 months ago

Firstly: sorry for the delay, i'm swamped with deadlines atm and will continue to be for the near future.

secondly: having taken a step back, i see i have been making some passive-aggressive/outright aggressive comments. i'm really sorry for that; i'm going to work on my communication more. even the last one was pretty bad despite my intentions. communication is really harder than one would think.

Morgane55440 commented 6 months ago

> We can remain cross-platform using runtime detection of SIMD support. I've no idea how practical this will end up being, but there's potentially a lot of speedup available for some functions.

> I don't believe there is a potential benefit. Also, runtime architecture detection is not cross-platform. In addition, testing will become difficult: CI will no longer cover the entire code. It will also affect readability and security. And just for example, there are processors with sse3 but without avx, processors with mmx but without sse, processors with sse2 but without sse3, and avx without avx512. That is, the code would have to be about 10 times larger than it is now to be fair to everyone.

i agree with that; i don't think detecting specific hardware would be super useful for a hardware-agnostic library.

once simd goes stable (whenever that happens; it could take years for all i know), it might be worth looking into. but for now, i'll simply write code that nudges the compiler into using SIMD. things like:

for (input, output) in u8_slice_1.iter().zip(u8_slice_2.iter_mut())
{
    *output = (*output).min(*input);
}

are very likely to be compiled to SIMD whenever the compiler allows it on the given architecture and optimization flags are on.

for example, when i put the following code into a compiler explorer with "-O3" optimization (which as i understand it matches our bench and release optimization level; i'm not 100% sure it was targeting my architecture specifically, but it was some intel architecture), i found that it used pminub (a packed-min SIMD instruction) on the 16-byte registers:

pub fn main() {
    let mut n = 20usize;
    std::hint::black_box(&mut n); // hide the length from the optimizer
    let mut v = vec![0u8; n];
    let mut v2 = vec![0u8; n];
    std::hint::black_box(&mut v); // hide the contents so the loop isn't folded away
    std::hint::black_box(&mut v2);
    for (input, output) in v.iter().zip(v2.iter_mut())
    {
        *output = (*output).min(*input);
    }
    let x = v.into_iter().min();
    std::hint::black_box(&x); // keep the result live
}

of course, to be absolutely clear, these things would impact different architectures differently, and it would be impossible to predict them all. but modifying code in ways that make it obvious to the compiler that SIMD could be used could potentially lead to significant speedups. also, given that i'm going to look into it on my own, it doesn't matter much if i'm wrong.

theotherphil commented 6 months ago

@cospectrum there surely are some performance benefits to be had from manual vectorisation, but I agree that handling multiple architectures and differing processor support would likely complicate the code significantly.

@Morgane55440 please don’t feel any pressure to respond quickly to discussions on an open source library :-) And for what it’s worth your comments all come across as perfectly polite and reasonable to me.

As @Morgane55440 has landed some significant performance improvements already and it looks like they have no specific future changes in flight I’ll close this issue as complete.