iseahound / ImagePut

A core library for images in AutoHotkey. Supports AutoHotkey v1 and v2.
https://www.autohotkey.com/boards/viewtopic.php?f=83&t=76633
MIT License

Improve ImageSearch 2 #35

Open iseahound opened 6 months ago

iseahound commented 6 months ago

This is the second attempt at improving ImageSearch.

The following fixes a bug where the last row of images is not searched.

https://godbolt.org/z/oEoMW7zEK

iseahound commented 6 months ago

Use -O3 for the above script. For some reason it wasn't allocating variables correctly into registers and was instead using the stack, which is slower.

iseahound commented 6 months ago

https://godbolt.org/z/WPWjf8c96

martona commented 6 months ago

I don't know if you'd be interested in an external contribution. I'm working on an imagesearch replacement with more features (tolerance and % of pixels matched). I'm getting close to finishing it but I'm not dead set on releasing it as standalone.

I have 512 and 128-bit SIMD working, and also have the pedestrian base code that doesn't use vectors, even though that's not a strict necessity because every 64-bit CPU should have at least SSE2. Pretty happy with what I've got so far, at least with the speed.

I still need to write the 256-bit part but that should be easy with how modular the thing ended up being. The unknown is the CPUID stuff and sorting the computers into v4/v3/v2/v1, which largely describes the vectorization tier but also the gcc -march option used.

Anyway, lmk if there's any interest.

iseahound commented 6 months ago

Sure thing! The more approaches, the better. I'm still working on optimizing the basic ImageSearch functionality by avoiding the O(n²) subimage matching loop as much as possible.

iseahound commented 6 months ago

Saved work (notes to self):

iseahound commented 6 months ago

@martona There is a requirement for a "focused pixel" (an additional pixel in the center of the image or to be determined by some algorithm). See: https://godbolt.org/z/j6bvo1YPd

An explanation: Doing 2 pixel searches in parallel reduces the need to enter the subimage matching loop (for lack of a better term). Also, if the top-left pixel is transparent, it's necessary to have a visible pixel somewhere as a backup.
iseahound commented 6 months ago

Notes:

* https://github.com/iseahound/ImagePut/issues/33 - Completed optimizations using parallel PixelSearch before ImageSearch.
* https://github.com/iseahound/ImagePut/pull/36 - Adding variation to ImageSearch (wind0204)

martona commented 6 months ago

I'll tidy things up and give you access to a repo later today (i'm in the US, East), you can check it out for yourself.

My take on the C stuff you're doing... don't sweat optimizing it. I mean you can, it's fun, but vectors are so much faster. I don't have recent SSE2 numbers at hand, but AVX512 is an order of magnitude faster than what you can achieve by wrangling standard C code. SSE would be slower, probably half the speed of AVX (much of it because it's missing integer cmple/cmpge for some reason). 256-bit SIMD should be the sweet spot.

Just to give you a rough idea: my test haystack is 822x987, the needle (75x51) is found at 680x430 or so, and avx512 can do it in 2ms with the benchmark written in AHK and using ImagePutBuffers for both haystack and needle, so there's some overhead. If I'm not forcing a top-left pixel match (which is an option), meaning the subimage search is entered for every haystack pixel, it's still just 10ms.

But, to be fair, I ignore the A channel completely. I suppose handling it in a non-fancy way wouldn't slow things down much. I just never had the need for it. It's usually about finding stuff that came from the screen, on the screen.

martona commented 6 months ago

Great work on the ImagePutBuffer project btw, it's immensely useful and a joy to use.

iseahound commented 6 months ago

Thanks! I personally have zero use for ImageSearch/PixelSearch myself, so any work on it is purely for learning, academic, or for fun. With that said, one common use case is when the target image is a sprite (meaning there is a character surrounded by transparent pixels). By ignoring those regions, one can search for a non-square collection of pixels. (You can do `ImagePutBuffer({sprite: "cat.png"})` or `buf.TransColor()`.)

SIMD intrinsics are best when repetitive data can be streamed, such as in a loop. There's some outer logic which controls how often the subimage loop is entered / exited, and that can be highly optimized by the CPU. For example, if a branch is taken 90% vs 95% of the time, the 95% version can be over 2x faster because mispredictions are half as frequent. To me, optimization is like a layer cake: each level of complexity has different things that should be optimized. For example, the current ImageSearch + variation is unoptimized, because preprocessing can split the target image into high and low variation images, instead of encoding SIMD instructions to compute color + variation and color - variation on the fly.
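
The preprocessing idea can be sketched like this (a minimal illustration with made-up helper names, not ImagePut's actual code): split the needle once into per-channel saturated lo/hi images, so the inner loop only does two compares instead of recomputing color ± variation.

```c
#include <stddef.h>
#include <stdint.h>

// Saturating per-channel arithmetic so variation never wraps past 0 or 255.
static uint8_t sat_sub_u8(uint8_t a, uint8_t b) { return a > b ? (uint8_t)(a - b) : 0; }
static uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

// One-time preprocessing: build lo/hi copies of the needle.
// nbytes = 4 * pixel count for BGRA data.
static void split_needle(const uint8_t *needle, uint8_t *lo, uint8_t *hi,
                         size_t nbytes, uint8_t variation) {
    for (size_t i = 0; i < nbytes; i++) {
        lo[i] = sat_sub_u8(needle[i], variation);
        hi[i] = sat_add_u8(needle[i], variation);
    }
}

// The per-byte inner-loop test then reduces to a simple range check.
static int in_range(uint8_t hay_byte, uint8_t lo, uint8_t hi) {
    return lo <= hay_byte && hay_byte <= hi;
}
```

The range check vectorizes cleanly, which is presumably why the hi/lo split is the layer worth optimizing here.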

My testing showed 3.4-3.8x speedup for 128-bit registers, and maybe a 50% increase over that for 256 and maybe 20% for 512. It's wildly CPU dependent, with better results nearing the theoretical 2x for better hardware.

wind0204 commented 6 months ago

Awesome, I can't wait to see which SIMD instructions are being used. :D

martona commented 6 months ago

I see. Well, transparency can be handled trivially in the lo/hi mask precalculation phase where you can just set 0/0xffffffff for the masks in the locations that you want to ignore.
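
That mask trick might look like this in plain C (a hypothetical sketch, assuming BGRA pixels where alpha == 0 marks locations to ignore; exact-match only, no tolerance):

```c
#include <stddef.h>
#include <stdint.h>

// Sketch of the lo/hi precalculation martona describes: a transparent needle
// pixel gets lo = 0x00000000 and hi = 0xFFFFFFFF, so any haystack value
// passes the range test for free; opaque pixels require an exact match.
static void build_masks(const uint32_t *needle, uint32_t *lo, uint32_t *hi, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if ((needle[i] >> 24) == 0) {   // alpha is zero: wildcard slot
            lo[i] = 0x00000000u;
            hi[i] = 0xFFFFFFFFu;
        } else {                        // exact match (no tolerance here)
            lo[i] = needle[i];
            hi[i] = needle[i];
        }
    }
}

// For the exact/wildcard cases above, a whole-pixel range test suffices:
static int matches(uint32_t hay, uint32_t lo, uint32_t hi) {
    return lo <= hay && hay <= hi;
}

// Self-check: slot 0 is transparent (wildcard), slot 1 is opaque (exact).
static int build_masks_demo(void) {
    uint32_t needle[2] = { 0x00123456u, 0xFFABCDEFu };
    uint32_t lo[2], hi[2];
    build_masks(needle, lo, hi, 2);
    return lo[0] == 0u && hi[0] == 0xFFFFFFFFu
        && lo[1] == 0xFFABCDEFu && hi[1] == 0xFFABCDEFu
        && matches(0xDEADBEEFu, lo[0], hi[0]);
}
```

With per-channel tolerance the range test would need to run byte-wise instead of on whole 32-bit pixels, but the wildcard idea is the same.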

I sent you an invite to the repo with imagesearch. I just extracted it from the project I made it for, so it's pretty barebones re. tests and stuff.

iseahound commented 6 months ago

@martona It seems we have the exact same approach: https://github.com/iseahound/ImagePut/blob/master/source/pixelsearch2x.c which is fantastic.

From my understanding, you're currently searching for one image, correct? The approach with popcount/BitScanForward required a lot of extra C code on my end, so I simply allowed it to fall back to the "v1" approach when the mask returned positive.

`__m128i vmask = _mm_set1_epi32(0xFFFFFFFF);`

should this be -1 in my code?

iseahound commented 6 months ago

@wind0204

Some additional results:

https://godbolt.org/z/M6nofK9hs

martona commented 6 months ago

@iseahound no, -1/0xffffffff doesn't matter.

lzcnt is relatively new; with the GCC v4/v3... march buckets it's only in v4, same as AVX512. but bsf is just an extra subtraction to do the same thing. doing it without either is very costly; there's a pretty good C version in my code but it's a noticeable hit.

same for popcnt, it ends up being a whole lot of bit twiddling.

i think both are worth it though even if you have to fall back to mucking around with the bits, there's no way it's not faster than falling back to doing it byte-by-byte in C.
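
For reference, the classic scalar fallbacks look roughly like this (standard bit-twiddling idioms, not necessarily the exact versions in martona's repo):

```c
#include <stdint.h>

// SWAR popcount: counts set bits without the POPCNT instruction.
static unsigned popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
}

// De Bruijn count-trailing-zeros: a BSF substitute. Requires x != 0.
static int ctz32(uint32_t x) {
    static const int tab[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };
    // Isolate the lowest set bit, multiply by the de Bruijn constant,
    // and use the top 5 bits as a table index.
    return tab[((x & (0u - x)) * 0x077CB531u) >> 27];
}
```

Either is a handful of cheap ALU ops per call, which is indeed far better than walking the movemask result byte-by-byte.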

martona commented 6 months ago

@wind0204

I was going to invite you to the repo but github wasn't giving me any results when I put in your name in the invite collaborator box. Weird.

So I made the repo public. you can see it at https://github.com/martona/imgutil

iseahound commented 6 months ago

I think I'm hitting theoretical maximums with basic ImageSearch.

580.70 fps - imagesearch1 https://godbolt.org/z/M6nofK9hs
749.15 fps - pixelsearch1x https://github.com/iseahound/ImagePut/blob/master/source/pixelsearch1x.c

It's an SSE2 vs basic C comparison, so I think this works quite well. Next step is to vectorize.
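
A minimal sketch of what the vectorized step might look like (SSE2 only, assuming x86-64 and GCC/Clang builtins; this is an illustration, not the actual pixelsearch source):

```c
#include <emmintrin.h>   // SSE2, baseline on every x86-64 CPU
#include <stdint.h>

// Compare 4 BGRA pixels per iteration; use the movemask result to locate
// the first hit. Returns the index of the first pixel equal to `color`, or -1.
static long pixelsearch_sse2(const uint32_t *p, size_t n, uint32_t color) {
    __m128i needle = _mm_set1_epi32((int)color);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(p + i)); // unaligned load
        __m128i eq = _mm_cmpeq_epi32(v, needle);
        int mask = _mm_movemask_epi8(eq);   // 4 bits set per matching pixel
        if (mask)                           // GCC/Clang builtin; BSF on MSVC
            return (long)(i + ((unsigned)__builtin_ctz((unsigned)mask) >> 2));
    }
    for (; i < n; i++)                      // scalar tail for the leftovers
        if (p[i] == color) return (long)i;
    return -1;
}

// Self-check on a made-up 9-pixel haystack:
static int sse2_demo(void) {
    uint32_t hay[9] = { 1, 2, 3, 4, 5, 6, 7, 42, 9 };
    return pixelsearch_sse2(hay, 9, 42u) == 7
        && pixelsearch_sse2(hay, 9, 99u) == -1;
}
```

The same shape scales to AVX2/AVX-512 by widening the load and compare; only the movemask bookkeeping changes.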

iseahound commented 6 months ago

Will probably revisit this at a later date. Having trouble with the assembly: using -Os produces the best binaries, but the most-used variables seem to be put on the stack instead of in a register. Also not sure why -O3 insists on movdqa instead of movups when the latter instruction is one byte shorter.

wind0204 commented 6 months ago

> @wind0204
>
> Some additional results:
>
> * Replacing div with cmov didn't produce a speedup that was significant. In addition, it's the 3rd check after the focused pixel and the top-left pixel, so the additional code for `left` wasn't worth it.

I guess it didn't provide a performance boost because it was not in the subimage loop? I still believe that a CMP + CMOV (one constant) combination is better than a DIV (though I believe that without any profiling or benchmarking experience, and without knowing the truth).

I agree that a third check wouldn't affect the performance much; it was the first of two checks in the code I wrote.

> * I also tried removing the line `if (trans || c1 == *(current))` because it seemed redundant, but there was no difference in speed.

Does that mean it's better to remove the line for simpler code?

wind0204 commented 6 months ago

> 749.15 fps - pixelsearch1x https://github.com/iseahound/ImagePut/blob/master/source/pixelsearch1x.c

I wonder how much faster it would perform with movdqa instead of movdqu in pixelsearch1x.c, since I think it's worth 16-byte-aligning big haystack images ;-p (16 bytes is only 4 pixels, so it would be beneficial in almost all cases if the performance boost is meaningful).

iseahound commented 6 months ago

> Does that mean it's better to remove the line for simpler code?

Not sure, the idea is that it happens before the x-domain check. This has a side effect of hiding the div until it's needed. You're right that div is "slower", but only in a high performance scenario. Even when I try to trigger div more, the benchmarks don't show a difference here.

wind0204 commented 6 months ago

> > Does that mean it's better to remove the line for simpler code?
>
> Not sure, the idea is that it happens before the x-domain check. This has a side effect of hiding the div until it's needed. You're right that div is "slower", but only in a high performance scenario. Even when I try to trigger div more, the benchmarks don't show a difference here.

As you can see from the quote below, I was talking about removing the line `if (trans || c1 == *(current))`, not the CMOV instruction! And I think the code's simplicity does help, though not as much as throughput.

> * I also tried removing the line `if (trans || c1 == *(current))` because it seemed redundant, but there was no difference in speed.

Does that mean it's better to remove the line for simpler code?

iseahound commented 6 months ago

Yes. that's exactly what I meant.

wind0204 commented 6 months ago

> Yes. that's exactly what I meant.

Aha, you meant that the line lowers the frequency of the DIV operation, so removing the redundant operation doesn't really improve throughput. I'm sorry for my poor reading skills, and for lengthening the thread even further! It's embarrassing :(

wind0204 commented 6 months ago

> > 749.15 fps - pixelsearch1x https://github.com/iseahound/ImagePut/blob/master/source/pixelsearch1x.c
>
> I wonder how much faster it would perform with movdqa instead of movdqu in pixelsearch1x.c, since I think it's worth 16-byte-aligning big haystack images ;-p (16 bytes is only 4 pixels, so it would be beneficial in almost all cases if the performance boost is meaningful).

According to this article, it's worth a try. I'm sharing this just in case you're interested.

iseahound commented 6 months ago

That's a good catch! I didn't understand your previous comment before, but the article made it much clearer. I assumed the aligned version would be used - for some reason I thought loadu meant load unsigned.

wind0204 commented 6 months ago

I thought the image buffers were not aligned to 16 bytes, are they? According to the article I shared, if the data is already aligned, the performance of the unaligned vector move instructions is no worse on modern processors; ChatGPT 3.5 thinks it's worth using the aligned mov instructions, though.

iseahound commented 6 months ago

I'm not sure. They're allocated using Buffer, which should generate a nice alignment by default. There's actually working DirectX screen capture inside ImagePut I believe, I just never hooked it up to the callable APIs.

I have no recollection but apparently... I said:

> That's a good question. I suppose the answer, "that's just the way it is" doesn't really answer your question satisfactorily, so I went ahead and did some research. Turns out, in video memory (VRAM) DirectX uses something called High Level Shader Language (HLSL), which necessitates a 16-byte boundary. That means the screen width should be divisible by 4, because 1 pixel (ARGB) has 4 channels, and each color channel is 1 byte. So I made a mistake earlier when I said it should be divisible by 16; the screen width needs to be divisible by 4.
>
> As for why? It's probably faster to compute in blocks of 128 bits (16 bytes) because of how the video card is designed. Now keep in mind, when you're doing DirectX screen capture, what's really happening behind the scenes is a memory transfer from the video card (VRAM) to your computer's memory (RAM). And the really clever part about DirectX11 is that it doesn't copy the memory until you ask for it. That's why DX11 is so fast compared to BitBlt.
>
> TL;DR Alignment on memory = faster.

https://www.autohotkey.com/boards/viewtopic.php?t=92622

iseahound commented 6 months ago

`MFCreateAlignedMemoryBuffer`?

iseahound commented 6 months ago

Or just overallocate an extra 512 bits (for AVX512) and "start" the memory at that boundary.
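
That overallocation trick can be sketched as follows (hypothetical helper names; C11's `aligned_alloc` or MSVC's `_aligned_malloc` would do the same job):

```c
#include <stdint.h>
#include <stdlib.h>

// Pad the allocation by 64 bytes (512 bits, enough for AVX-512) plus room
// to stash the raw pointer, then round up to the next 64-byte boundary.
static void *aligned_malloc64(size_t size) {
    void *raw = malloc(size + 64 + sizeof(void *));
    if (!raw) return NULL;
    uintptr_t p = (uintptr_t)raw + sizeof(void *);
    p = (p + 63) & ~(uintptr_t)63;   // round up to a multiple of 64
    ((void **)p)[-1] = raw;          // remember the raw pointer for free()
    return (void *)p;
}

static void aligned_free64(void *p) {
    if (p) free(((void **)p)[-1]);
}

// Self-check: the returned pointer is non-null and 64-byte aligned.
static int align_demo(void) {
    void *p = aligned_malloc64(100);
    int ok = p != NULL && ((uintptr_t)p & 63) == 0;
    aligned_free64(p);
    return ok;
}
```

With the haystack aligned this way, the movdqa-vs-movdqu question from earlier in the thread becomes testable.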

wind0204 commented 6 months ago

> I still believe that a CMP + CMOV (one constant) combination is better than a DIV (though I believe that without any profiling or benchmarking experience, and without knowing the truth).

I've found a good document on the throughput (or rather the reciprocal throughput) of each instruction: https://www.agner.org/optimize/instruction_tables.pdf It's interesting to see FDIV's much higher throughput than the integer divisions, by the way.

CMOV is as quick as MOV, and CMP is very quick too, sometimes even quicker considering the throughput (1/0.25 > 1/0.3). Now I can say that I actually know the fact, not just believe it... :smile:
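
The DIV-vs-CMOV point discussed in the thread can be illustrated with coordinate bookkeeping (a hedged sketch; whether the compiler actually emits CMOV for the ternary depends on the target and optimization level):

```c
#include <stdint.h>

// Instead of recovering coordinates from a flat index with integer division
//   x = i % width;  y = i / width;   (DIV: tens of cycles)
// carry x alongside the index and reset it with a compare, which optimizing
// compilers typically lower to CMP + CMOV rather than a branch.
static uint32_t wrap_x(uint32_t x, uint32_t width) {
    x = x + 1;
    return (x == width) ? 0 : x;   // branchless candidate: cmp + cmov
}
```

As iseahound notes above, though, this only pays off when the div actually sits on the hot path; outside the subimage loop the benchmarks showed no difference.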