techyian / MMALSharp

C# wrapper to Broadcom's MMAL with an API to the Raspberry Pi camera.
MIT License

FrameDiffAnalyser optimizations #161

Closed MV10 closed 3 years ago

MV10 commented 3 years ago

Hi Ian, I mentioned I wanted to add an overlay mask to block out areas with irrelevant motion, like trees or even the sky (clouds move!). This led me to some optimizations I'd like to propose.

You know how it works, of course, but I'll recap to make the point -- the Analyse method converts the test image List<byte> buffer to an array, opens that up as a memory stream, then hands that off to GDI to create a new Bitmap. This is done over and over for every comparison frame. I'm proposing to offload that to the two places where a test frame is initially stored or updated. This should be a huge reduction in processing overhead per frame. Constantly disposing those resources probably has a fair bit of overhead as well.
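
Roughly what I'm picturing -- just a sketch, not the real FrameDiffAnalyser code, and the class and member names are invented:

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.IO;

// Sketch: build the test-frame Bitmap once, when the test frame is
// stored or refreshed, instead of rebuilding it for every comparison.
public class CachedTestFrame : IDisposable
{
    private MemoryStream _testStream;
    private Bitmap _testBitmap;

    // Called from the two places the test frame is stored/updated.
    public void SetTestFrame(List<byte> frameBytes)
    {
        Dispose(); // release the previous Bitmap/stream, if any

        // GDI+ needs the stream kept open for the Bitmap's lifetime,
        // so hold onto both and dispose them together.
        _testStream = new MemoryStream(frameBytes.ToArray());
        _testBitmap = new Bitmap(_testStream);
    }

    // Analyse reuses this for every comparison frame instead of the
    // List -> array -> MemoryStream -> Bitmap dance.
    public Bitmap TestBitmap => _testBitmap;

    public void Dispose()
    {
        _testBitmap?.Dispose();
        _testStream?.Dispose();
        _testBitmap = null;
        _testStream = null;
    }
}
```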

Those same processes are applied to the new comparison frame (unavoidable), so the GC as well as the large object heap are probably hammered constantly. There's not much to be done about that -- IIRC the LOH threshold is just 85K versus however large a raw frame is -- but cutting the allocations roughly in half certainly can't hurt.

Another possible related optimization is to not check for motion on every single frame. Some of my IP cameras only check for motion a few times per second (especially the ones far away on wifi that can only trickle out maybe 7 FPS). It's definitely scenario-specific and has to be tuned by the user, but it would be easy to implement alongside the current every-frame approach.
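
Something like this, as a sketch (names invented):

```csharp
// Sketch: only run the expensive comparison on every Nth frame.
// FramesPerCheck would be user-tunable; 1 keeps today's behaviour.
public class FrameSkipGate
{
    public int FramesPerCheck { get; set; } = 1;
    private int _counter;

    public bool ShouldCheck()
    {
        if (++_counter < FramesPerCheck) return false;
        _counter = 0;
        return true;
    }
}

// In the frame callback, something like:
//   if (_gate.ShouldCheck()) CheckForChanges();
```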

To circle back to my desire for an overlay mask, I haven't used GDI since the 90s (I ran a dev shop writing multimedia training software -- back when people still said "multimedia" and a CD burner was the price of a small car) but I'm pretty sure GDI can logical-AND bitmaps, so if the test frame processing is done once up front, that additional step could also be handled in the same pass. Otherwise, if GDI can't do it, it could be done similarly to the way CheckDiff works, pixel by pixel.

I'm less certain about this last part, but I also feel like there should be GDI operations to quickly combine the two frames in a single pass, so the CheckDiff threads would only be reading data from the combined image. That would greatly reduce the overhead of that step as well.

I'll wait until we talk through the PR I opened yesterday, but if any of this sounds reasonable, I'm willing to have a go.

MV10 commented 3 years ago

Almost forgot another idea I've been kicking around. You speculated (maybe in the wiki?) that more tiling may improve performance. I'm new to the Pi so I only have Pi 4Bs, but these CPUs only have four cores, and I'd imagine the OS always has at least one thread going, probably more, so of course those four Task.Run calls aren't truly parallel operations (well, probably not, never easy to say for sure). Given the high cost of context-switching, I'm not sure running more simultaneously would really help, although it would be interesting to test that. But if more doesn't actually make anything worse, there are several tricks that finer-grained tiling brings to the table.
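
If we did experiment with finer tiling, I'd picture capping the parallelism rather than spinning up one Task.Run per cell -- just a sketch, with cellHasMotion standing in for the per-cell diff:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TiledScan
{
    // Sketch: scan an N x N grid of cells with parallelism capped at
    // the core count, instead of one task per cell.
    public static int CountChangedCells(int gridSize, Func<int, int, bool> cellHasMotion)
    {
        int changed = 0;
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.For(0, gridSize * gridSize, options, index =>
        {
            int row = index / gridSize;
            int col = index % gridSize;
            if (cellHasMotion(row, col))
                Interlocked.Increment(ref changed);
        });

        return changed;
    }
}
```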

I was thinking about the way my DVR software does motion detection of my IP cam feeds, and it does use finer-grained tiling. I think it's 16 x 16, I'll have to go back and check. Of course, that's running on desktop hardware. I doubt it runs all 256 cells in parallel, but an i5 or i7 can certainly work harder than the Pi's ARM, so careful testing would definitely be important.

One way it uses these is for masking. The mask grid is even finer-grained than the motion detection grid, but if you completely mask a motion-detection cell, it simply never checks that cell at all. One of those slow cameras I have only motion-detects against a thin stripe across the middle of a walkway it faces. Nearly the entire scene is masked off, but I've never missed anyone walking past.

If the motion-detection grid is sufficiently fine-grained, you can introduce a rule that motion needs to be detected in a certain number of adjacent cells to trigger an event. My DVR does this so you can ignore smaller-than-people motion like the family dog (assuming you aren't into Great Danes, of course).
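
A crude sketch of that rule -- neighbour-counting only, where a real version might do proper connected-component grouping; the names are invented:

```csharp
public static class ClusterRule
{
    // Sketch: require motion in a cluster of adjacent cells before
    // raising an event. 'triggered' holds the per-cell diff results;
    // minCluster is user-tunable (e.g. 3 ignores single-cell blips).
    public static bool HasAdjacentCluster(bool[,] triggered, int minCluster)
    {
        int rows = triggered.GetLength(0);
        int cols = triggered.GetLength(1);

        for (int r = 0; r < rows; r++)
        {
            for (int c = 0; c < cols; c++)
            {
                if (!triggered[r, c]) continue;

                // Count this cell plus any triggered 4-neighbours.
                int cluster = 1;
                if (r > 0 && triggered[r - 1, c]) cluster++;
                if (r < rows - 1 && triggered[r + 1, c]) cluster++;
                if (c > 0 && triggered[r, c - 1]) cluster++;
                if (c < cols - 1 && triggered[r, c + 1]) cluster++;

                if (cluster >= minCluster) return true;
            }
        }
        return false;
    }
}
```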

Having many cells also potentially allows you to exit earlier. Today the Analyse method always processes all four quads, even if one quad has enough motion to trigger the event. A cancellation token here could help, but with many cells it would be even more beneficial.
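
Sketch of the early exit. Parallel's loop state gets most of the way there without wiring a token through everything (cellHasMotion is a stand-in again):

```csharp
using System;
using System.Threading.Tasks;

public static class EarlyExitScan
{
    // Sketch: stop scheduling remaining cells once one cell reports
    // enough motion; a CancellationToken would work similarly.
    public static bool DetectWithEarlyExit(int cellCount, Func<int, bool> cellHasMotion)
    {
        bool detected = false;

        Parallel.For(0, cellCount, (i, loopState) =>
        {
            if (cellHasMotion(i))
            {
                detected = true;
                loopState.Stop(); // skip cells not yet started
            }
        });

        return detected;
    }
}
```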

I've also been wondering if it would be helpful to use this to prioritize areas. So, for example, you might prioritize the cells along the left and right edges when watching a sidewalk or a road. The benefit is potentially faster response time, similar to the early-exit thought above. That may be hard to configure, however. Perhaps pre-defined patterns would help (center first, each edge, sides, top/bottom, etc).

These changes are quite a bit more ambitious, and probably not as important as the earlier message, but I wanted to throw them out there for discussion. It would certainly be interesting to try these things to see if any of it helps.

MV10 commented 3 years ago

A quick note to say that yesterday I played with some of the ideas in the first post and the results look pretty good. I don't even think it's necessary to do the Bitmap conversions or the lock/copy -- I'm thinking the List<byte> buffers ought to be directly usable. Does conversion to Bitmap buy us something that I'm overlooking?
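
To illustrate -- a diff straight off the raw buffers, no Bitmap anywhere. Just a sketch; the threshold semantics are invented:

```csharp
using System;

public static class RawDiff
{
    // Sketch: diff two raw RGB24 frames straight from their byte
    // buffers, no Bitmap or MemoryStream round-trip. Both buffers
    // must be the same length.
    public static int CountChangedPixels(byte[] testFrame, byte[] currentFrame, int threshold)
    {
        int changed = 0;
        for (int i = 0; i < testFrame.Length; i += 3) // 3 bytes per RGB24 pixel
        {
            int diff = Math.Abs(testFrame[i] - currentFrame[i])
                     + Math.Abs(testFrame[i + 1] - currentFrame[i + 1])
                     + Math.Abs(testFrame[i + 2] - currentFrame[i + 2]);
            if (diff > threshold) changed++;
        }
        return changed;
    }
}
```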

Just storing the test frame Bitmap cut memory usage by perhaps ~20MB on average, although CheckForChanges only gained about 5ms on average. The gains might be bigger on older-model Pis, though. More on that later when I've run through some other scenarios.

I guess GDI+ omits a lot of the stuff I used in the Win32 days, so the operations I had in mind like AND-combining aren't available, but that's no big deal. Anyway, I'm still working on different angles, but I'll have a look at your PR comments first, a bit later today.

MV10 commented 3 years ago

I didn't really have the time I feel that PR needs, so I continued looking into this.

Huge improvements, nearly 4X better throughput now.

I tested by averaging several runs at 30 seconds of motion detection with no actual detection events. Memory usage was an unscientific monitoring of pmap (sudo watch -n 0.1 'pmap $(pgrep dotnet) | tail -n 1'). I tested three versions:

1. the original (meaning the code in my open PR, so it has the test frame refreshes)
2. a version that stored the test frame Bitmap rather than rebuilding it from the List<byte>
3. a version that was nearly a rewrite -- no unsafe code, the test frame stored as a managed byte[], and generally simplifying the whole thing (hooray)

Memory usage across the three variations:

1. Original: 329MB - 382MB, average about 362MB
2. Cached Bitmap: 318MB - 368MB, average about 348MB
3. byte[] rewrite: 321MB - 362MB, average about 335MB

But more interesting is the throughput -- I timed the entire CheckForChanges and counted how often it executed:

1. Original: 97ms, 145 calls
2. Cached Bitmap: 92ms, 149 calls
3. byte[] rewrite: 25ms, 255 calls

I can't take much of the credit though. Part of that last leap was eliminating PrepareDifferenceImage and ApplyThreshold, which I think were probably leftovers from a different approach to motion detection. There was also no need for the MemoryStream stuff in Analyse since the motion detection input is always raw.

Memory-wise, those last two are also loading a motion mask (640x480x3bpp = 900K). Instead of modifying the images with the mask, I think I'll just read the mask byte array as part of CheckDiff, which should be much faster. I suppose in theory it could pre-scan the mask to look for rows/columns at the edges which are completely masked to reduce the CheckDiff scan area, too.

Anyway, pretty pleased with this, although I need to clean things up. I still won't confuse the open PR with all of this, I'll send it up separately once we're done with that one.

techyian commented 3 years ago

That all sounds really promising, thanks Jon. Looking forward to seeing your PR.

MV10 commented 3 years ago

Checking the mask inline with CheckDiff seems to consistently add just 1ms, completing 223 passes in 30 seconds. Trimming rows or columns off the loop based on fully-masked edges (which I faked for timing purposes) made no significant difference unless the masked areas are huge, so I think that can be slotted as a micro-optimization and ignored.

The mask BMP file has to match the resolution and BPP. All-black pixels are considered masked and the rest are ignored. As an interesting bonus, just using a real mask slightly increases performance as it skips the rgb1 / rgb2 reads and comparisons. My timings were done with an all-white mask, nothing actually masked, worst-case scenario.
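
The inline check is essentially this shape (a sketch extending the raw-buffer diff idea from earlier, not the actual PR code):

```csharp
using System;

public static class MaskedDiff
{
    // Sketch: the mask applied inline during the diff scan. The mask
    // is a raw buffer of the same resolution/BPP; an all-black pixel
    // (0,0,0) means "ignore this pixel".
    public static int CountChangedPixels(byte[] testFrame, byte[] currentFrame, byte[] mask, int threshold)
    {
        int changed = 0;
        for (int i = 0; i < testFrame.Length; i += 3)
        {
            // Masked pixels bail out before the frame reads, which is
            // why a real mask actually runs slightly faster.
            if (mask[i] == 0 && mask[i + 1] == 0 && mask[i + 2] == 0)
                continue;

            int diff = Math.Abs(testFrame[i] - currentFrame[i])
                     + Math.Abs(testFrame[i + 1] - currentFrame[i + 1])
                     + Math.Abs(testFrame[i + 2] - currentFrame[i + 2]);
            if (diff > threshold) changed++;
        }
        return changed;
    }
}
```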

MV10 commented 3 years ago

Ian, have you ever considered using OpenGL on the GPU for this? My wife and I did some game programming a few years ago, so my HLSL/GLSL is probably a bit rusty, but this is trivial in shader terms, especially at resolutions lower than shaders typically work with. Even though ~26ms is already getting pretty tight, I just realized the two images could be processed on a GPU in a fraction of the time we're spending here. The only part I'm unclear about is the relationship between the Pi GPU and the ISP. I know they're related; maybe all the other camera stuff is already tying up the GPU. (This just popped into my head, I haven't researched OpenGL support on the Pi or any of the related questions in the least...)

Edit: I had a look at OpenGL and/or Vulkan for the Pi -- not quite as mature yet as I'd hoped. Oh well.