techyian / MMALSharp

C# wrapper to Broadcom's MMAL with an API to the Raspberry Pi camera.
MIT License

Image processing performance #108

Closed · techyian closed this issue 3 years ago

techyian commented 4 years ago

Image processing speed on full-size frames is relatively slow. My current solution is to split the image into 4 rectangles and run a Task against each of them to perform the processing. It works well for small images and I noticed a significant speed improvement. I'm unsure how much further the image can be split before no additional improvement is seen.
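
The idea, as a rough sketch (width, height, frame, and the per-rectangle ProcessRegion worker here are placeholders, not actual library code):

using System.Drawing;
using System.Linq;
using System.Threading.Tasks;

// Split the frame into four quadrants and process each on its own Task.
var quads = new[]
{
    new Rectangle(0, 0, width / 2, height / 2),
    new Rectangle(width / 2, 0, width / 2, height / 2),
    new Rectangle(0, height / 2, width / 2, height / 2),
    new Rectangle(width / 2, height / 2, width / 2, height / 2),
};

var tasks = quads.Select(q => Task.Run(() => ProcessRegion(frame, q))).ToArray();
Task.WaitAll(tasks);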

MV10 commented 4 years ago

New Year's Day? You're as bad as I am. 😀

Does this refer to image processing such as the matrix operations, and could the Parallel changes made in the motion detection performance PR #171 also apply here?

techyian commented 4 years ago

:D I don't really follow the hype of New Year, I prefer Christmas to be honest. But yes, I'd like to replicate the performance work you've done for motion detection on the matrix operation code. Could there be a common place for this code? I'd rather not duplicate it in two places.

MV10 commented 4 years ago

Starting another pass on PR #175 to get FrameAnalyser ready for this. I think it's going to work well.

MV10 commented 4 years ago

Ian, in parts of the motion detection code we discussed the use of fields instead of properties for speed. I don't think I mentioned the specific numbers, but if you recall the tests where it completed motion detection in 6ms per frame, I accidentally (force of habit) set up a new value as a property and that alone slowed it to 12ms per frame. Changing it back to a field brought it back to 6ms. Property accessor method calls are relatively expensive in tight, performance-sensitive loops.

It occurs to me that MMALSharp's hot-path pattern is generally Apply(ImageContext context) in various interfaces, and I suspect performance could be improved throughout by simply changing ImageContext properties into fields. I certainly understand the Computer Science 101 theory of preferring properties over fields for public members, but in this library and on these devices, I also feel that breaking the rules might be warranted if it produces significant performance improvements. (And after all, that rule of thumb in C# is mostly based on the fact that interfaces can't declare fields, and ImageContext isn't represented anywhere by a separate interface definition.)
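
For illustration, the change is trivial; here it is with two members ImageContext really has (rest of the class omitted):

public class ImageContext
{
    // Before: auto-properties compile to accessor method calls,
    // which cost real time when hit once per pixel in a tight loop.
    // public byte[] Data { get; set; }
    // public bool Raw { get; set; }

    // After: plain public fields, direct access with no call overhead.
    public byte[] Data;
    public bool Raw;
}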

Not really the point of this thread, but it's still image processing performance, and I didn't think it necessarily warranted a separate discussion. Of course, benchmarking is needed if you're open to considering this. It's such a simple change, I'll probably start wondering about it to the point I just try it to see what happens. :)

MV10 commented 4 years ago

Minor issue with the wiki image processing examples. They show creating a new ImageContext (with a constructor argument that doesn't exist) as the second argument to Manipulate -- in the code the argument is actually an ImageFormat object. I think they should be something like this. Assuming that works when I start doing perf tests in a bit, I'll fix the v0.7 wiki.

imgCaptureHandler.Manipulate(context =>
{
    context.Apply(new SharpenProcessor());
}, ImageFormat.Jpeg);

Edit: Fixed. https://github.com/techyian/MMALSharp/wiki/Image-Processing

MV10 commented 4 years ago

Baseline elapsed times (Stopwatch start/stop around the ApplyConvolution calls), 10 runs of each at 1296 x 972 RGB24.

Sharpen: average 1085ms (min 1020ms, max 1234ms)
Blur 3x3: average 1066ms (min 1018ms, max 1116ms)
Blur 5x5: average 2372ms (min 2193ms, max 2530ms)
Edge-Hi: average 1078ms (min 1003ms, max 1118ms)

Of course, I didn't actually need to test all of them -- it's the matrix size that matters. Oh well. I wanted to see the output, too, I hadn't played with this part of the library at all yet. And I guess it's a validation of sorts that all the 3x3 matrix ops are very close to one another.
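
For reference, the harness is nothing fancy, just Stopwatch around each call (the ApplyConvolution call site shown here is simplified, not the exact signature):

using System.Diagnostics;

var sw = new Stopwatch();
for (int run = 1; run <= 10; run++)
{
    sw.Restart();
    ApplyConvolution(kernel, kernelWidth, kernelHeight, context); // simplified call
    sw.Stop();
    Console.WriteLine($"Run {run}: {sw.ElapsedMilliseconds}ms");
}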

MV10 commented 4 years ago

Having never played with still images in the library, I'm trying to figure out the best way to get a raw frame. The image encoder doesn't like RGB24 / RGB24. I had thought to use TakeRawPicture, but it doesn't invoke the Manipulate delegate -- should it?

var cam = GetConfiguredCamera();
cam.ConfigureCameraSettings();
using (var imgCaptureHandler = new ImageStreamCaptureHandler(ramdiskPath, "jpg"))
{
    imgCaptureHandler.Manipulate(context =>
    {
        context.Apply(new SharpenProcessor());
    }, ImageFormat.Jpeg);

    await cam.TakeRawPicture(imgCaptureHandler);
}
cam.Cleanup();

I also see that ConvolutionBase doesn't do Bitmap format conversion when the image is raw. Since we want to operate on raw data for all scenarios, we'll have to rethink that bit. Maybe just a convert-or-don't flag, or maybe even just passing null as the ImageFormat with the Manipulate delegate. (Edit: Now it checks ImageContext.StoreFormat and if null, it leaves the data raw.)
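
In other words, after that edit the caller decides; passing null as the second Manipulate argument becomes the "leave it raw" path (mirroring the example above):

imgCaptureHandler.Manipulate(context =>
{
    context.Apply(new SharpenProcessor());
}, null); // null StoreFormat: no Bitmap conversion, data stays raw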

Edit: It works if I set the camera encoding/subformat, including calling the delegate. But oddly it executes the delegate twice. I added a temporary line of code to dump ImageContext.Raw to the console from within ApplyConvolution and I get this:

Initializing...
Raw image? True
Raw image? True

I take that back. It's calling the delegate but it never completes. I guess that call is on another thread and the app shuts down before ApplyConvolution runs more than a couple of lines.

So I'm back to needing a good way to push a still raw frame into the handler. (Meanwhile, until The Expert arrives to advise me, I can just load a raw file for performance testing purposes, of course.)

MV10 commented 4 years ago

Not real-time but definitely faster.

I want to try a few more things, but after the initial conversion to Parallel.ForEach on a 32 x 32 grid:

3x3: average 384ms (min 374ms, max 383ms) ... 65% faster
5x5: average 942ms (min 928ms, max 954ms) ... 60% faster
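
The conversion looks roughly like this (cell bookkeeping simplified; ConvolveCell stands in for the real per-cell convolution work):

using System.Collections.Generic;
using System.Drawing;
using System.Threading.Tasks;

// Carve the frame into a grid of cells, then convolve the cells in parallel.
var cells = new List<Rectangle>(horizontalCells * verticalCells);
int cellWidth = width / horizontalCells;
int cellHeight = height / verticalCells;

for (int x = 0; x < horizontalCells; x++)
{
    for (int y = 0; y < verticalCells; y++)
    {
        cells.Add(new Rectangle(x * cellWidth, y * cellHeight, cellWidth, cellHeight));
    }
}

Parallel.ForEach(cells, cell => ConvolveCell(source, target, cell, kernel));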

MV10 commented 4 years ago

I should have thought of this earlier: 1296 x 972 is kind of a weird resolution ... it doesn't divide evenly with the 32 x 32 grid that I was carrying over from 640 x 480 motion detection. (The baseline timings are valid, though: it does divide evenly by 4.) Beyond that, the common divisors are all multiples of 3; a couple of reasonably large ones that divide both dimensions evenly are 27 and 54, producing 729 and 2916 cells, respectively.

Either choice introduces approximately the same performance hit (about 10ms for 3x3 and 35ms for 5x5), probably due more to processing an extra row and column per cell than anything else. From my more exhaustive tests with motion detection (video, so literally thousands of frames over the couple of days I worked on it), I think the 1000-cell range is probably the limit of what parallel processing can achieve before thread context-switching overhead cancels out the parallelism on the Pi 4.

One downside, though, is that the library consumer does need to think about this divisor. I suppose since the two camera modules have a limited list of native resolutions, we could think about assigning optimal divisors and making custom values optional with a ConvolutionBase constructor overload.

I spent some time considering all of the standard resolutions (v1 camera, v2, and HQ) and it turns out only four of them work with a single divisor for both horizontal and vertical cell count. Based on that, I went ahead and added separate horizontal and vertical values, and built a list of recommended values that are applied if the user doesn't provide different values. This way the user only has to think about this if they're processing oddball resolutions.

Now only 1640 x 922 (camera v2 mode 5) is a problem: 922 is only divisible by 2 and 461. (Even if we used 461 for two-pixel-high cells and a large number of horizontal divisions, on the theory that it's really total pixel count that matters for parallel processing and not the precise cell shape, that wouldn't work here: two pixels is smaller than a 3x3 matrix. And a roughly square shape does matter for motion detection, which uses these same settings.)

public static IReadOnlyDictionary<(int width, int height), (int horizontal, int vertical)> RecommendedCellCounts
    = new Dictionary<(int width, int height), (int horizontal, int vertical)>(13)
    {
        { (1920, 1080), (30, 30) }, // 900 cells    64 x 36
        { (2592, 1944), (36, 36) }, // 1296 cells   72 x 54
        { (1296, 972),  (27, 27) }, // 729 cells    48 x 36
        { (1296, 730),  (48, 10) }, // 480 cells    27 x 73
        { (640, 480),   (32, 32) }, // 1024 cells   20 x 15
        { (3280, 2464), (40, 22) }, // 880 cells    82 x 112
        { (1640, 1232), (40, 22) }, // 880 cells    41 x 56
        { (1640, 922),  (40, 23) }, // 920 cells    41 x 40.09 (922 has no useful divisor)
        { (1280, 720),  (20, 36) }, // 720 cells    64 x 20
        { (2028, 1080), (26, 36) }, // 936 cells    78 x 30
        { (2028, 1520), (26, 38) }, // 988 cells    78 x 40
        { (4056, 3040), (26, 38) }, // 988 cells   156 x 80
        { (1012, 760),  (23, 38) }, // 874 cells    44 x 20
    };
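
Consumers would only think about this for oddball resolutions, along these lines (the override plumbing is illustrative):

// Use the recommended divisors when the resolution is known, otherwise
// fall back to values supplied via the ConvolutionBase constructor overload.
if (!RecommendedCellCounts.TryGetValue((width, height), out var cellCounts))
{
    cellCounts = (userHorizontal, userVertical);
}
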
MV10 commented 4 years ago

Sorry, I know I'm creating quite the backlog of commentary here -- but I wonder if the Bound method should clamp to the cell's lower X,Y values instead of zero? I can't detect a difference in the output either way; maybe it doesn't go negative that often with the kernels in the library.

// Clamps a coordinate to [startIndex, endIndex). Taking startIndex as a
// parameter allows clamping to the cell's lower bound rather than zero.
private int Bound(int value, int startIndex, int endIndex)
{
    if (value < startIndex)
    {
        return startIndex;
    }

    if (value < endIndex)
    {
        return value;
    }

    return endIndex - 1;
}

MV10 commented 4 years ago

I shaved off maybe 10ms with a few more changes (primarily, sticking to double until the very last byte cast, and also aggressive inlining of the clamp method), but since I already have it wired up for timing, I tried changing all ImageContext properties to fields. As I suspected, the results are dramatic. Over 10 runs of the 5x5 blur at 27 cell divisions, average time is 699ms (down from 942ms), with a range of 694 to 708 (so far less variance, too) -- a 26% improvement. A 3x3 kernel is now just 300ms (down from 384ms), range 294 to 305 -- a 22% improvement.
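
The inlining change is just an attribute on the Bound helper shown earlier:

using System.Runtime.CompilerServices;

// Hint the JIT to inline the clamp in the per-pixel loop.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private int Bound(int value, int startIndex, int endIndex)
{
    if (value < startIndex)
    {
        return startIndex;
    }

    if (value < endIndex)
    {
        return value;
    }

    return endIndex - 1;
}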

And again, that change should yield performance improvements literally everywhere in the library. I suspect the library has other hot-path properties which would probably benefit from being changed to fields, but this was an obvious target.

How do you feel about it?

MV10 commented 4 years ago

Speed seems pretty directly proportional to resolution. I clocked 10 runs against a 640 x 480 image. A 3x3 matrix averages 116ms and a 5x5 averages 209ms. Still not full motion, but not bad. It would definitely be usable for motion detection with some appropriate matrix applied.

I think I've pushed this about as far as it can go. I was trying to find a way to parallel-process the kernel matrix itself, but that would mean violating thread-safety constraints. There's some pre-processing that could speed up multiple passes considerably, but I doubt that's a common use-case, I don't think it would reach full-motion speeds, and it would use a lot of memory.

Also, having to prep all that extra memory would probably erase any speed gains ... I looked into matrix linearization and FFT-based convolution, which are common ways of speeding up kernel convolution, but they all seem to lead to that problem. For example, at 1296 x 972 x 24 bits, matrix linearization for a 5x5 kernel would chew up ~184 MB of extra memory, lol. Edit: I also calculate an FFT would require ~88 MB of extra memory -- these approaches can dramatically speed up convolution, turning millions of calculations into a few tens of thousands -- maybe a direction to explore if we get GPU support figured out.

It has been another very interesting journey! Apologies once again for the Wall of Text. Hopefully you can guide me to generating raw RGB24 snapshots into a capture handler with post-processing, and I'm quite interested in hearing what you think about the ImageContext situation above.

MV10 commented 4 years ago

Something interesting that I don't understand -- it appears the kernel is applied 90 degrees from what I expected. Yet the row/col relationships in the processing loops look correct to me.

Note the middle column of "2" values in this kernel -- as documented everywhere online, this should edge-detect vertical lines, but it actually highlights horizontal lines:

{ -1,  2, -1 }
{ -1,  2, -1 }
{ -1,  2, -1 }

And similarly, this one highlights vertical lines rather than the horizontal lines it's documented to detect:

{ -1, -1, -1 }
{  2,  2,  2 }
{ -1, -1, -1 }
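
For what it's worth, a transposed kernel produces exactly this effect; as a purely illustrative example of how indices can cross without the loops looking obviously wrong (Pixel and offset are placeholders):

// Intended: kernel row ky walks image rows, kernel column kx walks columns.
acc += kernel[ky, kx] * Pixel(x + kx - offset, y + ky - offset);

// Accidentally transposed: swapping the kernel indices rotates the
// effect of any non-symmetric kernel by 90 degrees.
acc += kernel[kx, ky] * Pixel(x + kx - offset, y + ky - offset);
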
MV10 commented 4 years ago

Quick note (for lack of a better place to put it), based on comments from jamesh about a year ago, it looks like OpenGL may never be an option. The camera also ties up the GPU, unfortunately. Oh well, at least I only spent a day on it. :)

https://www.raspberrypi.org/forums/viewtopic.php?f=63&t=247941&p=1514651#p1514101

The camera stuff rewrite is because KMS/DRM accesses the HVS hardware directly (akin to the 3D), completely bypassing dispmanx. The camera, however, uses dispmanx in the GPU firmware, and the two cannot coexist.

So, the camera software would need to be converted to use DRM/KMS for the desktop case and, IIRC, just DRM for the command line case. Not only that, but the camera ISP stack would also need to run on the ARM (it's currently on the GPU), so that would need to be rewritten as well. Huge amount of work. Multiple man-years.

techyian commented 4 years ago

Hi Jon,

The issue you were seeing with the image processing being applied twice is because of a change made in b7820c7d where I set the minimum number of buffers on the camera's still port to 3, see here. In your tests the still port is returning two buffers: one which is filled, and another which is probably empty (length == 0). I'll add a check to the native port callback method which prevents the Callback Handler from being invoked when the buffer is empty. There is also a bug in ConvolutionBase which prevents raw frames from being stored to an encoded format such as JPEG, so I'll fix that too.
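
Roughly, the check will look something like this (sketch; the names differ in the actual callback):

// Skip empty buffers so the callback handler -- and any Manipulate
// delegate -- only fires for buffers that contain frame data.
if (bufferHeader.Length == 0)
{
    return;
}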

I have noticed an issue when storing raw frames to an encoded format (JPEG in my testing): the colours are inverted, which I find strange. The culprit seems to be the Save method of the Bitmap class itself; to prove that, I passed a null StoreFormat through to the Manipulate method and ran the resulting raw file through an MMALImageEncoder, which worked fine with no distorted image. I've not had a chance to look into this any further yet.

Ian.

MV10 commented 4 years ago

Hi Ian,

Oh good. What do you think about TakeRawPicture exiting before ConvolutionBase has a chance to finish executing?

There is also a bug in ConvolutionBase which is preventing raw frames from being stored to an encoded format such as JPEG so I'll fix that too.

Perhaps hold off fixing that; it changes completely in my implementation. I've added an ImageContext extension like this:

public static void FormatRawImage(this ImageContext context)
{
    if (!context.Raw)
    {
        throw new Exception("ImageContext does not contain raw data");
    }

    if (context.StoreFormat == null)
    {
        throw new Exception("ImageContext.StoreFormat does not define a target ImageFormat");
    }

    var pixfmt = MMALEncodingToPixelFormat(context.PixelFormat);

    using (var bitmap = new Bitmap(context.Resolution.Width, context.Resolution.Height, pixfmt))
    {
        BitmapData bmpData = null;
        try
        {
            bmpData = bitmap.LockBits(new Rectangle(0, 0, bitmap.Width, bitmap.Height), ImageLockMode.WriteOnly, bitmap.PixelFormat);
            var ptr = bmpData.Scan0;
            int size = bmpData.Stride * bitmap.Height;
            var data = context.Data;
            Marshal.Copy(data, 0, ptr, size);
        }
        finally
        {
            bitmap.UnlockBits(bmpData);
        }

    using (var ms = new MemoryStream())
        {
            bitmap.Save(ms, context.StoreFormat);
            context.Data = ms.ToArray();
        }
    }
}
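
Hypothetically, the call site in a processor then becomes (sketch):

// Only encode when a target format was requested; a null
// StoreFormat leaves the processed frame raw.
if (context.StoreFormat != null)
{
    context.FormatRawImage();
}
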
MV10 commented 3 years ago

Addressed by convolution changes added to PR #175.