Analysis re-reworking - Githubissues

philipbelesky commented 4 years ago

Hey, this is just a tracking PR for some of the ongoing changes that cut across #30 #33 #13; will post highlights as I go and signal when a full review is ready. As of now:

Analysis.cs was getting quite bloated so it has been split into two files: one for the base classes and one for the specific analysis options
Coloring now has a separate per-pixel loop to accommodate the need for this initial slope/elevation coloring to use neighboring pixels that have not been assigned yet when initially building the point cloud
There is a hook for the MeshGeometryAnalysis class of analyses to output geometry after the point cloud/mesh items have been made; e.g. for the water plane or contouring.
The performance of the prototype slope analysis (using just 4 neighbors) is about 50ms in total. I'm sure the function itself can be micro-optimized, but I suspect a better overall method will definitely be needed. The calculate-along-edges approach should be tried and I'd imagine the coloring loop in general would also be quite amenable to multi-threading

mariuszhermansdorfer commented 4 years ago

Looking forward to seeing this implemented! 50ms sounds like a relatively big overhead for what we are doing. Essentially it is the same logic as in the GaussianBlurProcessor -> iterate over the entire array of pixels and define a value for each of them based on their neighbors. Blurring takes 1-2ms, we should be able to achieve a similar execution time with slope as well.

Maybe @Sergio0694 has some ideas on how to approach this.

Sergio0694 commented 4 years ago

Hey @mariuszhermansdorfer - I'd be happy to help but I'm not 100% sure about what you're trying to do here. A couple small tweaks right off the bat, from what I'm seeing:

Replace foreach loops with for loops (saves creating an iterator)
Replace double variables with float, and use MathF instead of Math, if you don't need the extra 64 bit precision

Other than that, what kind of effect are you trying to apply here? If it's a 2D convolution with some kernel that is linearly separable, then yeah you can replicate my gaussian blur effect by using two 1D convolutions instead of one. This of course gets the computation cost down from O(n^2) to O(n). If instead you're doing some other additional work on the surrounding pixels, it might be trickier to optimize.

mariuszhermansdorfer commented 4 years ago

Thanks @Sergio0694! Sorry for not providing a clearer explanation of what we are trying to accomplish. The goal is to visualize slope of the terrains created in the sandbox. To achieve that, we need to calculate a value for each pixel and store it in an array. This is then passed to our meshing function as vertexColors information. Slope is defined as rise/run. rise is equal to the difference in elevation and run is the distance in the XY plane between points. Since we operate on a defined grid of pixels run is constant (its three constants to be precise, one for the X, one for the Y, and for the XY direction). Since each pixel can have up to 8 neighbors, we calculate an average of up to 8 individual comparisons and assign this value to the analyzed pixel. This logic has to be repeated for every single one of our 217088 pixels.

Does this sound like a good candidate for linear separation?

Sergio0694 commented 4 years ago

I'm sorry, I still don't really understand the exact operation you're describing. 1) What is the type of the source image? Is it just an array of numeric values (eg. int), or is something like an array of RGB vectors? 2) "rise is equal to the difference in elevation" - elevation between what?

Anyway, if the total area being processed for a single pixel is just the 3x3 square centered in that pixel, even if you could come up with a linearly separable kernel to represent that operation it likely wouldn't bring a performance increase, as the kernel is so small already. What I suggest is to: 1) Parallelize the loop on a per-row basis, just like I did in my gaussian blur 2) Unroll the loop and make all the possible values constants.

As for point 2, as an example: imagine that for each pixel you were just summing its value and the one of the 2 following pixels, if within bounds (completely made up example, bear with me).

Classic implementation (with no optimizations):

const int width = 640, height = 380;
int[] image = new int[width * height];

for (int y = 0; y < height; y++)
{
    for (int x = 0; x < width; x++)
    {
        int sum = 0;
        for (int i = 0; i < 3; i++)
        {
            if (x + i == width) break;
            sum += image[y * width + x + i];
        }

        image[y * width + x] = sum;
    }
}

Unrolled implementation (with no other optimizations):

for (int y = 0; y < height; y++)
{
    for (int x = 0; x < width; x++)
    {
        int sum = image[y * width + x];
        if (x + 1 < width) sum += image[y * width + x + 1];
        if (x + 2 < width) sum += image[y * width + x + 2];

        image[y * width + x] = sum;
    }
}

Note though that the second is more of a micro-optimization, so I'd suggest to just making the code run in parallel, that might very well be enough for you already. Following up from the previous example, the parallelized version could look something like this:

Parallel.For(0, height, y =>
{
    for (int x = 0; x < width; x++)
    {
        int sum = image[y * width + x];
        if (x + 1 < width) sum += image[y * width + x + 1];
        if (x + 2 < width) sum += image[y * width + x + 2];

        image[y * width + x] = sum;
    }
});

Hope this helps!

P.S.: bonus example if you really want to squeeze out as much performance as possible:

Parallel.For(0, height, y =>
{
    ref int row = ref image[y * width];
    for (int x = 0; x < width; x++)
    {
        int sum = 0;
        if (x + 1 < width) sum += Unsafe.Add(ref row, x + 1);
        if (x + 2 < width) sum += Unsafe.Add(ref row, x + 2);

        Unsafe.Add(ref row, x) += sum;
    }
});

But again, first just make the code run in parallel 👍

mariuszhermansdorfer commented 4 years ago

Thanks, @Sergio0694! To answer your questions:

The source image is a double[], which is a result of the Gaussian blurring function you wrote for us earlier.
By elevation I meant the actual values of each element in an array. So rise is a difference between two values in the above double[]. It really is as simple as that.

Following your suggestion I wrote a very verbose slopeCalculator. It first takes care of the edge cases for individual pixels in the corners of the image (NW, NE, SW, SE), rows (first and last) and columns (first and last) and then iterates over the main array. Results are pretty satisfying, but also quite surprising. @philipbelesky, I went with a D8 approach just to see how much of an overhead it would cause.

Sequential for loop - 2ms

Parallel for loop - 14 ms

I suspect it is because the logic needs to read pixel values from other rows which may be locked by other threads writing to it.

Any ideas on how to optimize it from there? Maybe some of your unsafe mojo would help :)

Sergio0694 commented 4 years ago

@mariuszhermansdorfer So, by quickly looking at your code, a few suggestions come to mind: 1) Absolutely swap your double for float values. The loss in precision shouldn't really be noticeable at all, but it should speed up calculations by a fair bit. I mean to replace all double values, so both in the gaussian blur processor and in this new code your wrote. 2) After you're done with the previous point, remember to replace all your Math calls with MathF calls (same exact methods, just working on float values and not double 3) Especially if this code you wrote is being called repeatedly (possibly multiple times a second), do this change in this line:

// Replace this
ushort[] slopeValues = new ushort[width * height];

// With this
ushort[] slopeValues = ArrayPool<ushort>.Shared.Rent(width * height);

ArrayPool<T> (see this blog post for more details, and the docs page here) is a high-performance pool of arrays, which makes it so that you can reuse arrays instead of forcing the runtime to allocate a new one every time. Just be wary of a couple caveats:

When renting arrays, you'll have to manually return them to the pool when you're done, you can't just throw them away as usual. To do so, just use ArrayPool<T>.Shared.Return(T[] array).
- The length of the returned array might be greater than the requested length, so make sure to avoid iterating on the Length property of your array (since you're always iterating with your previous width and height values, that should be taken care of already 4) At this line you're accessing a variable from an outer scope, within a parallel loop. Since you only need a temp variable within that parallel loop code, avoid doing that and just declare another local variable for that method block, and use it from there.

After you're done with all of these I might also do a pass replacing those array accesses with Unsafe.Add accesses, but that's a micro-optimizations which might not be that necessary here.

Hope this helps!

mariuszhermansdorfer commented 4 years ago

@Sergio0694, you're awesome! Point number 4 made an enormous difference! Localizing the variable took it down to less than 1 ms. Most satisfying :)

I will implement the rest of your suggestions tomorrow to make it even snappier. I have two questions, though:

MathF is part of the System namespace. It isn't recognized, though. Am I missing a reference? Also, since it is not the first time I run into this issue, where can I find this information in the future. It seems to be missing in the docs.
When would I return the rented array? Would you do this after returning the function result? https://github.com/mariuszhermansdorfer/SandWorm/blob/c1c867abc91ee299d7ca1103b890f093aca97061/SandWorm/SlopeCalculator.cs#L125

Sergio0694 commented 4 years ago

@mariuszhermansdorfer Ahahah thanks, happy to help! 😊 It's great that you got such a performance improvement just by changing that variable ahahah

As for your questions: 1) My bad, I just double checked the docs and apparently MathF is not available for .NET Framework, but just for .NET Core. Unless you move your whole project to .NET Core 3.0 (which would also be faster in general, though you'd have to test to see if all the packages you're using support it), unfortunately you won't be able to use that API. 2) No, you can't return it from that method, otherwise the caller would receive an invalid array. The way you'd structure that is the following:

// Somewhere in your code when you need this
var slopes = SlopeCalculator.CalculateSlope(...);

// Do stuff with slopes...

ArrayPool<double>.Shared.Return(slopes);

That is, return that array when you're sure you won't be using it again. For instance, if you're getting that array within some other method, which is using it to do some other work and then just discarding it (ie. not returning it or saving it somewhere else), you could just return that array to the pool at the end of that method.

philipbelesky commented 4 years ago

Thanks also for your help @Sergio0694! I think we are stuck on .NET 4.5 as this is what our host environment provides, although it is worth testing if we can get around that.

@mariuszhermansdorfer the new CalculateSlope function shifts the coloring process outside of the per-pixel loop and operates with the entire point cloud as the input rather than on a per-point basis. To integrate this approach with the existing analysis classes, it looks like GetPixelIndexForAnalysis should be replaced with a method along the lines of GetPixelColorsForAnalysis that can can accomodate this 'all at once' approach.

We have a few different branches floating around that should probably be resolved first before pursuing that option. Is it OK to merge this branch in? I can then merge/integrate the new setup component, and then rework the analysis classes to integrate the method from performance/slope_parallel?

mariuszhermansdorfer commented 4 years ago

@philipbelesky, yes, please go ahead with the merging. I wasn't pushing any other commits to the master just to make sure I don't make your life harder with integrating the analysis part.

mariuszhermansdorfer commented 4 years ago

Thanks for clarifying things, @Sergio0694. As far as I can understand, we are stuck with .NET 4.5 unless we explicitly make our users install newer versions of the framework.

I must admit, I don't understand this explanation fully, but it seems we can't use the .NET Core for this particular project.

Sergio0694 commented 4 years ago

No problem! As for .NET Core 3.0, if these changes have already brought the time it takes to perform this section of the pipeline down to below 1ms I'd say you can just forget about that then. You might just want to throw in that ArrayPool<T> optimization just if you want to really go all the way, but if you're at 1ms already that might not even be noticeable anyway.

mariuszhermansdorfer commented 4 years ago

Is this one ready to be merged into master @philipbelesky? Feel free to do it when you think we're good to go.

philipbelesky commented 4 years ago

Great, done. I’ll have some time later this week to integrate the new structure of the slope algorithm and will try to replicate it for aspect at the same time.

mariuszhermansdorfer / SandWorm

Analysis re-reworking #34