mariuszhermansdorfer / SandWorm

Augmented Reality Sandbox for Grasshopper & Rhino
MIT License

Smooth depth pixels #11

Closed mariuszhermansdorfer closed 4 years ago

mariuszhermansdorfer commented 4 years ago

Test different smoothing algorithms (Gaussian blur, weighted average, etc.) and implement the best one to work directly on the depth array.

mariuszhermansdorfer commented 4 years ago

Implemented both temporal (averaging across a given number of depth frames) and spatial (Gaussian blur) smoothing algorithms.
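In essence, the temporal pass is just a per-pixel mean over the last N depth frames. A simplified sketch (illustrative names, not the exact code in the repo; see point 3 in the open issues below):

// Temporal smoothing: average each pixel across the buffered depth frames.
// This is the naive version, costing O(frames) per pixel on every update.
static void AverageFrames(IReadOnlyList<ushort[]> frames, double[] result)
{
    for (int i = 0; i < result.Length; i++)
    {
        int sum = 0;
        foreach (ushort[] frame in frames)
            sum += frame[i];
        result[i] = (double)sum / frames.Count;
    }
}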

Here is a short video showing it in action

Here are some open issues:

  1. Gaussian blur currently takes the entire depth array as input, even though we only need a trimmed portion of it. This is low-hanging fruit for optimization.
  2. Gaussian blur is not very good at preserving edges. Keep searching for a better (and equally fast) solution.
  3. Research smarter ways of temporal averaging than a foreach loop fired for every single pixel.
  4. Could this part be accelerated on the GPU?

I'd appreciate your feedback @philipbelesky

philipbelesky commented 4 years ago

Maybe for 3 there is some sort of clustering (a quad tree?) that could be used to isolate the particular pixels that need to be averaged, rather than looping through the whole set.
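Very roughly, I'm imagining a cheap change-detection pass first, before any clustering (sketch only; names are made up):

// Flag only the pixels whose depth changed beyond a threshold since the last
// frame; the smoothing pass could then be restricted to this subset.
List<int> dirtyPixels = new List<int>();
for (int i = 0; i < current.Length; i++)
{
    if (Math.Abs(current[i] - previous[i]) > threshold)
        dirtyPixels.Add(i);
}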

I'm not sure about 4; I haven't done any real GPU programming, although it might be worth looking into for this and other forms of mesh analysis (e.g. water flow) that would presumably be sped up if offloaded.

mariuszhermansdorfer commented 4 years ago

I have tried various approaches today:

  1. Parallel.For for the main loop. This turned out to be a total mess to handle, given all the concurrent read/write operations to various lists. Seems to be a dead end.

During testing I realized that operating on arrays rather than lists shaves off approx. 3 ms of execution time. This is implemented in the latest commit (#26).
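For reference, the difference is mainly in how the JIT treats the two access patterns; a minimal sketch of the comparison (not the actual repo code):

// With a raw array and an i < values.Length loop condition, the JIT can elide
// the per-element bounds check; the List<int> indexer is a call with its own check.
static long SumArray(int[] values)
{
    long total = 0;
    for (int i = 0; i < values.Length; i++)
        total += values[i];
    return total;
}

static long SumList(List<int> values)
{
    long total = 0;
    for (int i = 0; i < values.Count; i++)
        total += values[i];
    return total;
}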

Previous version: image

Current version: image

  2. PointCloud generation seems to be quite optimal now. Meshing still takes some time, though. I tried replacing only those vertices in the mesh definition whose Z-value had changed by more than a given threshold. This worked, but turned out to be slower than batch-replacing all vertices as is done now.

mariuszhermansdorfer commented 4 years ago

Looking for ideas on how to accelerate C# code on the GPU, I've found ComputeSharp.

Haven't tried it yet, but it might be a good idea to check it out for averaging across frames and Gaussian blurring, as well as for #13, #14 and #15.

Sergio0694 commented 4 years ago

Hey @mariuszhermansdorfer and @philipbelesky - I'll just reply here so both of you can read. First of all, thank you for your interest in ComputeSharp!

You can definitely give ComputeSharp a try for both the frames averaging and the gaussian blur phases, and I'd be happy to help out a bit with that. As for the gaussian blur part, you'll see that I've already implemented a GPU-powered gaussian blur algorithm that works with two 1D convolutions (since the gaussian kernel is linearly separable) in one of the samples for the library; you can find the code here.
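Just to illustrate what "linearly separable" buys you: the 2D blur can be done as a horizontal 1D pass followed by a vertical one, dropping the per-pixel cost from O(k²) to O(2k) for a kernel of size k. A CPU-side sketch of the horizontal pass (not the actual sample code):

// One 1D convolution pass of a separable gaussian blur (horizontal direction).
// The vertical pass is identical with the roles of x and y swapped.
static void BlurHorizontal(float[] src, float[] dst, int width, int height, float[] kernel)
{
    int radius = kernel.Length / 2;
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            float sum = 0;
            for (int k = -radius; k <= radius; k++)
            {
                int xk = Math.Min(Math.Max(x + k, 0), width - 1); // clamp at the edges
                sum += src[y * width + xk] * kernel[k + radius];
            }
            dst[y * width + x] = sum;
        }
    }
}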

I'm not sure how much of a performance improvement you'll get here exactly since the data you're working on is quite small (considering a single frame), and using the GPU has some overhead (namely, copying the data to and from the GPU, plus some other stuff). It's definitely worth a try though.

Also, a small note: HLSL doesn't have a ushort type, so you'll need to convert your data into an int[] array to be able to use it on the GPU. In general, it might not bring much of a performance improvement if the kernel size is very small, but it might if it gets larger. Same for the frames average: you'll probably get better relative results the higher the number of frames you average across.

About the frames averaging: if working on the GPU, I'd recommend keeping the frames in memory across iterations and only swapping the first and last one each time, instead of having to copy a whole bunch of frames to and from the GPU every time. And since I've mentioned that GPU acceleration scales a lot better the larger the data you're working on, this is an aspect where I'd expect to see a nice performance bump if you start doing averages across a decent number of frames (e.g. a 20+ frame average on the GPU should be noticeably faster than on the CPU).

I also have another suggestion in general: if you're dealing with compute-intensive scenarios like this, where performance needs to be as optimized as possible, I highly recommend ditching the array indexer and using the new Unsafe class and APIs as much as possible (see here).

As an example of a micro-optimization you might try out, you can convert loops such as this:

int[] array = ...; // some array
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];

Into this:

int[] array = ...; // some array
ref int r0 = ref array[0];
int sum = 0, length = array.Length;
for (int i = 0; i < length; i++)
    sum += Unsafe.Add(ref r0, i);

Or, for instance, you can also try to replace manual copy loops such as this with the following:

source.AsSpan().CopyTo(destination);

Which uses a fast implementation that comes with built-in loop unrolling, chunking (e.g. copying the values as ulongs when enough data is present), and so on, and which is more likely to be vectorized by the JIT compiler at runtime. There are a number of small tweaks like these you could try too.

Just to make another example, here's how you could convert your ushort[] array into int[], using these APIs (the Unsafe.Add<T> API takes care of the proper offsetting here):

public static void CopyAsIntArray(ushort[] source, int[] destination)
{
    ref ushort ru0 = ref source[0];
    ref int ri0 = ref destination[0];
    int length = source.Length;

    for (int i = 0; i < length; i++)
        Unsafe.Add(ref ri0, i) = Unsafe.Add(ref ru0, i);
}
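
For example, usage would just be (sizes assumed to match):

ushort[] depthFrame = new ushort[512 * 424]; // one Kinect depth frame
int[] upcast = new int[depthFrame.Length];
CopyAsIntArray(depthFrame, upcast);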

If you do decide to experiment with ComputeSharp, definitely let me know how it goes! 😊

EDIT: I had some time and tried to play around with this; I created a sample benchmark for the frames average part here. Note that my desktop has a Ryzen 7 2700X (8C/16T @ 4 GHz) and a GTX 1080, so results may vary. Here's what I'm getting:

Method   Frames   Mean       Error       StdDev      Median     Factor
CpuAvg   150      2.540 ms   0.0048 ms   0.0045 ms   2.539 ms
GpuAvg   150      1.891 ms   0.0594 ms   0.1626 ms   1.936 ms   1.34x
CpuAvg   300      4.950 ms   0.0171 ms   0.0143 ms   4.946 ms
GpuAvg   300      1.736 ms   0.0686 ms   0.1935 ms   1.838 ms   2.85x

As expected, the larger the source data (and this is still pretty small), the greater the performance difference. If you were to calculate the average over, say, 1000 frames, the difference would be even bigger. For a smaller number of frames, say 20 or 30, both methods take less than 0.5 ms, but the CPU version is faster by around 200-300 µs in that case, so I'd say for such a small number of frames it's not worth it to use the GPU. It depends on how many frames you expect your project to work on.

That said, at least for the frames averaging, I'm not sure this kind of algorithm is the right choice here at all. I mean, instead of recomputing the average for each pixel across the whole frames buffer every time a new frame arrives, why don't you just keep track of the last frame that was removed, and simply do, for each target value:

value = ((value * nFrames) - valueFromOldFrame + valueFromNewFrame) / nFrames

This would always run in O(1), regardless of the number of frames you're averaging across. The point is that no matter how many frames you're considering, at each iteration only the two values at the start and end of the frames buffer change; all the others remain constant anyway. I don't see why you'd want to actually recompute that partial sum every time.
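In C#, the whole update would look something like this (sketch with illustrative names; it keeps a per-pixel running sum, which is algebraically equivalent to the formula above):

// O(1) per pixel regardless of how many frames are being averaged: only the
// newest and oldest frames touch the running sum; everything in between is unchanged.
static void UpdateAverage(int[] runningSum, float[] average,
                          ushort[] oldestFrame, ushort[] newestFrame, int nFrames)
{
    for (int i = 0; i < runningSum.Length; i++)
    {
        runningSum[i] += newestFrame[i] - oldestFrame[i];
        average[i] = (float)runningSum[i] / nFrames;
    }
}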

mariuszhermansdorfer commented 4 years ago

First of all, thanks a million for this detailed reply @Sergio0694!

Ad frame averaging - in our specific case, the number of frames to average across has to be kept low. It is a balancing game between removing noise from the signal and keeping the sandbox interactive. If we were to average across 300 frames (10 seconds given the Kinect's 30 FPS), there would be ghosting issues, with users' hand movements being stored in the buffer.

Your suggestion to simplify the approach seems optimal. Thanks a lot!

Ad Gaussian blur - I don't know whether it's asking too much, but would you be willing to take on this challenge and adapt your current Gaussian blur kernel to our needs? The Kinect spits out ushort[] values, but we cast these to int anyway, so the HLSL limitation you mentioned shouldn't matter in this case.

Sergio0694 commented 4 years ago

Hey @mariuszhermansdorfer - no worries, and yeah I can write a gaussian blur sample for you, no problem. It's already there in my sample anyway, so it'd just need some minor tweaking. I could make some tests on both the CPU and GPU actually, just in case.

A couple of questions: 1) What gaussian blur radius do you expect to use? Something very small like 3, or something bigger like, say, 20 or 30? 2) Since you mentioned you'd cast that array to int anyway, do you mean I can just forget about the data originally being ushort and take int[] as both input and output?

mariuszhermansdorfer commented 4 years ago

Thanks a lot @Sergio0694!

To answer your questions:

  1. The expected radius will probably be around a few pixels. We want to get rid of the inaccuracies of the sensor but still keep the detail in the box. You can see the current implementation in action here.

  2. The output from the Kinect is a ushort[]. It is then enqueued in the renderBuffer:

     renderBuffer.Enqueue(KinectController.depthFrameData);

     Then we loop over all the pixels to get their elevation:

     depthPoint = KinectController.depthFrameData[i];

     Finally, we cast these to float to form a Point3f:

     tempPoint.Z = (float)((depthPoint - sensorElevation) * -unitsMultiplier);

So we operate only on ushort in terms of precision, but have to cast to float in the end to be able to form a Point3f. I don't know what the most efficient way to approach the casting is.

Another thing is the order of operations. Currently it works as follows:

  1. Read depth data
  2. Gaussian blur the depth array (for loop, operating on the whole 512x424 array)
  3. Average over multiple (blurred) frames & create individual points (another for loop, operating on a user-defined subset of the array)

A more correct approach would be to:

  1. Read depth data
  2. Average over multiple frames, thus removing noise in the signal
  3. Gaussian blur the averaged frame
  4. Create individual points

The reason for our current approach was to avoid introducing a separate loop just for the averaging. Ideas on how to optimize this are more than welcome.

Sergio0694 commented 4 years ago

@mariuszhermansdorfer alright, I've created a new branch and added my implementation of a gaussian blur, using parallel processing and linearly separable kernels.

The processor applies the effect on a float[] buffer, so the idea would be to do the following:

  1. Read depth data and "upcast" it to int[] (see my previous snippet)
  2. Compute the frames average using that O(1) method I described before. This will result in a float[] array
  3. Apply the gaussian blur to the result (the gaussian blur processor can be reused, so just create it once with the radius you need)
  4. Create the final points

I think that just with that O(1) approach for the frames average and the parallel processing in the gaussian blur, you should get a pretty noticeable boost in your scenario.

Hope this helps!

mariuszhermansdorfer commented 4 years ago

Thanks @Sergio0694! A very noob question - when I opened your GaussianBlurProcessor class, my Visual Studio immediately complained: Error CS0103 The name 'Unsafe' does not exist in the current context

How does one 'unlock' these?

Sergio0694 commented 4 years ago

@mariuszhermansdorfer It's because that class is not available in .NET Framework 4.5, so I had to install an additional NuGet package for it. Just rebuild the solution, and VS will install that package for you.

mariuszhermansdorfer commented 4 years ago

@Sergio0694 The issue was this guy: using System.Runtime.CompilerServices;

Thanks a lot. Rebuilding solved it.

mariuszhermansdorfer commented 4 years ago

I've made some progress in the branch you started @Sergio0694. The upcasting to int idea is brilliant and allows us to trim the depth array to the exact dimensions needed in the same loop. Your Gaussian blurring method works well too, but it turns out it is slightly slower than what we had before. Short video here.

The other one is adapted from this repo, and here is some theory behind it.

You mentioned it could potentially be made even faster if this loop were optimized. When I follow your suggestion and try to replace it with source.AsSpan().CopyTo(destination); VS again complains that I'm missing an assembly reference. According to the docs, this method should be part of the System namespace, no? Any other ideas on how to make things snappier?

mariuszhermansdorfer commented 4 years ago

Implemented averaging across frames as well. Thanks for providing some inspiration, @Sergio0694! I had to add some checks for edge cases, but the basic principle you suggested remains - add one element to the queue, remove one from the bottom if necessary, and divide by the number of elements in the queue.

https://youtu.be/GW9LiEMn8Fg

Having done some more testing, I think I'd like to stick with your Gaussian blurring method. It gives users finer control over the radius and is almost as fast as the other one.

Sergio0694 commented 4 years ago

Hey @mariuszhermansdorfer - thanks for adding me as a contributor! 😊 So, that gaussian method you linked looks pretty interesting, thanks for sharing! And glad to hear that my other implementation still works for you; yeah, that was designed to be both fast and entirely customizable, so that you can use any radius you want.

As for that AsSpan() API, it's in the System namespace, but it's not in the default assembly on .NET Framework; that's why I've added a reference to System.Memory, you'll find it there 👍 That's the same package that also includes Span<T> and other high-performance APIs.
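(For reference, on .NET Framework it's just a NuGet package, e.g. from the package manager console:)

Install-Package System.Memory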

I'm not sure I'm following your paragraph about the queue for the frames average - are you actually using a queue, or was that just a figure of speech? Because with a queue you'd still be doing the operation in O(n); what I suggested was just to keep the actual averaged value and mutate it, adding and subtracting the first and last values in the "view" for the current pixel. Is that what you meant there?

EDIT: also, is this method called just once or once per frame?

mariuszhermansdorfer commented 4 years ago

Thanks for the explanation regarding the AsSpan() API. I'll have a look and see what it does. We're slowly reaching the point of diminishing returns, though. This part of the code is already pretty fast - shaving off another 20% is not going to result in a noticeable speed-up. The time is probably better invested in adding more features or fixing bugs :)

Apologies for not being more precise - I do use a queue, but only to keep track of the last item to dispose of. It's just more practical when there is a FIFO construct to begin with.

I couldn't really figure out how to implement your logic one-to-one. It would be easier if we were dealing with an array of constant size and a constant number of frames to average across. In our case, the array length can change between frames (users trimming the extent of the scanning field interactively), and the number of frames to average can be adjusted live as well.

As a result, I ended up storing a running sum for every pixel in a separate array. On each tick I add the current depth pixel's value, subtract the one from the oldest frame in the queue, and divide by the length of our render buffer. If any of the above-mentioned variables change, everything gets reset.
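Roughly, the per-tick update looks like this (simplified sketch, names illustrative):

// Push the new frame, pop the oldest once the buffer is full, and keep a
// per-pixel running sum so the average costs O(1) per pixel each tick.
renderBuffer.Enqueue(newFrame);
ushort[] oldest = renderBuffer.Count > framesToAverage ? renderBuffer.Dequeue() : null;

for (int i = 0; i < runningSum.Length; i++)
{
    runningSum[i] += newFrame[i];
    if (oldest != null)
        runningSum[i] -= oldest[i];
    averagedDepth[i] = (float)runningSum[i] / renderBuffer.Count;
}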

Would be curious to hear your opinion.

The method you are referring to is called every 20 ms. That's how user input gets translated into our logic and how we are able to push stuff back to the 3D modelling environment.

mariuszhermansdorfer commented 4 years ago

Implemented in the latest commit (#28). I ended up using your algorithm after all, @Sergio0694. It does the job within 1 ms and gives users a lot of flexibility in tweaking the radius. Thanks a lot for your contribution!

Short video of the results

Sergio0694 commented 4 years ago

Hey @mariuszhermansdorfer - that's great, I'm glad I could help! 😊