Closed - mariuszhermansdorfer closed this 4 years ago
Implemented both temporal (averaging across a given number of depth frames) and spatial (Gaussian blur) smoothing algorithms.
Here is a short video showing it in action
Here are some open issues:
I'd appreciate your feedback @philipbelesky
Maybe for 3 there is some sort of way of clustering (a quad tree?) that could be used to isolate particular pixels that need to be averaged rather than looping through the whole set.
I'm not sure about 4, I haven't done any real GPU programming, although it might be worth looking at for this and other forms of mesh analysis (e.g. water flow) that presumably would be sped up if offloaded.
I have tried various approaches today:
During testing I realized that operating on arrays rather than lists shaves off approx. 3ms of execution time. This is implemented in the latest commit. #26
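For context on why the switch helps, here is a hypothetical micro-benchmark (not the SandWorm code): a `List<T>` indexer is a method call with its own bounds logic, while array access is a direct load that the JIT can optimize (including bounds-check elimination) in simple loops:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class ArrayVsListDemo
{
    // One Kinect v2 depth frame is 512 x 424 = 217,088 values.
    public static void Run(int size = 217_088, int passes = 100)
    {
        var array = new ushort[size];
        for (int i = 0; i < size; i++)
            array[i] = (ushort)(i % 8000);
        var list = new List<ushort>(array);

        long sumArray = 0, sumList = 0;

        var sw = Stopwatch.StartNew();
        for (int p = 0; p < passes; p++)
            for (int i = 0; i < array.Length; i++)
                sumArray += array[i];
        long arrayMs = sw.ElapsedMilliseconds;

        sw.Restart();
        for (int p = 0; p < passes; p++)
            for (int i = 0; i < list.Count; i++)
                sumList += list[i];
        long listMs = sw.ElapsedMilliseconds;

        Console.WriteLine($"array: {arrayMs} ms, list: {listMs} ms, sums equal: {sumArray == sumList}");
    }
}
```

The absolute numbers depend on hardware and JIT, but the array loop should consistently come out ahead.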
Previous version:
Current version:
Looking for ideas on how to accelerate C# code on the GPU I've found ComputeSharp.
Haven't tried it yet, but it might be a good idea to check it out for averaging across frames, Gaussian blurring as well as #13, #14 and #15
Hey @mariuszhermansdorfer and @philipbelesky - I'll just reply here so both of you can read. First of all, thank you for your interest in ComputeSharp!
You can definitely give ComputeSharp a try for both the frames averaging as well as the gaussian blur phases, and I'd be happy to help out a bit with that. As for the gaussian blur part, you'll see that I've already implemented a GPU-powered gaussian blur algorithm that works with two 1D convolutions (since the gaussian kernel is linearly separable) in one of the samples for the library, you can find the code here.
I'm not sure how much of a performance improvement you'll get here exactly since the data you're working on is quite small (considering a single frame), and using the GPU has some overhead (namely, copying the data to and from the GPU, plus some other stuff). It's definitely worth a try though.
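For reference, here is a minimal CPU-side sketch of the two-pass idea (not Sergio's actual sample code; the class and parameter names are made up). Separability means a 2D Gaussian blur can be applied as a horizontal 1D convolution followed by a vertical one, doing O(2r) work per pixel instead of O(r²):

```csharp
using System;

public static class SeparableBlur
{
    // Build a normalized 1D Gaussian kernel for the given radius.
    public static float[] CreateKernel(int radius)
    {
        float sigma = Math.Max(radius / 2f, 1f);
        var kernel = new float[2 * radius + 1];
        float sum = 0f;

        for (int i = -radius; i <= radius; i++)
        {
            float w = (float)Math.Exp(-(i * i) / (2f * sigma * sigma));
            kernel[i + radius] = w;
            sum += w;
        }
        for (int i = 0; i < kernel.Length; i++)
            kernel[i] /= sum; // weights now sum to ~1

        return kernel;
    }

    // Two 1D convolutions (horizontal, then vertical) instead of one 2D pass.
    public static float[] Blur(float[] source, int width, int height, int radius)
    {
        float[] kernel = CreateKernel(radius);
        var temp = new float[source.Length];
        var result = new float[source.Length];

        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
            {
                float acc = 0f;
                for (int k = -radius; k <= radius; k++)
                {
                    int xk = x + k;
                    if (xk < 0) xk = 0;
                    else if (xk >= width) xk = width - 1; // clamp at the edges
                    acc += source[y * width + xk] * kernel[k + radius];
                }
                temp[y * width + x] = acc;
            }

        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
            {
                float acc = 0f;
                for (int k = -radius; k <= radius; k++)
                {
                    int yk = y + k;
                    if (yk < 0) yk = 0;
                    else if (yk >= height) yk = height - 1;
                    acc += temp[yk * width + x] * kernel[k + radius];
                }
                result[y * width + x] = acc;
            }

        return result;
    }
}
```

On the GPU each pass maps naturally onto one dispatch, since every output pixel is computed independently.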
Also, small note: HLSL doesn't have a `ushort` type, so you'll need to convert your data into an `int[]` array to be able to use it on the GPU. In general, it might not bring much of a performance improvement if the kernel size is very small, but it might if it gets larger. Same for the frames average, you'll probably get better relative results the higher the number of frames you do the averaging across.
About the frames averaging, if working on the GPU, I'd recommend keeping the frames in memory across iterations and only swapping the first and last one each time, instead of having to copy a whole bunch of frames to and from the GPU every time. And since I've mentioned that GPU acceleration scales a lot better the larger the data you're working on is, this is an aspect where I'd expect to see a nice performance bump if you start doing averages across a decent number of frames (eg. a 20+ frames average on the GPU should be noticeably faster than on the CPU).
I also have another suggestion in general: if you're dealing with compute-intensive scenarios like this, where performance needs to be as optimized as possible, I highly recommend ditching the array indexer and using the new `Unsafe` class and APIs as much as possible (see here).
As an example of a micro-optimization you might try out, you can convert loops such as this:

```csharp
int[] array = ...; // some array
int sum = 0;

for (int i = 0; i < array.Length; i++)
    sum += array[i];
```
Into this:

```csharp
int[] array = ...; // some array
ref int r0 = ref array[0];
int sum = 0, length = array.Length;

for (int i = 0; i < length; i++)
    sum += Unsafe.Add(ref r0, i);
```
Or, for instance, you can also try to replace manual copy loops such as this with this:
```csharp
source.AsSpan().CopyTo(destination);
```
Which uses a fast implementation that comes with built-in loop unrolling, chunking (eg. copying the values as `ulong`s when enough data is present), etc., and which is more likely to be vectorized by the JIT compiler at runtime. I mean, there are a number of small tweaks like these you could try too.
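As a tiny sanity check of this substitution (hypothetical arrays; on .NET Framework the `System.Memory` package must be referenced for `AsSpan`), the manual loop and the span copy produce identical results:

```csharp
using System;

public static class CopyDemo
{
    public static void Run()
    {
        int[] source = { 1, 2, 3, 4, 5 };
        var manual = new int[source.Length];
        var spanned = new int[source.Length];

        // Manual element-by-element copy.
        for (int i = 0; i < source.Length; i++)
            manual[i] = source[i];

        // Span-based copy: unrolled and chunked internally.
        source.AsSpan().CopyTo(spanned);

        for (int i = 0; i < source.Length; i++)
            if (manual[i] != spanned[i])
                throw new Exception("Copies differ");
    }
}
```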
Just to make another example, here's how you could convert your `ushort[]` array into `int[]`, using these APIs (the `Unsafe.Add<T>` API takes care of the proper offsetting here):
```csharp
public static void CopyAsIntArray(ushort[] source, int[] destination)
{
    ref ushort ru0 = ref source[0];
    ref int ri0 = ref destination[0];
    int length = source.Length;

    for (int i = 0; i < length; i++)
        Unsafe.Add(ref ri0, i) = Unsafe.Add(ref ru0, i);
}
```
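A quick usage sketch of that helper (the surrounding harness is hypothetical; `Unsafe` lives in the `System.Runtime.CompilerServices.Unsafe` package on .NET Framework):

```csharp
using System;
using System.Runtime.CompilerServices;

public static class ConversionDemo
{
    public static void CopyAsIntArray(ushort[] source, int[] destination)
    {
        ref ushort ru0 = ref source[0];
        ref int ri0 = ref destination[0];
        int length = source.Length;

        for (int i = 0; i < length; i++)
            Unsafe.Add(ref ri0, i) = Unsafe.Add(ref ru0, i); // ushort widens to int implicitly
    }

    public static int[] Run()
    {
        ushort[] depth = { 500, 1200, 65535 };
        var widened = new int[depth.Length];
        CopyAsIntArray(depth, widened);
        return widened; // { 500, 1200, 65535 } as ints
    }
}
```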
If you do decide to experiment with ComputeSharp, definitely let me know how it goes! 😊
EDIT: had some time and tried to play around with this, I created a sample benchmark for the frames average part here. Note that my desktop has a Ryzen 7 2700X (8C/16T @ 4GHz) and a GTX 1080, so results may vary. Here's what I'm getting:
| Method | Frames | Mean | Error | StdDev | Median | Factor |
|---|---|---|---|---|---|---|
| CpuAvg | 150 | 2.540 ms | 0.0048 ms | 0.0045 ms | 2.539 ms | |
| GpuAvg | 150 | 1.891 ms | 0.0594 ms | 0.1626 ms | 1.936 ms | 1.34x |
| CpuAvg | 300 | 4.950 ms | 0.0171 ms | 0.0143 ms | 4.946 ms | |
| GpuAvg | 300 | 1.736 ms | 0.0686 ms | 0.1935 ms | 1.838 ms | 2.85x |
As expected, the larger the source data (and this is still pretty small), the greater the performance difference. If you were to calculate the average over, say, 1000 frames, the difference would be even bigger. For a smaller number of frames, say 20 or 30, both methods take less than 0.5ms, but the CPU version is faster by around 200-300µs in this case, so I'd say for such a small number of frames it's not worth it to use the GPU. It depends on how many frames you expect your project to work on.
That said, at least for the frames averaging, I'm not sure this kind of algorithm is the right choice here at all. Instead of recomputing the average for each pixel across the whole frames buffer for each new frame you get, why don't you just keep track of the last frame that was removed, and simply do, for each target value:

```
value = ((value * nFrames) - valueFromOldFrame + valueFromNewFrame) / nFrames
```

This would always run in O(1), regardless of the number of frames you're averaging across. The point is that no matter how many frames you're considering, at each iteration only the two values at the start/end of the frames buffer change; all the others remain constant anyway. I don't see why you'd want to actually recompute that partial sum every time.
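A minimal CPU sketch of this incremental update (hypothetical class names, assuming a fixed frame count; it tracks per-pixel running sums rather than the averages themselves, which avoids rounding drift):

```csharp
using System;

public sealed class RollingDepthAverage
{
    private readonly int[][] frames;   // circular buffer of the last N frames
    private readonly float[] sums;     // running per-pixel sums
    private int next;                  // index of the oldest frame (replaced next)
    private int count;                 // frames seen so far, up to N

    public RollingDepthAverage(int frameCount, int pixelCount)
    {
        frames = new int[frameCount][];
        for (int i = 0; i < frameCount; i++)
            frames[i] = new int[pixelCount];
        sums = new float[pixelCount];
    }

    // O(1) per pixel: subtract the outgoing value, add the incoming one.
    public void Push(int[] frame)
    {
        int[] oldest = frames[next];
        for (int i = 0; i < frame.Length; i++)
        {
            sums[i] += frame[i] - oldest[i];
            oldest[i] = frame[i];
        }
        next = (next + 1) % frames.Length;
        if (count < frames.Length) count++;
    }

    public float AverageAt(int i) => sums[i] / count;
}
```

The cost per frame is one pass over the pixels, independent of how many frames the average spans.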
First of all, thanks a million for this detailed reply @Sergio0694!
Ad. Frame averaging - in our specific case, the number of frames to average across has to be kept low. It is a balancing game between removing noise from the signal and keeping the sandbox interactive. If we were to average across 300 frames (10 seconds given Kinect's 30FPS), there would be ghosting issues with users' hands' movements being stored in the buffer.
Your suggestion to simplify the approach seems optimal. Thanks a lot!
Ad. Gaussian blur - don't know whether it's asking too much, but would you be willing to take this challenge and adapt your current Gaussian blur kernel to our needs? Kinect spits out `ushort[]` values but we cast these to `int` anyway, so the HLSL limitation you mentioned shouldn't matter in this case.
Hey @mariuszhermansdorfer - no worries, and yeah I can write a gaussian blur sample for you, no problem. It's already there in my sample anyway, so it'd just need some minor tweaking. I could make some tests on both the CPU and GPU actually, just in case.
A couple questions:
1) What's the gaussian blur radius you expect to use? Like, something very small like 3 or something bigger like, say, 20 or 30?
2) Since you mentioned you'd cast that array to `int` anyway, do you mean I can just forget about the data being in `ushort` originally and just take `int[]` as both input and output?
Thanks a lot @Sergio0694!
To answer your questions:
The expected radius will probably be around a few pixels. We want to get rid of the inaccuracies of the sensor but still keep the detail in the box. You can see the current implementation in action here.
The output from the Kinect is a `ushort[]`; it is then enqueued in the `renderBuffer`:

```csharp
renderBuffer.Enqueue(KinectController.depthFrameData);
```

Then we loop over all the pixels to get their elevation:

```csharp
depthPoint = KinectController.depthFrameData[i];
```

And finally we cast these to `float` to form a `Point3f`:

```csharp
tempPoint.Z = (float)((depthPoint - sensorElevation) * -unitsMultiplier);
```

So we operate only on `ushort` in terms of precision, but have to cast to `float` in the end to be able to form a `Point3f` type. Don't know what's the most efficient way to approach the casting.
Another thing is the order of operations. Currently it works as follows:
A more correct approach would be to:
The reason for our current approach was to avoid introducing a separate loop just for the averaging. Ideas on how to optimize this are more than welcome.
@mariuszhermansdorfer alright, I've created a new branch and added my implementation of a gaussian blur, using parallel processing and linearly separable kernels.
The processor applies the effect on a `float[]` buffer, so the idea would be to do the following:

1. Convert your `ushort[]` data to an `int[]` (see my previous snippet)
2. Do the frames average with the `O(1)` method I described before. This will result in a `float[]` array

I think that just with that `O(1)` approach for the frame average and the parallel processing in the gaussian blur, you should get a pretty noticeable boost in your scenario.
Hope this helps!
Thanks @Sergio0694! A very noob question - when I opened your GaussianBlurProcessor class, my Visual Studio immediately complained:

```
Error CS0103: The name 'Unsafe' does not exist in the current context
```

How does one 'unlock' these?
@mariuszhermansdorfer It's because that class is not available in .NET Framework 4.5, so I had to install an additional NuGet package for it. Just rebuild the solution, and VS will install that package for you.
@Sergio0694
The issue was this guy: `System.Runtime.CompilerServices`
Thanks a lot. Rebuilding solved it.
I've made some progress in the branch you started @Sergio0694.
The upcasting to `int` idea is brilliant and allows us to trim the depth array to the exact dimensions needed in the same loop. Your Gaussian blurring method works well too, but it turns out it is slightly slower than what we had before. Short video here.
The other one is adopted from this repo, and here is some theory behind it.
You mentioned it could potentially be made even faster if this loop were optimized. When I follow your suggestion and try to replace it with

```csharp
source.AsSpan().CopyTo(destination);
```

VS again complains I'm missing an assembly reference. According to the docs, this method should be part of the `System` namespace, no?
Any other ideas on how to make things snappier?
Implemented averaging across frames as well. Thanks for providing some inspiration @Sergio0694! I had to add some checks for edge cases, but the basic principle you suggested remains - add one element to a `queue`, remove one from the bottom if necessary, divide by the number of elements in the `queue`.
Having done some more testing, I think I'd like to stick to your Gaussian blurring method. It gives users finer control over the radius and is almost equally fast as the other one.
Hey @mariuszhermansdorfer - thanks for adding me as a contributor! 😊 So, that gaussian method you linked looks pretty interesting anyway, thanks for sharing! And glad to hear that my other implementation still works for you, yeah that was designed to both be fast and entirely customizable, so that you can use any radius you want.
As for that `AsSpan()` API, it's in the `System` namespace, but it's not in the default assembly on .NET Framework; that's why I've added a reference to `System.Memory`, you'll find it there 👍 That's the same package that also includes `Span<T>` and other high-performance APIs.
I'm not sure I'm following your paragraph about the queue for the frame average - are you actually using a queue, or was that just a figure of speech? Because with a queue you'd still be doing the operation in `O(n)`; what I suggested was just to keep the actual averaged value and mutate it, adding and subtracting the first and last values in the "view" for the current pixel.
Is that what you meant there?
EDIT: also, is this method called just once or once per frame?
Thanks for the explanation regarding the `AsSpan()` API. I'll have a look and see what it does. We're slowly reaching a point of diminishing returns, though. This part of the code is already pretty fast - shaving off another 20% is not going to result in a noticeable speed-up. Probably worth investing the time in adding more features or fixing bugs :)
Apologies for not being too precise - I do use a `queue`, but only to keep track of the last item to dispose of. It's just more practical when there is a FIFO construct to begin with.
I couldn't really figure out how to implement your logic one-to-one. It would be easier if we were dealing with an array of constant size and a constant number of frames to average across. In our case, the array length can change between frames (users trimming the extent of the scanning field interactively) and the number of frames to average can be adjusted live as well.
As a result, I ended up storing a running sum for every pixel in a separate array. On each tick I add the current depth pixel's value, remove the last one from the `queue`, and divide by the length of our render buffer. If any of the above-mentioned variables change, everything gets reset.
Would be curious to hear your opinion.
The method you are referring to is called every 20 ms. That's how user input gets translated to our logic and we are able to push stuff back to the 3d modelling environment.
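For reference, the scheme described above (per-pixel running sums plus a FIFO, with a full reset when the array length or frame count changes) might be sketched like this; the names are hypothetical, not the actual SandWorm code:

```csharp
using System;
using System.Collections.Generic;

public sealed class AdaptiveDepthAverage
{
    private readonly Queue<int[]> buffer = new Queue<int[]>();
    private long[] sums = Array.Empty<long>();
    private int capacity;

    // Called once per tick with the trimmed depth frame.
    public float[] Push(int[] frame, int framesToAverage)
    {
        // Reset if the user resized the scan area or changed the frame count.
        if (frame.Length != sums.Length || framesToAverage != capacity)
        {
            buffer.Clear();
            sums = new long[frame.Length];
            capacity = framesToAverage;
        }

        // Drop the oldest frame once the buffer is full.
        if (buffer.Count == capacity)
        {
            int[] oldest = buffer.Dequeue();
            for (int i = 0; i < sums.Length; i++)
                sums[i] -= oldest[i];
        }

        buffer.Enqueue(frame);
        for (int i = 0; i < sums.Length; i++)
            sums[i] += frame[i];

        // Divide by the actual buffer length, which may still be filling up.
        var averaged = new float[sums.Length];
        for (int i = 0; i < sums.Length; i++)
            averaged[i] = (float)sums[i] / buffer.Count;
        return averaged;
    }
}
```

This keeps the per-tick cost at one pass over the pixels while tolerating live changes to both the scan extent and the averaging depth.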
Implemented in the latest commit. #28 I ended up using your algorithm after all, @Sergio0694. It does the job within 1ms and gives users a lot of flexibility in tweaking the radius. Thanks a lot for your contribution!
Short video of the results
Hey @mariuszhermansdorfer - that's great, I'm glad I could help! 😊
Test different smoothing algorithms (gaussian blur, weighted average, etc.) and implement the best one to work directly on the depth array.