mariuszhermansdorfer / SandWorm

Augmented Reality Sandbox for Grasshopper & Rhino
MIT License

Azure Depthmap Distortion Revisited #75

Closed: mariuszhermansdorfer closed this issue 2 years ago

mariuszhermansdorfer commented 3 years ago

@philipbelesky, @BarusXXX let's start a new thread about this.

It took me a while, but I think I've finally wrapped my head around how Kinect Azure understands the world. Here are some important facts:

  1. The depth camera is physically tilted by 6° in relation to the RGB camera (source)

[image]

  2. Transforming from depth camera space to RGB camera space yields a pixel count equivalent to the current RGB camera mode. This is overkill for our use case, especially considering that the resulting pixels are interpolated and hence don't add any meaningful information (reference)

  3. Transforming from depth camera space directly to a point cloud keeps the original pixel count and results in undistorted 3D geometry. Under the hood, a lookup table is generated for each 2D depth pixel with a translation vector defining the XYZ coordinates of the resulting 3D points. (reference)

  4. Unfortunately, there is a lot of jitter in the depth signal, which results in pixels 'jumping' on the XY plane. To compensate for it, we previously used a fixed rectangular grid generated when 'sensorElevation' was set, and subsequently changed only the Z values of this grid. Kinect Azure doesn't have a rectangular depth output, which makes our task less trivial, but we can 'undistort' its depth image by leveraging the lookup table mentioned above (reference)

  5. Once 'sensorElevation' is set (either manually or through the calibration process), we need to calculate the 'ideal' XY spacing of our 3D pixels. The aforementioned camera tilt needs to be taken into consideration here. We generate another lookup table, storing each pixel's vertical deviation per unit of Z elevation. Then we create a 'fake' depth map with 'ideal' uniform elevations (this doesn't mean that individual pixel values are identical, but they correspond to a 'flat' plane) and pass it to our transformation logic to obtain 'ideal' XY coordinates for the given 'sensorElevation'.

[image]

  6. Once we have undistorted points on the XY plane, we need to compensate for the 6° tilt. Trigonometry gets the job done: we leverage a sine function, treating the Y axis as the hypotenuse.

Side view before: [image]

Side view after: [image]

There are some outliers, but it's close enough to accept as a final solution: [image]

  7. Now, we're more or less done. We can keep the lookup tables as long as the camera depth mode (wide/narrow & binned/unbinned) and sensorElevation remain unchanged. Computationally, this saves a lot of resources per cycle, since the X and Y coordinates of our mesh vertices remain constant and we only need to adjust the Z values to compensate for the 6° tilt. A rough sketch of this idea follows below.
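To make this more concrete, here is a minimal sketch of steps 5–7. The names (`xyTable`, `sensorElevation`) and the sign conventions are my shorthand, and the per-pixel unit rays are assumed to come from the SDK's depth-to-point-cloud lookup table; this is not the actual SandWorm code:

```csharp
using System;
using System.Numerics;

// Illustrative sketch of steps 5–7. 'xyTable' is assumed to hold the per-pixel
// unit rays (x, y per unit of depth) taken from the SDK lookup table.
static class TiltCorrectionSketch
{
    const double TiltRadians = 6.0 * Math.PI / 180.0; // depth camera tilt vs. RGB camera

    // Step 5: build a 'fake' depth map describing a perfectly flat table at
    // 'sensorElevation'. Pixel values differ because of the 6° tilt, but they all
    // lie on the same flat plane. Feeding this map to the transformation logic
    // yields the 'ideal', jitter-free XY coordinates that we cache.
    public static double[] BuildIdealDepthMap(Vector2[] xyTable, double sensorElevation)
    {
        var idealDepth = new double[xyTable.Length];
        for (int i = 0; i < xyTable.Length; i++)
            idealDepth[i] = sensorElevation * (1.0 + xyTable[i].Y * Math.Sin(TiltRadians));
        return idealDepth;
    }

    // Steps 6–7: per frame, only the Z values are updated. The metric Y coordinate
    // (ray.Y * depth) is treated as the hypotenuse of a right triangle with the
    // 6° angle, so the correction removed from Z is Y * sin(6°).
    public static void UpdateZ(Vector2[] xyTable, ushort[] rawDepth, double[] z)
    {
        for (int i = 0; i < rawDepth.Length; i++)
            z[i] = rawDepth[i] * (1.0 - xyTable[i].Y * Math.Sin(TiltRadians));
    }
}
```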

I have a working prototype branch on my machine. Will clean up the code a bit and publish to the repo on Monday.

mariuszhermansdorfer commented 3 years ago

Here are a few additional screengrabs with the aspect analysis activated to better visualize what is going on.

[image]

We can clearly see how the amount of distortion increases towards the edges of the depth map. That's why the smart people at Microsoft decided to clip the signal to a hexagonal pattern to avoid unreliable readings. The quasi-horizontal streaks at the edges correspond to '0' values in the depth map.

[image]

To obtain a rectangular image, we need to trim off a significant portion from the sides: approximately 130 columns from left/right and 20 rows from top/bottom. Interestingly, this leaves us with a lower resolution depth image than the Kinect V2:

Azure (trimmed): 380 x 536 = 203,680 pixels
Kinect V2: 512 x 424 = 217,088 pixels
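For reference, the arithmetic behind those numbers, assuming the NFOV unbinned 640 x 576 depth mode as the starting point:

```csharp
// Back-of-the-envelope check of the trimmed resolution (NFOV unbinned assumed).
const int width = 640, height = 576;   // raw NFOV unbinned depth frame
const int trimColumns = 130;           // trimmed per side (left and right)
const int trimRows = 20;               // trimmed per side (top and bottom)

int trimmedWidth = width - 2 * trimColumns;        // 380
int trimmedHeight = height - 2 * trimRows;         // 536
int trimmedPixels = trimmedWidth * trimmedHeight;  // 203,680 vs. Kinect V2's 512 * 424 = 217,088
```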

[image]

Maybe using the wide field-of-view mode will help here. I'm skeptical, though, because with the increased resolution comes a massive increase in the area covered. Therefore, the pixel-per-mm ratio will probably be worse in the WFOV mode than in the NFOV mode.

When working with undistorted coordinates, the resulting mesh doesn't follow a rectangular outline. The underlying pixels are still ordered in a nice grid, but their corresponding XY coordinates expand radially outwards:

[image]

A close-up look at the underlying mesh from the Top view: [image]

Also, since our elevation banding logic hasn't been updated yet, we can clearly see the 6° tilt of the depth camera in action. The mesh pipeline is being fed corrected Z values, while the coloring function (still) operates on raw data from the depth sensor. [image]

I'm really puzzled about why the depth camera is tilted in relation to the RGB sensor. I simply fail to see the advantage of that setup. Any ideas?

philipbelesky commented 3 years ago

I don't have any ideas for solutions, but I had noticed the issues you mentioned regarding the FoV crop and the tilt. I hadn't done the math on the resolution though; that a rectangular crop ends up back at Kinect 2 resolution very much undercuts the value of the newer device.

The last time I ran a sandbox setup I ended up using the Wide FOV mode and positioned the camera low enough to cover the whole table (0.9m x 1.8m). However, it's worth noting that having the camera that close to the table also means that depth data towards the far edges really suffers from occlusion, e.g. any 'high' elements near the edge will obscure lower ones.

I wonder if a more viable path for a 'high res' setup is either to shift to a circular table geometry or to employ a multiplexed pair of devices.

mariuszhermansdorfer commented 3 years ago

Good points, @philipbelesky. Working with the WFOV mode and having the camera low above the table makes overhead projection trickier as well. I can imagine the sensor getting into the projector's FOV and casting shadows onto the sand. Having 2 chained Kinects in NFOV mode would definitely help increase the resolution and allow us to work with bigger models. Point clouds should be straightforward (the SDK has existing definitions for that purpose), but meshing would probably require significant rework.

I can imagine the hardware setup looking quite bad-ass, though, with two Kinects on the sides and a centrally mounted projector + a screen.

On another note, seeing the undistortion matrix got me thinking about whether our approach with Kinect for Windows was correct. We assume a perfectly rectangular distribution of pixels coming from the depth camera and only adjust the XY spacing based on the sensor's elevation: https://github.com/mariuszhermansdorfer/SandWorm/blob/c21d745222f248e18dba1d46cdfd692a3bae0e86/SandWorm/Core.cs#L77
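To illustrate the assumption (my shorthand, not the actual Core.cs code): the depth sensor is treated as an ideal pinhole camera, so the XY spacing becomes a single constant derived from the field of view and the elevation, with no lens distortion taken into account:

```csharp
using System;
using Rhino.Geometry;

// Illustrative only – a simplified version of the rectangular-grid assumption.
static Point3d[] BuildRectangularGrid(int depthWidth, int depthHeight,
                                      double sensorElevation, double horizontalFovRadians)
{
    // One spacing value for every pixel, derived from elevation and field of view.
    double pixelSpacing = 2.0 * sensorElevation * Math.Tan(horizontalFovRadians / 2.0) / depthWidth;

    var vertices = new Point3d[depthWidth * depthHeight];
    for (int row = 0; row < depthHeight; row++)
        for (int col = 0; col < depthWidth; col++)
            vertices[row * depthWidth + col] = new Point3d(col * pixelSpacing, row * pixelSpacing, 0);
    return vertices;
}
```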

But the Kinect for Windows depth camera must also exhibit some barrel-like distortion. Wouldn't it be a better idea to compensate for that as well? It would result in a butterfly-shaped mesh like the one pictured above, but would hopefully yield more correct geometry.

I always had a lot of issues aligning the projected image with the sand model and kind of assumed that it had to do with the projector. It might be that our simplified way of generating the mesh was the root cause here.

mariuszhermansdorfer commented 2 years ago

@philipbelesky, @BarusXXX here is the newest branch for you to play with: https://github.com/mariuszhermansdorfer/SandWorm/tree/feature/SandWorm2-0

I did some additional testing of how to address the 6° tilt correction and am coming to the conclusion that the best way of removing any distortion is to re-implement the calibration procedure from before. Users would first align the hardware (mount the sensor over a clean table, trim the excess rows/columns in the software) and hit the 'Calibrate' button. This generates a lookup table for every pixel with deviation values from the elevation as measured at the center. Essentially, it's a way of flattening the table digitally and removing all hardware and software misalignments.

Relevant bits of code:

Here we average over 60 frames and generate the lookup table: https://github.com/mariuszhermansdorfer/SandWorm/blob/c21d745222f248e18dba1d46cdfd692a3bae0e86/SandWorm/Components/SetupComponent.cs#L158-L160

Here we apply the correction: https://github.com/mariuszhermansdorfer/SandWorm/blob/c21d745222f248e18dba1d46cdfd692a3bae0e86/SandWorm/Components/BaseKinectComponent.cs#L139
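In spirit, the two pieces amount to something like this sketch (names and signatures are mine, not the actual component code):

```csharp
using System.Collections.Generic;

// Sketch of the calibration idea: average N frames of a clean, empty table and
// store every pixel's deviation from the centre pixel. Subtracting that table
// per frame 'flattens' the table digitally, absorbing the 6° tilt and any
// mounting misalignment. Names are illustrative only.
static class CalibrationSketch
{
    public const int FramesToAverage = 60;

    public static double[] BuildElevationLookup(IReadOnlyList<ushort[]> frames, int width, int height)
    {
        int pixelCount = width * height;
        var averaged = new double[pixelCount];
        foreach (var frame in frames)
            for (int i = 0; i < pixelCount; i++)
                averaged[i] += (double)frame[i] / frames.Count;

        double centerDepth = averaged[(height / 2) * width + (width / 2)]; // reference elevation
        var lookup = new double[pixelCount];
        for (int i = 0; i < pixelCount; i++)
            lookup[i] = averaged[i] - centerDepth; // per-pixel deviation from the centre
        return lookup;
    }

    // Per frame: subtract the stored deviation so that a clean table reads as flat.
    public static void ApplyCorrection(ushort[] rawDepth, double[] lookup, double[] corrected)
    {
        for (int i = 0; i < rawDepth.Length; i++)
            corrected[i] = rawDepth[i] - lookup[i];
    }
}
```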

And here is the functionality in action: https://www.youtube.com/watch?v=Hl-MdGNDagM

I like this idea for its relative simplicity; it does come with some drawbacks, though:

  1. Users must have the option to scan a clean table for this approach to work (sometimes models are difficult to move around)
  2. Changing anything about the hardware (sensor position, angle etc.) or software setup (amount of rows/columns to trim) requires recalibration
  3. Triggering it by accident can lead to confusion among users (whatever was scanned before is now treated as a flat area)
  4. Difficult to undo
  5. Persistent storage of results and re-instantiating across sessions can be tricky

Please test the newest branch and let me know what you think.