microsoft / HoloLensForCV

Sample code and documentation for using the Microsoft HoloLens for Computer Vision research

Understanding the Transformations between Depth and HD Camera to Align Content #119

Open LisaVelten opened 4 years ago

LisaVelten commented 4 years ago

Hi everyone,

First I posted my question under a closed issue, but I thought it might be useful to open a new issue. My problem is that I cannot get my depth images and HD images aligned. The picture below visualises the problem: I try to align a calibration pattern. First I filter the 3D depth points that belong to the calibration pattern, then I project these points into the HD image.

CalibrationPattern_ViewPointLeft

To figure out the problem, I investigated the meaning/content of the transformation matrices in more detail. In the following I outline my understanding of the transformation matrices and then describe how I try to align my images. I would appreciate your help very much!

1. CameraCoordinateSystem (MFSampleExtension_Spatial_CameraCoordinateSystem): In the HoloLensForCV sample this coordinate system is used to obtain the "FrameToOrigin" transformation. FrameToOrigin is obtained by transforming the CameraCoordinateSystem into the OriginFrameOfReference (lines 140-142 in MediaFrameReaderContext.cpp).

I still do not exactly know what is described by this transformation. What is meant by "frame"?

Through experimenting I found out that the translation vector changes when moving. In fact, the changes do make sense: if I move forward, the z-component becomes smaller. This agrees with the coordinate systems in the image below, where the z-axis points in the opposite direction of the image plane.

coordinatesystems

The same applies to moving left or right: moving right makes the x-component increase. The y-component stays roughly stable, which makes sense as I am not moving up or down. What I am really uncertain about is the rotational part of the transformation matrix: it is almost an identity matrix. The rotation of my head seems to be contained in the CameraViewTransform instead, which I describe in the second point.

As far as I understand, the FrameToOrigin matrix looks as follows (row-major, translation in the last row): [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, x, y, z, 1]

To me, "FrameToOrigin" seems to describe the relation between a fixed point on the HoloLens and the origin (the origin is defined each time the app is started; this helps to map each frame to a common frame of reference). In the image above the origin is probably the "App-specific Coordinate System".
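For reference, this is how I read such a flattened matrix, here as a minimal numpy sketch (assuming the 16 floats from the stream metadata are row-major with the translation in the last row, as the layout above suggests; the translation values are made up):

```python
import numpy as np

# Flattened 4x4 matrix as delivered with the frame metadata (hypothetical values)
frame_to_origin_flat = [1, 0, 0, 0,
                        0, 1, 0, 0,
                        0, 0, 1, 0,
                        0.1, 1.4, -0.5, 1]

M = np.array(frame_to_origin_flat, dtype=np.float64).reshape(4, 4)
R = M[:3, :3]   # rotational part (almost an identity matrix in my recordings)
t = M[3, :3]    # translation of the frame's coordinate system in the origin frame

print(R)
print(t)        # moving forward makes z shrink, moving right makes x grow
```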

2. CameraViewTransform (MFSampleExtension_Spatial_CameraViewTransform): The CameraViewTransform is stored directly with each frame (in contrast to FrameToOrigin, no further transformation is necessary).

The rotation of the head seems to be stored in the rotational part of this matrix. I tested this by rotating my head around the y-axis. If I turn about 180° to the right around my y-axis, the rotational part looks as follows: [0, 0, 1, 0, 1, 0, -1, 0, 0]. This corresponds to a rotation around the y-axis, which is what we expect.

The translational part seems to stay roughly stable. This would make sense if the translational part described the translation between the fixed point on the HoloLens and the respective camera (HD or depth). However, I would then expect the translational part to stay exactly the same, which is not the case: it is only approximately equal.

If I do not turn my head (rotational part is an Identity Matrix) the CameraViewTransform looks as follows:

CameraViewTransform for HD Camera [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0.00631712, -0.184793, 0.145006, 1]

CameraViewTransform for Depth Camera [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0.00798517, -0.184793, 0.0537722, 1]

So the CameraViewTransform seems to capture the rotation of the user's head. But what is captured by the translational part? If it is the offset between a fixed point on the HoloLens and the respective camera, why is it not always exactly the same?
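To sanity-check this interpretation, both matrices can be composed into a camera-to-origin pose. A minimal numpy sketch, assuming the row-vector convention used throughout (point * matrix, translation in the last row); the FrameToOrigin values are made up for illustration:

```python
import numpy as np

# CameraViewTransform for the HD camera, copied from above
view_hd = np.array([1, 0, 0, 0,
                    0, 1, 0, 0,
                    0, 0, 1, 0,
                    0.00631712, -0.184793, 0.145006, 1], dtype=np.float64).reshape(4, 4)

# Hypothetical FrameToOrigin (identity rotation, made-up translation)
frame_to_origin_hd = np.array([1, 0, 0, 0,
                               0, 1, 0, 0,
                               0, 0, 1, 0,
                               0.2, 1.5, -1.0, 1], dtype=np.float64).reshape(4, 4)

# CameraViewTransform maps frame -> camera view space, FrameToOrigin maps frame -> origin,
# so camera view space -> origin is inv(view) followed by FrameToOrigin:
camera_to_origin = np.linalg.inv(view_hd) @ frame_to_origin_hd

# The camera's position in the origin frame is the image of the view-space origin
camera_position = np.array([0, 0, 0, 1]) @ camera_to_origin
print(camera_position[:3])   # = FrameToOrigin translation minus view translation (both rotations are identity)
```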

3. CameraProjectionTransform (MFSampleExtension_Spatial_CameraProjectionTransform) This transformation is described on the following github page: https://github.com/MicrosoftDocs/mixed-reality/blob/5b32451f0fff3dc20048db49277752643118b347/mixed-reality-docs/locatable-camera.md

However, what is still unclear is the meaning of the terms A and B.
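My best guess: if the CameraProjectionTransform follows the usual Direct3D-style perspective layout, A and B would be built from the near and far clip distances and would only affect the projected depth value, not the x/y image coordinates, so they could be ignored when projecting 3D points to pixels. A small numpy sketch of that assumption (my own construction, not taken from the sample), again in the row-vector convention:

```python
import numpy as np

def projection_matrix(fx, fy, cx, cy, z_near, z_far):
    # Direct3D-style right-handed perspective projection, row-vector convention:
    # [x, y, z, 1] @ P, followed by division by the w component.
    A = z_far / (z_near - z_far)
    B = z_near * z_far / (z_near - z_far)
    return np.array([[fx,  0.0, 0.0,  0.0],
                     [0.0, fy,  0.0,  0.0],
                     [cx,  cy,  A,   -1.0],
                     [0.0, 0.0, B,    0.0]])

point_cam = np.array([0.1, -0.05, -2.0, 1.0])   # view space looks down -z

# fx, fy, cx, cy are hypothetical values here
for z_near, z_far in [(0.1, 20.0), (0.5, 100.0)]:
    clip = point_cam @ projection_matrix(2.0, 3.5, 0.01, -0.02, z_near, z_far)
    ndc = clip / clip[3]
    print(ndc)   # x and y are identical for both clip-plane choices; only z changes
```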

My aim is to map between the depth camera and the HD camera of the HoloLens. To do this I do the following:

  1. I record images with the Recorder Tool of the HoloLensForCV sample
  2. I take a depth image and look for the corresponding HD image by checking the timestamps.
  3. I use the unprojection mapping to find the 3D points in the CameraViewSpace of the depth camera.
  4. I transform the 3D points from the depth camera view to the HD camera view and project them onto the image plane. I use the following chain of transformations: [PixelCoordinates.x, PixelCoordinates.y, 1, 1] = [3D depth point, 1] * inv(CameraViewTransform_Depth) * FrameToOrigin_Depth * inv(FrameToOrigin_HD) * CameraViewTransform_HD * CameraProjectionTransform_HD

These pixel coordinates are in the range from -1 to 1 and need to be adjusted to the image size of 720x1280. This is done as follows: x_rgb = 1280 * (PixelCoordinates.x + 1) / 2; y_rgb = 720 * (1 - ((PixelCoordinates.y + 1) / 2));
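For reference, the same chain as a minimal numpy sketch (the names are illustrative; the matrices are the 4x4 row-major values read from the stream metadata, and the perspective divide by the homogeneous w component is included):

```python
import numpy as np

def depth_point_to_hd_pixel(p_depth_view,
                            view_depth, frame_to_origin_depth,
                            frame_to_origin_hd, view_hd, proj_hd,
                            hd_width=1280, hd_height=720):
    """Map a 3D point in the depth camera's view space (metres, row-vector
    convention) to pixel coordinates in the HD image."""
    p = np.array([p_depth_view[0], p_depth_view[1], p_depth_view[2], 1.0])

    # depth view space -> depth frame -> origin -> HD frame -> HD view space -> clip space
    clip = (p
            @ np.linalg.inv(view_depth)
            @ frame_to_origin_depth
            @ np.linalg.inv(frame_to_origin_hd)
            @ view_hd
            @ proj_hd)

    ndc = clip / clip[3]                                # perspective divide, x/y in [-1, 1]

    x_rgb = hd_width * (ndc[0] + 1.0) / 2.0
    y_rgb = hd_height * (1.0 - (ndc[1] + 1.0) / 2.0)    # image y axis points down
    return x_rgb, y_rgb
```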

Result: when transforming my detections from the depth camera to the HD camera image, the objects (in this case the calibration pattern) are not 100% aligned. So I am trying to figure out where the misalignment comes from. Am I misunderstanding the transformation matrices, or has anyone experienced similar problems?

The problem might occur if the spatial mapping of the HoloLens is not working 100% correctly, which could happen if the HoloLens cannot find enough features to map the room. Thus, I tested my setup in different rooms, especially in smaller rooms with more clutter in the background (so that the HoloLens can find more features to map the room). However, the problem still occurs. As I outlined above, the rough appearance of the transformations seems to be correct. I do not have any idea how to test the transformation matrices further to pin down the problem.

I would appreciate your help very much! Thanks a lot in advance! Lisa

LisaVelten commented 4 years ago

@argo-1, maybe you're working on a similar problem, so I want to present my solution.

As so often: after you have been working on a problem for a few days and do not know what else to try, you ask for help and then find the solution yourself…

So my specific problem was that my reconstructed 3D depth points were still in mm, while the transformation matrices are in m. The second problem was that you need to multiply the depth values by -1. Comment in case it is still not working for you.
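In code, the two fixes amount to something like this sketch (assuming the unprojection map gives the per-pixel (x, y) position on the z = 1 unit plane; the exact sign handling depends on how the unprojection map is defined in your pipeline):

```python
import numpy as np

def depth_pixel_to_view_point(depth_raw_mm, unit_plane_xy):
    """Convert one raw depth value plus its unprojection-map entry into a 3D
    point in the depth camera's view space (metres, camera looks down -z)."""
    depth_m = depth_raw_mm / 1000.0   # fix 1: the sensor values are mm, the transforms are m
    x, y = unit_plane_xy              # per-pixel (x, y) on the z = 1 unit plane
    # fix 2: view space looks down the negative z axis, so the depth gets a minus sign
    # (if the raw value is a range along the ray, divide by sqrt(x*x + y*y + 1) first)
    return np.array([x * depth_m, y * depth_m, -depth_m])
```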

I have solved the main problem now. However, I would still like to gain some further knowledge about the content of the transformation matrices, so I will not close the issue at the moment. :-)

cxnvcarol commented 4 years ago

Glad to hear you fixed it. Could you maybe share your code? Thanks in advance.

LisaVelten commented 4 years ago

It is Matlab code for now, if that is okay for you; later I will port the code to C++. Let me structure the code a bit better and then I will post it.

Do you have a further understanding of the content of the matrices?

cyberj0g commented 4 years ago

Hi all, did you see my C++ implementation? The sequence of transform matrices is a standard 2D -> 3D -> 3D -> 2D mapping, with an additional unprojection structure to go from depth 2D to depth-camera unit-plane 3D at step 1.

LisaVelten commented 4 years ago

Hi, yes, your implementation was actually quite a good help for me, thank you! Do you have any further knowledge regarding the content of the matrices? Everything I tried was based on these examples and information in the forum, as I cannot find any detailed information from Microsoft.

argo-1 commented 4 years ago

@LisaVelten,

Great to hear that you fixed the issue! I am working with the VL cameras for 3D reconstruction with stereo vision.

I've modified the StreamerVLC tool ( https://github.com/microsoft/HoloLensForCV/pull/118 ) to send synchronized stereo frames from a background task using websockets and ROS, thus allowing for an immersive app in the foreground. I am yet to use the depth and RGB camera streams, but it's definitely been on my mind as something to examine in the future for my project.

Though, by going through the repo code, I've been able to get the data I've needed so far, I did not yet have a more intuitive understanding of the provided transformations. Thank you for your comprehensive detailing! I shall keep following this issue, and others, in case more useful tidbits come this way.

LisaVelten commented 4 years ago

Dear all, I am very sorry for my late reply, but I have finally prepared a script (Matlab code) that describes the mapping between depth and RGB. I hope this helps to give a better understanding.

CalibrationPattern_depth2rgbMapping.zip

zhuisa commented 3 years ago

@LisaVelten Hi, I have some questions: how do I get the current view's depth image and the view/projection matrices so that I can place a hologram at a specific position?

tghoshmo commented 3 years ago

Dear all, I am very sorry for my late reply, but I have finally prepared a script (Matlab code) that describes the mapping between depth and RGB. I hope this helps to give a better understanding.

CalibrationPattern_depth2rgbMapping.zip

Could you please explain the following lines?

% u,v Coordinates in Camera Coordinate System
ulc_calibPattern_u = (ulc_calibPattern_x_rgb_minusone2one - (0.0701278*(-1)))/2.43247;
ulc_calibPattern_v = (ulc_calibPattern_y_rgb_minusone2one - ((-0.0997288)*(-1)))/4.31968;

What do the numbers indicate? Where do they come from?

Thanks.

NTUZZH commented 1 year ago

Hi all,

I'm creating HoloLens 2 software that detects corners in an image and then shows their actual locations in the world. However, there are slight deviations between the projected points and their actual locations, and the deviations increase with distance from the user, as the following figure shows.

image

Here's my code snippet for projecting the detected corners: I convert the corners from image to screen coordinates, then to NDC for unprojection. Next, I use the inverse of the projection matrix to convert the points from NDC to camera space, and then from camera space to world space using the camera-to-world matrix. Finally, I instantiate the detected points using a Ray and RaycastHit.

image

To get the cameraToWorldMatrix and projectionMatrix, I referred to the XRIV work.

My guess is that either the depth information is not estimated properly or the values of the cameraToWorldMatrix and projectionMatrix are not correct. However, I have looked into both possibilities without success.
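To make the steps easier to check, here is the same unprojection written as a small numpy sketch (names are illustrative; column-vector convention as in Unity and an OpenGL-style NDC are assumptions on my side):

```python
import numpy as np

def pixel_to_world_ray(u, v, width, height, proj, cam_to_world):
    """Build a world-space ray through pixel (u, v). Column-vector convention
    (point' = M @ point), as used by Unity's Matrix4x4."""
    # pixel -> NDC (x right, y up, both in [-1, 1]); the image v axis runs downwards
    ndc = np.array([2.0 * u / width - 1.0,
                    1.0 - 2.0 * v / height,
                    -1.0,                    # an arbitrary depth on the near side
                    1.0])

    # NDC -> camera space via the inverse projection, then perspective divide
    p_cam = np.linalg.inv(proj) @ ndc
    p_cam /= p_cam[3]

    # camera space -> world space
    p_world = cam_to_world @ p_cam
    cam_pos = cam_to_world @ np.array([0.0, 0.0, 0.0, 1.0])

    origin = cam_pos[:3]
    direction = p_world[:3] - cam_pos[:3]
    return origin, direction / np.linalg.norm(direction)
```

Casting a physics ray along this direction and placing the hologram at the hit point is then the same as what the code above does with Ray and RaycastHit.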

If anyone could give me some insights on how to mitigate this projection deviation, I would greatly appreciate it. Thanks in advance!