sjy234sjy234 / KinectFusion-ios

A demo of KinectFusion running on iOS
MIT License

Coordinate systems in shader #7

Closed: megamanzero23 closed this issue 10 months ago

megamanzero23 commented 11 months ago

Hi there

I wanted to quickly clarify the coordinate systems being used in your pipeline.

1. Could you let me know whether I understand your sign flips (the -d and -y) correctly? Is my understanding right that the resulting points are in a camera/sensor coordinate system where y points down and z points into the screen, unlike the real-world coordinate system in my example code further below (depth -> points in real-world coordinates)?

In your shader fuDepthToVertex, I see that you flip the signs of y and z:

        outVertexMap[outvid]=d*(u-centerU)*focalInvert;
        outVertexMap[outvid+1]=-d*(v-centerV)*focalInvert;
        outVertexMap[outvid+2]=-d;

In fuICPPrepareMatrix, you flip the signs again:

    float d= -currInPreFrameVertex.z;
    int u= round(currInPreFrameVertex.x*intrinsic_XYZ2UVD.focal/d+intrinsic_XYZ2UVD.centerU);
    int v= round(-currInPreFrameVertex.y*intrinsic_XYZ2UVD.focal/d+intrinsic_XYZ2UVD.centerV);


Usually, when I unproject a depth map into a point cloud in the real-world coordinate system, I use the following, where y points up and z points away from the screen:

  uint2 pos;
    pos.y = vertexID / depthTexture.get_width();
    pos.x = vertexID % depthTexture.get_width();

    // depthDataType is kCVPixelFormatType_DepthFloat16
    float depth = depthTexture.read(pos).x * 1000.0f;

    float xrw = (pos.x - cameraIntrinsics[2][0]) * depth / cameraIntrinsics[0][0];
    float yrw = (pos.y - cameraIntrinsics[2][1]) * depth / cameraIntrinsics[1][1];

    float4 xyzw = { xrw, yrw, depth, 1.f };

    out.clipSpacePosition = viewMatrix * xyzw;
    out.coor = { pos.x / (depthTexture.get_width() - 1.0f), pos.y / (depthTexture.get_height() - 1.0f) };
    out.depth = depth;
    out.pSize = 5.0f;
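
To convince myself about the sign convention, I also wrote a tiny round-trip check in the flipped convention (just a sketch with made-up intrinsics and pixel values, not code from your repo):

    // Sketch: unproject one pixel with the same sign flips as fuDepthToVertex,
    // then reproject it the way fuICPPrepareMatrix does. All values are made up.
    bool roundTripIsConsistent()
    {
        const float focal = 500.0f, centerU = 320.0f, centerV = 240.0f;
        const float u = 100.0f, v = 50.0f, d = 1200.0f;   // pixel (u, v) with some depth d

        // unproject: negating y and z gives a camera frame with y up and z toward the viewer
        float x = d * (u - centerU) / focal;
        float y = -d * (v - centerV) / focal;
        float z = -d;

        // reproject: negating again recovers the original pixel and depth exactly
        float d2 = -z;
        float u2 = x * focal / d2 + centerU;
        float v2 = -y * focal / d2 + centerV;

        return fabs(u2 - u) < 1e-3f && fabs(v2 - v) < 1e-3f && fabs(d2 - d) < 1e-3f;
    }

If u2, v2, and d2 come back unchanged, the two shaders are at least self-consistent about the convention.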

2. Additionally, I see that you set invalid depth values to a very large number. May I ask why you chose that large value instead of NaN? Is it for efficiency, i.e., having finite numbers makes the matrix solver behave better than NaN would? Or is it more for easier debugging, to make errors easier to spot?

        //for invalid depth data, set vertex to unreasonable value
        outVertexMap[outvid]=10000000.0;
        outVertexMap[outvid+1]=10000000.0;
        outVertexMap[outvid+2]=10000000.0;
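
My guess on the efficiency side, as a sketch (validCorrespondence and maxDist are made-up names, not from your code), is that a huge but finite sentinel simply fails the usual distance threshold in the correspondence search, whereas NaN would propagate through any arithmetic and can behave unpredictably under the GPU's fast-math defaults:

    // Sketch: a vertex set to 1e7 ends up far from every candidate match, so the
    // ordinary distance test filters it out with no special casing.
    bool validCorrespondence(float3 p, float3 q, float maxDist)
    {
        return length_squared(p - q) < maxDist * maxDist;
    }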

3. In your pipeline you also use simd::float4x4 m_frameToGlobalTransform and simd::float4x4 m_globalToFrameTransform. Do these follow the right-hand rule like ARKit's world frame, where x points right, y points up, and z points toward the viewer?

https://developer.apple.com/documentation/arkit/arkit_in_ios/configuration_objects/understanding_world_tracking

> In all AR experiences, ARKit uses world and camera coordinate systems following a right-handed convention: the y-axis points upward, and (when relevant) the z-axis points toward the viewer and the x-axis points toward the viewer's right.
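
If it helps to check question 3 concretely, the test I would run on m_frameToGlobalTransform is something like this (a sketch in shader-style types; isRightHanded is my own name):

    // Sketch: a rigid transform follows the right-hand rule when its rotation
    // columns form a right-handed basis, i.e. cross(X, Y) points along Z.
    bool isRightHanded(float4x4 T)
    {
        float3 xAxis = T[0].xyz;
        float3 yAxis = T[1].xyz;
        float3 zAxis = T[2].xyz;
        return dot(cross(xAxis, yAxis), zAxis) > 0.0f;
    }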

===

  4. And lastly (if you have time): frameToGlobalTransform and globalToFrameTransform seem to accumulate the transformation across the ICP iterations. I can see how the two point clouds get closer and closer as more incremental matrices are multiplied inside the ICP loop for(int it=0;it<iteratorNumber;++it), but would it make sense to reset these two variables at the start of the next processFrame() to avoid drift, if I only use frame-to-frame alignment instead of frame-to-model like you do? On some occasions I have seen the transformation solved incorrectly, so I don't want to multiply by a wrong matrix, and I'd also like to stop the iteration early once the point clouds have converged, to save compute (roughly the sketch after this question). I'd appreciate your input. Thank you!
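
For reference, the early exit I have in mind is roughly this (a sketch in shader-style types; the names and thresholds are mine):

    // Sketch: stop iterating once the latest ICP increment barely moves the camera.
    // For a small rotation, the off-diagonal entries of the rotation block are
    // approximately the rotation angle components (in radians).
    bool incrementIsNegligible(float4x4 inc, float transEps, float rotEps)
    {
        float3 t = inc[3].xyz;  // translation column
        float r = fabs(inc[0].y) + fabs(inc[0].z) + fabs(inc[1].z);
        return length(t) < transEps && r < rotEps;
    }
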
sjy234sjy234 commented 11 months ago

Well, I recommend you check 1~3 yourself, because it does not matter whether you use a right-handed or a left-handed system, and it is easy to transform from one system to the other if needed.

For question 4, the original KinectFusion does not solve that problem. You may look at bundle adjustment, which uses multi-frame optimization to achieve drift-free poses. Resetting the transforms would give worse results and might fail on larger pose changes, because iterating frame by frame is a built-in assumption of the KinectFusion iteration.

megamanzero23 commented 11 months ago

> Well, I recommend you check 1~3 yourself, because it does not matter whether you use a right-handed or a left-handed system, and it is easy to transform from one system to the other if needed.
>
> For question 4, the original KinectFusion does not solve that problem. You may look at bundle adjustment, which uses multi-frame optimization to achieve drift-free poses. Resetting the transforms would give worse results and might fail on larger pose changes, because iterating frame by frame is a built-in assumption of the KinectFusion iteration.

Thank you for your quick reply!

I'm still new to CG, so I'm just making sure my intuition is correct as I read through your code and try to understand Apple's documentation on depth maps as well:

In the ICP shader, the z component of the vertices is negated again (the currentVMap already has negative z from the DepthToVertex step), so I'm guessing that negating z here puts the vertices into the global/world space, where z is positive, for the ICP calculation (because the extrinsic matrix is the camera transformation in global space).

sjy234sjy234 commented 11 months ago

Actually, I do not remember the implementation in that much detail, since it has been a long time, and I was new to CG as well when working on this project. The world space might not be strictly formal. Your interpretation seems reasonable. You could just save or print the values and work them out offline to validate your assumptions.

megamanzero23 commented 11 months ago

Thank you. I'm curious how you debugged your Metal shaders when you worked on this kind of CG project (I'm eager to learn good practices), since there are many components and transformations to keep track of. I like how you save the depth data into a *.bin file so that you get consistent results across multiple trials. When you say calculate offline, I'm guessing either Python or MATLAB, since these are probably quicker for processing depth images.

However, I imagine components like the ICP reduction kernel are a bit harder to debug, since there are multiple threadgroups to manage. On that note, I see that you used Metal Performance Shaders' MPSMatrixMultiplication to build the reduced system A^T A x = A^T b, but in another GitHub issue you recommended implementing a custom matrix multiplication shader.

I don't know if you can share your matrix multiplication version, but if you used any resources (blog, book), I'd appreciate them as a beginner. I'm either going to attempt a naive multiplication (which probably won't beat MPS) or adapt CUDA's matrixMul, since it's open source (https://github.com/NVIDIA/cuda-samples/blob/master/Samples/0_Introduction/matrixMul/matrixMul.cu).

    // super simple matrix mul (kernel signature and buffer indices are mine, added for completeness)
    kernel void naiveMatrixMul(device const float *A [[buffer(0)]],
                               device const float *B [[buffer(1)]],
                               device float       *X [[buffer(2)]],
                               constant uint &row_dim_x [[buffer(3)]],   // rows of X (and A)
                               constant uint &col_dim_x [[buffer(4)]],   // columns of X (and B)
                               constant uint &inner_dim [[buffer(5)]],   // columns of A == rows of B
                               uint2 id [[thread_position_in_grid]])
    {
        if ((id.x < col_dim_x) && (id.y < row_dim_x)) {
            // id.x is the column index of the result matrix.
            // id.y is the row index of the result matrix.
            const uint index = id.y * col_dim_x + id.x;
            float sum = 0.0f;
            for (uint k = 0; k < inner_dim; ++k) {
                // index_A corresponds to A[id.y, k]
                const uint index_A = id.y * inner_dim + k;

                // index_B corresponds to B[k, id.x]
                const uint index_B = k * col_dim_x + id.x;

                sum += A[index_A] * B[index_B];
            }
            X[index] = sum;
        }
    }
Lastly, I'm also wondering if you have tried Accelerate to do this on the CPU, since the maximum size would be 640x480 = 307,200 elements, which doesn't seem massive for CPU compute, though I understand there can be some overhead in moving data between the GPU and CPU.

Thank you again for your time and wisdom! More than happy to push a PR if my matrix reduction runs a bit faster than MPS :)

sjy234sjy234 commented 11 months ago

By calculating offline I merely meant that, if you want to figure out the transformations, you can export the values and work through them outside the app. The bin file is just for demonstration purposes rather than debugging. I have already released the debug shaders I wrote as I went step by step; they are collected in the debug directory. There are no very efficient debugging tools; usually I just render things out to see what happened.
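
A minimal example of that render-it-out idea (not one of the shaders in the debug directory; the kernel name and the range parameter are illustrative) is to map each vertex-map entry to a color and write it into a texture, so flipped signs or wrong scales show up at a glance:

    // Sketch: visualize a vertex map by mapping positions in [-range, range] to RGB.
    kernel void debugVertexMapToColor(device const float *vertexMap [[buffer(0)]],
                                      constant float &range [[buffer(1)]],
                                      texture2d<float, access::write> outColor [[texture(0)]],
                                      uint2 gid [[thread_position_in_grid]])
    {
        if (gid.x >= outColor.get_width() || gid.y >= outColor.get_height()) return;
        uint vid = (gid.y * outColor.get_width() + gid.x) * 3;
        float3 p = float3(vertexMap[vid], vertexMap[vid + 1], vertexMap[vid + 2]);
        float3 rgb = clamp(p / range * 0.5f + 0.5f, float3(0.0f), float3(1.0f));
        outColor.write(float4(rgb, 1.0f), gid);
    }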

The matrix multiplication can be split into two steps: multiplication element by element, then a sum. I just wrote two shaders to handle them, sacrificing memory for efficiency, since in this scenario the memory cost is tolerable.
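
As a rough illustration of that two-step idea (a sketch, not the repo's shaders; it assumes A is N x 6 and row-major, so A^T A and A^T b together have 42 entries, 36 + 6, to accumulate):

    // Step 1: write every per-row product. Output entry i < 36 maps to (A^T A)[i/6][i%6],
    // entries 36..41 map to (A^T b)[i-36]. Dispatch an N x 42 grid.
    kernel void icpProducts(device const float *A [[buffer(0)]],         // N x 6, row-major
                            device const float *b [[buffer(1)]],         // N x 1
                            device float *products [[buffer(2)]],        // 42 x N
                            constant uint &N [[buffer(3)]],
                            uint2 gid [[thread_position_in_grid]])       // x = row n, y = entry i
    {
        if (gid.x >= N || gid.y >= 42) return;
        uint n = gid.x, i = gid.y;
        float value = (i < 36) ? A[n * 6 + i / 6] * A[n * 6 + i % 6]
                               : A[n * 6 + (i - 36)] * b[n];
        products[i * N + n] = value;
    }

    // Step 2: sum each row of 'products' with one threadgroup per output entry.
    // Assumes threadsPerThreadgroup is a power of two and at most 256.
    kernel void icpSum(device const float *products [[buffer(0)]],
                       device float *sums [[buffer(1)]],                 // 42 results
                       constant uint &N [[buffer(2)]],
                       uint entry [[threadgroup_position_in_grid]],
                       uint lane [[thread_position_in_threadgroup]],
                       uint width [[threads_per_threadgroup]])
    {
        threadgroup float partial[256];
        float s = 0.0f;
        for (uint n = lane; n < N; n += width) { s += products[entry * N + n]; }
        partial[lane] = s;
        threadgroup_barrier(mem_flags::mem_threadgroup);
        for (uint stride = width / 2; stride > 0; stride /= 2) {
            if (lane < stride) { partial[lane] += partial[lane + stride]; }
            threadgroup_barrier(mem_flags::mem_threadgroup);
        }
        if (lane == 0) { sums[entry] = partial[0]; }
    }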