nianticlabs / simplerecon

[ECCV 2022] SimpleRecon: 3D Reconstruction Without 3D Convolutions

curious about extrinsics matrix #7

Closed: MaybeOjbk closed this issue 1 year ago

MaybeOjbk commented 1 year ago

The extrinsic matrix is the transformation that maps points from the world coordinate system into the camera coordinate system. However, in cost_volume.py and geometry_utils.py the comments mix up the extrinsic matrix and the c2w matrix. For the Project3D class in geometry_utils.py you write "Projects spatial points in 3D world space to camera image space using the extrinsics matrix cam_T_world_b44 and intrinsics K_b44", and the code is "P_b44 = K_b44 @ cam_T_world_b44; cam_points_b3N = P_b44[:, :3] @ points_b4N". But cam_T_world_b44 is src_poses in cost_volume.py, which is src_cam_T_cur_cam in depth_model.py, and "src_cam_T_cur_cam = src_cam_T_world @ cur_world_T_cam.unsqueeze(1)".
Why isn't it "src_world_T_cam @ cur_cam_T_world.unsqueeze(1)" instead?
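
To make the convention I'm asking about concrete, here is a minimal sketch of the projection pattern I'm reading (hypothetical shapes and identity matrices, not the actual tensors): whatever is passed in the cam_T_world_b44 slot has to take the incoming points into the target camera's frame before the intrinsics are applied.

import torch

# Hypothetical sketch of a Project3D-style projection. a_T_b maps homogeneous
# points from frame b into frame a, so transforms compose right to left.
B, N = 2, 100
K_b44 = torch.eye(4).expand(B, 4, 4)              # 4x4 pinhole intrinsics
cam_T_world_b44 = torch.eye(4).expand(B, 4, 4)    # world -> camera (extrinsics)
points_b4N = torch.cat([torch.rand(B, 3, N), torch.ones(B, 1, N)], dim=1)

# project to the camera's image space
P_b44 = K_b44 @ cam_T_world_b44
cam_points_b3N = P_b44[:, :3] @ points_b4N
# perspective divide to get pixel coordinates
pix_b2N = cam_points_b3N[:, :2] / cam_points_b3N[:, 2:3].clamp(min=1e-8)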

mohammed-amr commented 1 year ago

Apologies for the confusion. src_poses in cost_volume.py should really be called src_extrinsics, and inv_src_poses should be src_poses. The naming error is an oversight on our part during the refactor. With this naming correction, the nomenclature should line up with the logic in the code. Do point out if we've missed anything!

The code logic flow is identical with and without the correction, as it's only a naming oversight.

I'll put in these changes today, and I'll ping here when I do so you can close this issue if resolved. Thanks!
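
To spell out the convention the rename restores, here is a minimal sketch (hypothetical variable names, not the repo's actual tensors): a pose is world_T_cam (camera to world) and an extrinsics matrix is cam_T_world (world to camera), i.e. the inverse of the pose.

import torch

# Hypothetical illustration of the naming convention after the fix:
# pose       = world_T_cam  (camera-to-world, c2w)
# extrinsics = cam_T_world  (world-to-camera, w2c) = inverse of the pose
src_poses_b44 = torch.eye(4).unsqueeze(0)          # one world_T_cam per source view
src_extrinsics_b44 = torch.inverse(src_poses_b44)  # one cam_T_world per source view

# composing the two recovers the identity
assert torch.allclose(src_extrinsics_b44 @ src_poses_b44,
                      torch.eye(4).unsqueeze(0), atol=1e-6)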

mohammed-amr commented 1 year ago

Hello hello, I've pushed a variable naming fix. Thanks again for pointing this out and sorry for the confusion!

Hopefully it should all be clear now. To confirm: cost_volume.py uses src_extrinsics to go directly from world_points, which are in the current camera's coordinate frame, to source points, and so it gets passed src_cam_T_cur_cam.
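
As a quick sanity sketch (hypothetical random poses, not the model's tensors): the single relative transform is exactly the composition of the explicit route through the world, so passing src_cam_T_cur_cam in the extrinsics slot takes current-camera points straight to the source frame.

import torch

def random_pose():
    # Hypothetical helper: a random rigid 4x4 world_T_cam transform.
    q, _ = torch.linalg.qr(torch.randn(3, 3))
    if torch.det(q) < 0:
        q[:, 0] = -q[:, 0]
    pose = torch.eye(4)
    pose[:3, :3] = q
    pose[:3, 3] = torch.randn(3)
    return pose

world_T_cur_cam = random_pose()                    # current camera pose
world_T_src_cam = random_pose()                    # source camera pose
src_cam_T_world = torch.inverse(world_T_src_cam)   # source extrinsics

# the single relative transform handed to the cost volume
src_cam_T_cur_cam = src_cam_T_world @ world_T_cur_cam

# explicit route: current cam -> world -> source cam, on some random points
points_cur_4N = torch.cat([torch.randn(3, 8), torch.ones(1, 8)], dim=0)
via_world = src_cam_T_world @ (world_T_cur_cam @ points_cur_4N)
direct = src_cam_T_cur_cam @ points_cur_4N
print(torch.allclose(via_world, direct, atol=1e-5))  # True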

MaybeOjbk commented 1 year ago

I've read your code; maybe I didn't explain my confusion well. What I want to say is that when we warp source images into the reference view we use inverse warping, which means the following (a short sketch of these steps follows the list):

  1. We back-project the pixels of the reference image into ref camera space using K_ref_inv and the depth in the ref view. These points are not yet in world space, so we need to move them from ref camera space into world space.
  2. We use ref_c2w (the inverse of the extrinsic matrix) to move these points into world space.
  3. We then move these points from world space into src camera space using src_w2c (the extrinsic matrix).
  4. We project these 3D camera-space points into 2D pixel space using K_src, which also means dividing by the third dimension of the points and normalizing to [-1, 1].
  5. We use F.grid_sample to sample features from the src features at these 2D pixel locations.
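
For concreteness, here is a minimal sketch of the pipeline I mean (hypothetical shapes, identity intrinsics and poses, so the warp here is effectively a no-op; not your implementation):

import torch
import torch.nn.functional as F

B, C, H, W = 1, 16, 48, 64
src_feats = torch.rand(B, C, H, W)
depth_ref = torch.full((B, 1, H, W), 2.0)      # depth predicted in the ref view

K_ref = torch.eye(4)                           # hypothetical ref intrinsics
K_src = torch.eye(4)                           # hypothetical src intrinsics
ref_c2w = torch.eye(4)                         # ref pose (camera to world)
src_w2c = torch.eye(4)                         # src extrinsics (world to camera)

# 1. back-project ref pixels with K_ref_inv and the ref depth
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")
pix_h = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)
cam_pts = torch.inverse(K_ref)[:3, :3] @ pix_h * depth_ref.reshape(1, -1)

# 2. + 3. ref camera space -> world space -> src camera space
cam_pts_h = torch.cat([cam_pts, torch.ones(1, H * W)], dim=0)
src_pts = (src_w2c @ ref_c2w @ cam_pts_h)[:3]

# 4. project with K_src, divide by the third coordinate, normalize to [-1, 1]
proj = K_src[:3, :3] @ src_pts
u = proj[0] / proj[2].clamp(min=1e-8)
v = proj[1] / proj[2].clamp(min=1e-8)
grid = torch.stack([2 * u / (W - 1) - 1, 2 * v / (H - 1) - 1], dim=-1)
grid = grid.reshape(1, H, W, 2)

# 5. sample source features at those pixel locations
warped = F.grid_sample(src_feats, grid, align_corners=True)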

However, in your cost_volume.py, the points at line 564 are the points from step 1 (in ref camera space) and the points at line 569 are the points from step 3 (in src camera space), so you never have world_points. That means your calculation of rays to world points for the current frame and the source frames is wrong. The correct method is to use points that are actually in world space, together with the w2c matrices, to calculate the ray directions for the ref and src views.

mohammed-amr commented 1 year ago

I think you may have misunderstood which coordinate frame these rays are supposed to be in. We've made the decision for all rays to be in the coordinate frame of the reference/current camera. The transforms have everything implicitly baked in, so there's no need for explicit transforms to the world and back.

Our goal isn't to compute these rays in the world coordinate frame, but rather in the coordinate frame of the reference camera. The latter is agnostic to whatever position and orientation the cameras have in the world for any particular scan. We care about each ray's direction, for each location in the cost volume, w.r.t. the reference camera, because ultimately that information, stripped of absolute world location and orientation, is what matters for matching. On top of this, keeping absolute world location in the rays might lead to overfitting, since ScanNet's training set only has around 1200 scans.

There's a comment about this in the code at lines 562-564:

            # backproject points at that depth plane to the world, where the 
            # world is really the current view.
            world_points_b4N = self.backprojector(depth_plane_b1hw, cur_invK)

These world_points_b4N are in the reference view's coordinate frame. The ray at the center of the image lines up with the z axis pointing outward from the current camera's coordinate frame.

cur_points_rays_B3hw[0,:,48,64] # roughly center
tensor([0.0077, 0.0018, 1.0000], device='cuda:0')

For the source frame rays, we use a relative pose, src_poses, to compute the rays to these so-called world_points_b4N (which are really in the coordinate frame of the reference camera) for each source frame.

That src_poses is cur_cam_T_src_cam, a transform from the source camera frame to the reference camera frame. For computing rays you want the pose matrix, world_T_cam; replace world with cur_cam and cam with src_cam and you have exactly that transform. The trip through the world between the source camera and the current camera is already baked in.
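
Here's a tiny sketch of that idea (hypothetical helper, not necessarily how get_camera_rays is implemented): given a pose a_T_b, the b camera's centre expressed in frame a is just its translation column, so the ray toward a point that is already expressed in frame a is the normalized difference.

import torch
import torch.nn.functional as F

def rays_from_pose(a_T_b_44, points_a_3N):
    # Hypothetical sketch: unit rays, expressed in frame a, from the b
    # camera's centre toward points already expressed in frame a.
    cam_center_a = a_T_b_44[:3, 3:4]               # b's centre written in frame a
    return F.normalize(points_a_3N - cam_center_a, dim=0)

# With a = current camera and b = a source camera, passing cur_cam_T_src_cam
# gives source-camera rays directly in the current frame; the round trip
# through the world is already folded into that single transform.
cur_cam_T_src_cam = torch.eye(4)                   # hypothetical relative pose
points_cur_3N = torch.randn(3, 5)
rays = rays_from_pose(cur_cam_T_src_cam, points_cur_3N)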

Don't take my word for it, let's try it empirically:

# world point we're debugging in the current cam coordinate frame:
world_points_B4N[0,:,0]
tensor([-0.1656, -0.1252,  0.2500,  1.0000], device='cuda:0')

# current camera ray in the current cam coordinate frame:
cur_points_rays_B3hw[0,:,0,0]
tensor([-0.5095, -0.3852,  0.7694], device='cuda:0')

# ray for this point for the first source frame, again in the current cam coordinate frame
src_points_rays_B3hw[0,:,0,0]
tensor([-0.0565, -0.3497,  0.9351], device='cuda:0')

Let's compute it the verbose, explicit way by first transferring the points to the source frame:

# world point we're debugging in the current cam coordinate frame:
world_points_B4N[0,:,0]
tensor([-0.1656, -0.1252,  0.2500,  1.0000], device='cuda:0')

# transferring it over to the absolute world coordinate frame using the raw current poses from the dataloader.
abs_world_points_B4N = actual_cur_poses_B44 @ world_points_B4N
abs_world_points_B4N[0,:,0]
tensor([-3.6673, -4.5011, -0.2396,  1.0000], device='cuda:0')

# get rays for these points from each src frame in the absolute world reference frame
naive_world_src_points_rays_B3N = get_camera_rays(
                                    actual_src_poses_B44,
                                    abs_world_points_B4N[:,:3,:],
                                    in_camera_frame=False
                                )
naive_world_src_points_rays_B3N[0,:,0]
tensor([ 0.7002,  0.4983, -0.5113], device='cuda:0')

# now transform these to the current camera frame so we know which way they're pointing in that coordinate frame, where all other points and rays should be.
naive_src_points_rays_B3N = actual_cur_extrinsics_B44[:,:3,:3] @ naive_world_src_points_rays_B3N
naive_src_points_rays_B3N[0,:,0]
tensor([-0.0565, -0.3497,  0.9351], device='cuda:0')

# this is the same as src_points_rays_B3hw[0,:,0,0] as computed in the current reference frame (like in the code, and here in the previous code block).

# computing the mean error between both:
torch.abs(naive_src_points_rays_B3hw - src_points_rays_B3hw).mean()
tensor(6.4854e-07, device='cuda:0') # well within tolerance

If you want to reproduce this yourself, pass actual_src_poses and actual_cur_poses along from the DepthModel's forward to build_cost_volume. Then put this at line 676:

            # expand the single current pose to one copy per source frame,
            # then invert to get the current camera's extrinsics
            actual_cur_poses_B44 = actual_cur_poses.expand(num_src_frames, 4, 4)
            actual_cur_extrinsics_B44 = torch.inverse(actual_cur_poses_B44)
            # lift the current-camera-frame points into the absolute world frame
            abs_world_points_B4N = actual_cur_poses_B44 @ world_points_B4N

            actual_src_poses_B44 = tensor_bM_to_B(actual_src_poses)
            actual_src_extrinsics_B44 = torch.inverse(actual_src_poses_B44)

            # rays from each source camera to those points, in absolute world coordinates
            naive_world_src_points_rays_B3N = get_camera_rays(
                                    actual_src_poses_B44,
                                    abs_world_points_B4N[:,:3,:],
                                    in_camera_frame=False
                                )

            # rotate those rays back into the current camera's coordinate frame
            naive_src_points_rays_B3N = actual_cur_extrinsics_B44[:,:3,:3] @ naive_world_src_points_rays_B3N

            naive_src_points_rays_B3hw = naive_src_points_rays_B3N.view(-1, 
                                                    3, 
                                                    self.matching_height, 
                                                    self.matching_width,
                                                )

We can further verify the rays are correct by checking that the angles between rays are the same in the absolute world coordinate system and in the current camera coordinate system:

# transfer rays over to the world from the coordinate frame of the current camera
abs_world_cur_rays_B3N = actual_cur_poses_B44[:,:3,:3] @ cur_points_rays_B3N
# compute angles in absolute world coordinate system
world_angles = F.cosine_similarity(
                        abs_world_cur_rays_B3N, 
                        naive_world_src_points_rays_B3N, 
                        dim=1, 
                        eps=1e-5
                    )
tensor([[ 0.8830,  0.8819,  0.8809,  ...,  0.9618,  0.9625,  0.9633],
        [ 0.8325,  0.8306,  0.8288,  ...,  0.9489,  0.9498,  0.9507],
        [ 0.7592,  0.7565,  0.7538,  ...,  0.9256,  0.9266,  0.9276],
        ...,
        [ 0.3736,  0.3697,  0.3659,  ...,  0.8522,  0.8538,  0.8553],
        [ 0.1498,  0.1472,  0.1448,  ...,  0.8292,  0.8310,  0.8328],
        [-0.0326, -0.0333, -0.0338,  ...,  0.8114,  0.8135,  0.8155]],
       device='cuda:0')

# this is the same as when we compute the angles as in the code where all the rays are in the current camera coordinate system.
rel_angles = F.cosine_similarity(
                        cur_points_rays_B3N, 
                        src_points_rays_B3N, 
                        dim=1, 
                        eps=1e-5
                    )
tensor([[ 0.8830,  0.8819,  0.8809,  ...,  0.9618,  0.9625,  0.9633],
        [ 0.8325,  0.8306,  0.8288,  ...,  0.9489,  0.9498,  0.9507],
        [ 0.7592,  0.7565,  0.7538,  ...,  0.9256,  0.9266,  0.9276],
        ...,
        [ 0.3736,  0.3697,  0.3659,  ...,  0.8522,  0.8538,  0.8553],
        [ 0.1498,  0.1472,  0.1448,  ...,  0.8292,  0.8310,  0.8328],
        [-0.0326, -0.0333, -0.0338,  ...,  0.8114,  0.8135,  0.8155]],
       device='cuda:0')

# verify with diff
torch.abs(world_angles - rel_angles).mean()
tensor(5.7547e-07, device='cuda:0') # well within tolerance

I hope this answers your question and clears up any confusion.