nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

Camera Optimizer has different pose adjustments when applied to Raybundle and Camera. #3073

Closed nexuslrf closed 4 months ago

nexuslrf commented 6 months ago

Hi developers,

I found a fairly tricky computational inconsistency in camera_optimizer's apply_to_* functions:

https://github.com/nerfstudio-project/nerfstudio/blob/babf57718c75e7d04f1b977f035db67c02a9c20a/nerfstudio/cameras/camera_optimizers.py#L148-L168

The rotation and translation applied by the two functions above are completely different. Assume the original camera-to-world pose is [R, t] and the pose adjustment in CameraOptimizer is [dR, dt]: apply_to_raybundle produces (dR @ R, t + dt), while apply_to_camera right-multiplies the pose, producing (R @ dR, R @ dt + t).
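To make the mismatch concrete, here is a minimal numpy sketch (illustrative poses only, not nerfstudio code) comparing the two compositions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation():
    # QR of a random Gaussian matrix gives an orthogonal matrix;
    # flip the sign if needed so the determinant is +1.
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.linalg.det(q))

R, dR = random_rotation(), random_rotation()
t, dt = rng.normal(size=3), rng.normal(size=3)

# apply_to_raybundle-style: rotate about the camera origin, translate in world.
R_ray, t_ray = dR @ R, t + dt
# apply_to_camera-style: right-multiply the camera-to-world pose.
R_cam, t_cam = R @ dR, R @ dt + t

# In general the two results disagree for the same (dR, dt).
print(np.allclose(R_ray, R_cam), np.allclose(t_ray, t_cam))
```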

I initially wanted to evaluate the quality of camera optimization of a trained NeRF model by applying the apply_to_camera function, but ran into this computation mismatch. I guess these functions would also be confusing for other users (for sure 🤣).

It would be better in a future version to add some cautionary notes and an option to switch between the different forms of pose adjustment. For example:

# raybundle_cam: bool — selects which pose-adjustment convention to apply
if raybundle_cam:
    # Match apply_to_raybundle: left-multiply the rotation and add the
    # translation delta in world coordinates.
    R = torch.bmm(adj[:, :3, :3], camera.camera_to_worlds[:, :3, :3])
    t = adj[:, :3, 3:] + camera.camera_to_worlds[:, :3, 3:]
    camera.camera_to_worlds = torch.cat([R, t], dim=-1)
else:
    # Match the current apply_to_camera: right-multiply the full pose by the
    # homogeneous adjustment. Expand the [0, 0, 0, 1] row to the batch size
    # so torch.cat also works for more than one camera.
    bottom = torch.tensor([0.0, 0.0, 0.0, 1.0]).to(adj).expand(adj.shape[0], 1, 4)
    adj = torch.cat([adj, bottom], dim=1)
    camera.camera_to_worlds = torch.bmm(camera.camera_to_worlds, adj)
jb-ye commented 5 months ago

Thanks for sharing this. I think the correction transform (dR, dt) should be multiplied onto the T_camera_to_world transform from the left, giving (dR @ R, dR @ t + dt). This is neither what was implemented in apply_to_raybundle nor in apply_to_camera.
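In homogeneous coordinates this left-multiplication is a single 4x4 product; a quick numpy check (illustrative only) that the left-multiplied correction equals (dR @ R, dR @ t + dt):

```python
import numpy as np

def to_homogeneous(R, t):
    # Pack a rotation and translation into a 4x4 rigid transform.
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def random_rotation(rng):
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.linalg.det(q))  # ensure det = +1

rng = np.random.default_rng(1)
R, dR = random_rotation(rng), random_rotation(rng)
t, dt = rng.normal(size=3), rng.normal(size=3)

# Left-multiply the correction onto the camera-to-world pose.
T_corrected = to_homogeneous(dR, dt) @ to_homogeneous(R, t)
assert np.allclose(T_corrected[:3, :3], dR @ R)
assert np.allclose(T_corrected[:3, 3], dR @ t + dt)
```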

Think about the case where we have a pose estimate of a sensor rig that contains two cameras. The two cameras have their own pose estimates subject to their fixed extrinsics T_cam1_to_cam2, such that T_cam2_to_world * T_cam1_to_cam2 = T_cam1_to_world. A shared correction transform for the pose estimate of this sensor rig must be multiplied from the left onto each of its cameras.
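The rig argument above follows from associativity and can be sketched directly (hypothetical transforms, numpy only):

```python
import numpy as np

def random_se3(rng):
    # A random rigid transform as a 4x4 homogeneous matrix.
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    T = np.eye(4)
    T[:3, :3] = q * np.sign(np.linalg.det(q))
    T[:3, 3] = rng.normal(size=3)
    return T

rng = np.random.default_rng(2)
T_cam2_to_world = random_se3(rng)
T_cam1_to_cam2 = random_se3(rng)   # fixed extrinsic between the two cameras
C = random_se3(rng)                # shared correction for the whole rig
T_cam1_to_world = T_cam2_to_world @ T_cam1_to_cam2

# Correcting cam2's pose from the left and re-deriving cam1's pose gives the
# same result as correcting cam1's pose from the left directly.
lhs = (C @ T_cam2_to_world) @ T_cam1_to_cam2
rhs = C @ T_cam1_to_world
assert np.allclose(lhs, rhs)

# A shared right-multiplied correction would instead perturb each camera in
# its own local frame, breaking the fixed extrinsic between them.
```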

@oseiskar Just want to learn your thoughts on this?

oseiskar commented 5 months ago

@jb-ye I did not actually check this in detail in https://github.com/nerfstudio-project/nerfstudio/pull/2891/commits/e8e05ff02f1c604124a0373afcca7691f2b88546, and just kept the multiplication order from the original 3DGS PR.

I agree with you that it would be better to multiply camera-to-world from the left, since that would also work for multi-camera rigs: if you have a single adjustment matrix for all cameras in the rig, multiplying with it from the left would adjust the pose of the rig correctly.

However, as long as there are no such rig constraints and all cameras are optimized independently, all of the versions discussed in this thread are OK. They are just different ways to parametrize adjustments to an SE(3) pose.
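This equivalence can be checked directly: with independent per-camera parameters, each form can reach any target pose, they just solve for a different (dR, dt). A numpy sketch (illustrative):

```python
import numpy as np

def random_rotation(rng):
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.linalg.det(q))

rng = np.random.default_rng(3)
R, t = random_rotation(rng), rng.normal(size=3)              # original pose
R_target, t_target = random_rotation(rng), rng.normal(size=3)  # desired pose

# Left-multiplied form (dR @ R, dR @ t + dt):
dR = R_target @ R.T
dt = t_target - dR @ t
assert np.allclose(dR @ R, R_target) and np.allclose(dR @ t + dt, t_target)

# Mixed form (dR @ R, t + dt), as in apply_to_raybundle:
dR = R_target @ R.T
dt = t_target - t
assert np.allclose(dR @ R, R_target) and np.allclose(t + dt, t_target)

# Right-multiplied form (R @ dR, R @ dt + t), as in apply_to_camera:
dR = R.T @ R_target
dt = R.T @ (t_target - t)
assert np.allclose(R @ dR, R_target) and np.allclose(R @ dt + t, t_target)
```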

The main problem in the current version seems to be that the NeRF (ray bundle) and 3DGS use different interpretations for the parameters, which is not nice since it breaks in cases like @nexuslrf 's test.

It might be safest to fix the 3DGS version to match the NeRF one, since at least for 3DGS all versions should be fine, whereas for NeRFs there could be more performance implications (just guessing here, though). If support for multi-camera setups is added, then both versions should probably be fixed to use the multiply-from-the-left form.

brentyi commented 5 months ago

Thanks everyone for the discussion! Some quick thoughts.

The reason for the original apply_to_raybundle() implementation was likely runtime: reconstruction quality for all of these implementations (left-multiply, right-multiply, the current behavior) should be basically the same, but (dR @ R, t + dt) (what's implemented) has one less matrix multiplication than (dR @ R, dR @ t + dt). I recall some tests being run and the overhead being nontrivial.

Assuming this is true, one suggestion is to keep the current behavior for NeRFs, make splatfacto match it for consistency, and document it better: it's a separate rotation around the camera-frame origin + translation in world coordinates.[^1]

[^1]: Edit: on a second read this is exactly what @oseiskar suggested in the comment above. sorry for not reading more thoroughly!

> Think about the case when we have pose estimate of a sensor rig that contains two cameras. The two cameras have their own pose estimates subject to their fixed extrinsics T_cam1_to_cam2, such that T_cam2_to_world * T_cam1_to_cam2 = T_cam1_to_world. A shared correction transform to the pose estimate of this sensor rig must be multiplied from the left to each of its cameras.

If we want to switch to a single rigid transform, between left- vs. right-multiplied, my very weak personal bias is actually for right-multiplied, mostly because it seems to be the standard in packages like Ceres and gtsam. I don't find the multi-sensor rig case super convincing, since (a) you have to write custom code anyway and (b) right-multiplication is more convenient for the complementary case where you have good global rig poses but don't trust per-sensor calibrations relative to the rig.

oseiskar commented 5 months ago

> If we want to switch to a single rigid transform, between left- vs right- multiplied my very weak personal bias is actually for right-multiplied, mostly because it seems like the standard in packages like Ceres and gtsam. I don't find the multi-sensor rig case super convincing, since (a) you have to write custom code anyways and (b) right-multiplication is more convenient for the complementary case where you have good global rig poses but don't trust per-sensor calibrations relative to the rig.

I would like to add that being able to handle all relevant cases with a single rigid transform, either left- or right-multiplied, is not a very sustainable goal. In a system that can deal with cases such as multi-camera rigs and possibly extrinsic-calibration correction, you will end up simultaneously transforming the poses from both the left (tuning the pose of the rig) and the right (fixed or adjustable transforms within the local rig coordinates), and some of these transforms will be shared between multiple cameras.
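That "both sides at once" situation can be written as a chain: a world-frame correction on the left, the rig pose in the middle, and a per-camera extrinsic on the right. A numpy sketch with hypothetical names:

```python
import numpy as np

def random_se3(rng):
    # A random 4x4 rigid transform.
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    T = np.eye(4)
    T[:3, :3] = q * np.sign(np.linalg.det(q))
    T[:3, 3] = rng.normal(size=3)
    return T

rng = np.random.default_rng(4)
# Shared across all cameras: a world-frame correction (left) and the rig pose.
world_correction, T_rig_to_world = random_se3(rng), random_se3(rng)
# Per-camera: a fixed or separately adjustable extrinsic within the rig (right).
T_cam_to_rig = random_se3(rng)

# Both kinds of adjustment act on the same camera pose simultaneously.
T_cam_to_world = world_correction @ T_rig_to_world @ T_cam_to_rig

# The composition is still a valid rigid transform (orthonormal rotation part).
R_part = T_cam_to_world[:3, :3]
assert np.allclose(R_part @ R_part.T, np.eye(3))
```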

... and, without multi-camera support, it does not really matter, so I also support this:

> Assuming this is true, one suggestion is to keep the current behavior for NeRFs, but match the splatfacto behavior for consistency and document it better: it's a separate rotation around the camera frame origin + translation in world coordinates.