How the camera pose is initialized and optimized

muskie82 / MonoGS

[CVPR'24 Highlight & Best Demo Award] Gaussian Splatting SLAM

https://rmurai.co.uk/projects/GaussianSplattingSLAM/


How the camera pose is initialized and optimized #27

Closed QZH-00 closed 5 months ago

QZH-00 commented 5 months ago

Thanks for this great work! While reading the code, I have been trying to figure out how the camera pose is initialized and optimized. The only initialization-related code I found is the following, which uses the dataset's ground truth value: viewpoint = Camera.init_from_dataset(self.dataset, cur_frame_idx, projection_matrix) and viewpoint.update_RT(viewpoint.R_gt, viewpoint.T_gt). So I'm confused: why is the ground truth used for initialization when the camera gradient is still calculated and the pose is still optimized afterwards? Isn't the ground truth already the optimal value for the system? This may be a stupid question, but I would appreciate an explanation. Thanks!

rmurai0610 commented 5 months ago

Hi,

The initialize function is only called to initialise the SLAM pipeline (and again when a reset is triggered), not for every frame.

So only the first frame’s camera pose is initialised at the ground truth (to make it easier to visualise against the GT poses). You can also initialise it anywhere, really (e.g. at identity).

All other frames are initialised around the previous frame’s pose, as you can see in the tracking function: https://github.com/muskie82/MonoGS/blob/main/utils/slam_frontend.py#L130

init_from_dataset only sets the estimated camera pose to identity. The GT pose is also stored, but only for convenience (for evaluation etc.).
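The initialisation scheme described in this reply can be sketched roughly as follows. This is a minimal illustration, not the MonoGS code: the Frame class and the initial_pose_guess function are hypothetical names invented for the example, and only update_RT, R_gt and T_gt mirror identifiers quoted in the thread.

```python
import numpy as np


class Frame:
    """Hypothetical stand-in for a camera frame (not the MonoGS Camera class)."""

    def __init__(self, R_gt, T_gt):
        # Ground truth is stored for convenience (e.g. evaluation) only.
        self.R_gt, self.T_gt = R_gt, T_gt
        # The estimated pose starts at identity.
        self.R, self.T = np.eye(3), np.zeros(3)

    def update_RT(self, R, T):
        self.R, self.T = R.copy(), T.copy()


def initial_pose_guess(frames, idx):
    """Set the starting pose of frames[idx] before tracking refines it."""
    cur = frames[idx]
    if idx == 0:
        # First frame: anchored at ground truth so the estimated trajectory
        # is easy to compare against GT. Identity would work as well.
        cur.update_RT(cur.R_gt, cur.T_gt)
    else:
        # Later frames: start from the previous frame's *estimated* pose.
        prev = frames[idx - 1]
        cur.update_RT(prev.R, prev.T)
    return cur
```

In this sketch the ground truth of every frame after the first is never read when choosing the starting pose, which is why the pose still has to be optimised during tracking.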

Hope this clarifies your question!

QZH-00 commented 5 months ago

Thank you very much for your reply! I have one more question. We consider the ground truth to be the best pose estimate, so is the ground truth also the best choice for optimizing the 3DGS parameters and rendering images? To put it another way, if I set the pose to the ground truth (or close to it) for every frame, will I get better rendering results? In my experiment, I found that when the ATE RMSE is extremely small, the PSNR actually becomes lower. Why is that?

[Attached images: evo_2dplot_final; screenshot from 2024-03-19 21-56-26]

Given your greater experience and knowledge, I'd appreciate a reply!

rmurai0610 commented 5 months ago

It depends on many factors, but real-world ground truth poses may not be the best for reconstruction, since they are not aligned pixel-perfectly.

Optimising the camera poses adds additional slack to the system, where some real world imperfections can be explained by moving the camera poses slightly.

If you try using GT poses for Replica, which is a synthetic dataset, I expect all metrics to be better. Otherwise, you can run SfM such as COLMAP, just like the original 3D Gaussian Splatting.
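A toy 1D example can make the "slack" argument concrete. It is only an illustration under made-up assumptions: the scene is a sine pattern, the camera pose is a single offset t, and the "ground truth" pose is assumed to be off by 0.03 from the pose the image was really captured at. None of the names or numbers come from MonoGS.

```python
import numpy as np


def render(x, t):
    """Toy 1D 'image': intensities in [0, 1] seen from camera offset t."""
    return 0.5 + 0.5 * np.sin(x + t)


def psnr(img, ref, peak=1.0):
    mse = np.mean((img - ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)


def refine_offset(x, observed, t_init, lr=2.0, steps=100):
    """Gradient descent on the mean squared photometric error w.r.t. t."""
    t = t_init
    for _ in range(steps):
        residual = render(x, t) - observed
        d_render_dt = 0.5 * np.cos(x + t)  # analytic derivative of render
        grad = np.mean(2.0 * residual * d_render_dt)
        t -= lr * grad
    return t


rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 200)
t_true = 0.30  # pose the image was actually captured from
t_gt = 0.33    # "ground truth" with a small misalignment (assumed)
observed = render(x, t_true) + rng.normal(0.0, 1e-3, x.shape)

t_opt = refine_offset(x, observed, t_init=t_gt)
psnr_gt = psnr(render(x, t_gt), observed)
psnr_opt = psnr(render(x, t_opt), observed)
print(f"PSNR with fixed GT pose: {psnr_gt:.1f} dB")
print(f"PSNR with refined pose:  {psnr_opt:.1f} dB (t = {t_opt:.3f})")
```

Measured against t_gt, the refined offset has a larger trajectory error than simply keeping t_gt, yet it explains the image better. In this toy setting a lower pose error relative to an imperfect ground truth therefore goes together with a lower PSNR.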

Hope this makes sense!

QZH-00 commented 5 months ago

Thank you for your reply! When I ran more experiments on the EuRoC dataset, I found a big difference in performance between monocular and stereo. The point cloud in monocular mode is very messy, while the point cloud in stereo mode is very regular (similar to the point cloud of classical methods like DSO). So, does this mean that using a direct-method VO to obtain the pose and the point cloud as an initialisation for 3DGS is a direction worth trying? For classical monocular VO, will the lack of scale and accurate depth have a bad effect on the initialisation? I look forward to your reply! @muskie82 @rmurai0610 Thanks!

muskie82 commented 5 months ago

Hi,

Yes, you can bootstrap our system with an external tracker/VO's poses or initial points. The focus of our work is a purely 3DGS-based SLAM system that addresses 3DGS's intrinsic properties for the camera localisation task, but in practice you can use external pose/depth priors for an immediate performance boost.
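One way to picture such bootstrapping is to replace the "start from the previous pose" guess with a guess composed from an external tracker's relative motion. The sketch below is hypothetical: initial_guess and its arguments are not part of MonoGS, poses are assumed to be 4x4 world-to-camera matrices, and the external motion is assumed to map the previous camera frame to the current one. Note that a monocular tracker only recovers translation up to scale, so its relative motion would first have to be expressed in the map's scale.

```python
import numpy as np


def initial_guess(T_prev, T_rel_external=None):
    """Return a starting world-to-camera pose (4x4) for a new frame.

    T_prev:          previous frame's estimated world-to-camera pose.
    T_rel_external:  optional relative motion from an external tracker,
                     mapping the previous camera frame to the current one
                     (assumed convention for this sketch).
    """
    if T_rel_external is None:
        # No external prior: fall back to the previous frame's pose.
        return T_prev.copy()
    # Compose: world -> previous camera -> current camera.
    return T_rel_external @ T_prev


# Usage: previous pose plus a small forward motion reported by the tracker.
T_prev = np.eye(4)
T_prev[:3, 3] = [1.0, 0.0, 0.0]
T_rel = np.eye(4)
T_rel[:3, 3] = [0.0, 0.0, 0.5]
T_init = initial_guess(T_prev, T_rel)
```

The result would still be treated as a starting point only, and the pose optimisation would refine it afterwards.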

Since the initial question of this issue is solved, I will close this issue.