muskie82 / MonoGS

[CVPR'24 Highlight] Gaussian Splatting SLAM
https://rmurai.co.uk/projects/GaussianSplattingSLAM/

Tracking convergence criteria #98

Open leblond14u opened 1 month ago

leblond14u commented 1 month ago

Dear community, authors,

As explained in #93, I'm trying to use the MonoGS tracker to retrieve a camera pose.

In my particular case, I am training a Gaussian splatting scene and then trying to retrieve the pose between two close cameras belonging to my training dataset. To do so, I used the MonoGS rasterizer to get the Jacobian computation and the tracking function from MonoGS/utils/slam_frontend.py. I'm training the scene with a simple Gaussian splatting training loop and trying to optimize the pose at the end.

So far I'm unable to get a proper pose estimate with my experiment. I can't reach convergence with 100 gradient descent iterations, and when I do reach convergence with 2000 iterations, my result is really far from the goal pose. In fact, I even tend to see the estimate diverging from the goal pose...

As an example, I initialize my translation vector as [1.96003743, 0.59285834, -0.86803944], the goal is to estimate the translation vector [1.96642986, 0.66022297, -0.83239669], and after 100 iterations I get this estimate: [2.1265576, -0.12794144, -0.41376677].

I'm looking at the learning rates and the convergence threshold (update_pose(camera, converged_threshold=1e-4)) as possible ways to improve the estimates, though the learning rate is constant in all the provided config files...

Has anybody faced a similar problem, or could anyone guess what's causing the issue? Any ideas on how to set the converged_threshold value or the learning rates to fit my use case?

Thanks in advance, Best,

Hugo

Edit: I tried changing the learning rates, but it seems that the gradient descent simply isn't making progress... No matter what learning rate I use, I don't see the result converging towards the goal pose. I checked that the projection matrices and extrinsics are attached to the computational graph, and that doesn't seem to be the issue here :/
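
For reference, here is a stripped-down version of the loop I'm running (a simplified sketch: the loss is reduced to plain L1 for brevity, the learning rates are just the ones I'm currently using, and the import paths / render signature may differ slightly from your checkout of MonoGS):

```python
import torch

from gaussian_splatting.gaussian_renderer import render   # MonoGS rasterizer wrapper
from utils.pose_utils import update_pose                  # applies the SE(3) delta to the camera

# Assumes: camera (MonoGS Camera at the initial guess, with learnable
# cam_rot_delta / cam_trans_delta), gaussians (trained model),
# pipeline_params, background, gt_image.
rot_lr, trans_lr = 3e-3, 1e-3   # learning rates I'm currently using
pose_optimizer = torch.optim.Adam([
    {"params": [camera.cam_rot_delta], "lr": rot_lr},
    {"params": [camera.cam_trans_delta], "lr": trans_lr},
])

for it in range(100):
    pose_optimizer.zero_grad()
    render_pkg = render(camera, gaussians, pipeline_params, background)
    loss = torch.abs(render_pkg["render"] - gt_image).mean()   # plain L1 photometric loss
    loss.backward()
    with torch.no_grad():
        pose_optimizer.step()
        # update_pose composes the accumulated deltas into camera.R / camera.T,
        # resets them, and signals convergence when the update becomes tiny.
        converged = update_pose(camera, converged_threshold=1e-4)
    if converged:
        break
```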

YufengJin commented 1 month ago

Dear all,

I am also encountering unusual results in camera pose estimation. I replaced the differentiable rasterizer in the original GS work, pre-trained on the Blender dataset for 30,000 iterations, and then saved the Gaussians. Subsequently, I stopped optimizing the Gaussian model and focused solely on adjusting the camera pose. I introduced an error of +/- 0.2 m and +/- 5 degrees to the camera, yet the final camera pose merely fluctuates within a small range and fails to converge to the correct location. I use both RGB and depth in the loss. I am unsure whether the issue lies in the design of the supervision loss or in the gradient of the camera delta pose. Your insights would be highly appreciated.
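
For context, the perturbation I apply looks roughly like this (a NumPy/SciPy sketch; my actual code works directly on the 4x4 world-to-camera matrices saved with the pre-trained model, and perturb_pose is just an illustrative name):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def perturb_pose(T_cw: np.ndarray, trans_noise=0.2, rot_noise_deg=5.0, seed=0) -> np.ndarray:
    """Add bounded translation (+/- trans_noise meters) and rotation
    (+/- rot_noise_deg degrees) noise to a 4x4 camera pose."""
    rng = np.random.default_rng(seed)
    T = T_cw.copy()
    # uniform offset on each translation axis
    T[:3, 3] += rng.uniform(-trans_noise, trans_noise, size=3)
    # small rotation about a random axis, composed with the original rotation
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rng.uniform(-rot_noise_deg, rot_noise_deg))
    T[:3, :3] = R.from_rotvec(angle * axis).as_matrix() @ T[:3, :3]
    return T
```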

WFram commented 1 month ago

I've experienced issues similar to yours. By adding the gradient computation from diff-gaussian-rasterization-w-pose to GES, I've been trying to estimate the camera poses of my training dataset given ground-truth splats (exported as .ply from a GT dense point cloud) and ground-truth camera poses perturbed with noise of different magnitudes. The parameters of the splats were fixed.

I've observed that when optimizing both the rotational and translational components, I get divergence. Meanwhile, I steadily get convergence to the ground-truth poses (usually within a 1-5 cm margin) when NOT updating the rotational component. The magnitude of the added translational noise ranges from 5 to 50 cm. So, when using gradients to update the translation vector only, the poses converge in my tests. I'm using the original photometric loss from 3DGS (both L1 and SSIM, without outlier rejection by image gradients or opacity). Also, I use the lowest order of spherical harmonics so that the color doesn't depend on the rotation.

So, I think there is an issue with optimizing the rotational component. I've been investigating the gradients but haven't found them to be computed incorrectly. I wonder if anybody else experiences the same behavior.
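
For reproducibility: with the lowest SH order, the color I rasterize reduces to the DC coefficient only, so it cannot change with the viewing direction. A minimal sketch using the standard 3DGS SH constant (features_dc is assumed to be the (N, 3) tensor of DC coefficients):

```python
import torch

SH_C0 = 0.28209479177387814  # zeroth-order spherical harmonics basis constant (as in 3DGS)

def dc_color(features_dc: torch.Tensor) -> torch.Tensor:
    """View-independent RGB from degree-0 SH coefficients of shape (N, 3).
    Higher SH orders would add terms that depend on the viewing direction,
    and hence on the camera rotation being optimized."""
    return torch.clamp(SH_C0 * features_dc + 0.5, min=0.0)
```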

YufengJin commented 1 month ago

@WFram Thank you very much for sharing your insights. I conducted a similar test. My test involves using a pre-trained Gaussian model from Blender datasets (optimized for 30,000 iterations), then manually adding varying degrees of translation and rotation errors, and subsequently optimizing only the camera position. However, I found that the camera position is very sensitive to the image plane UV coordinates, and despite trying different learning rates, I have not succeeded in optimizing to the correct camera position. As you mentioned, there is indeed an issue with the camera's optimization on rotation. I suspect that even if the gradient direction is correct, the loss may not converge if there is an excessively large gradient scale in a particular direction. It is also possible that the loss function is inherently less sensitive to rotation.

I have some experience regarding differentiable rendering, and I wonder if you would be interested in discussing this further. Perhaps through our exchange of ideas, we might come up with some innovative solutions. Looking forward to your response.

WFram commented 1 month ago

@YufengJin Yes, it would be interesting to discuss this further. I also have some concerns about the sensitivity of the loss to the rotation matrix. In my last experiments, I used image pyramids to optimize camera poses in a coarse-to-fine manner, as in Direct Sparse Odometry (DSO). But I only used downscaling factors of 2 and 4, while in DSO the pyramid has 5 levels and the image at the coarsest level is downscaled by 16 w.r.t. the original image. It's better to start optimizing the rotation at the coarsest pyramid level due to the high non-linearity of the cost function. You could try experimenting with this.
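
Roughly, the schedule I mean looks like this (a sketch; image_pyramid and coarse_to_fine_tracking are illustrative names, and step_fn stands for one iteration of whatever pose-only tracking step you already have, which must also render at the reduced resolution, e.g. by scaling the intrinsics):

```python
import torch
import torch.nn.functional as F

def image_pyramid(image: torch.Tensor, factors=(4, 2, 1)):
    """Downscaled copies of a (3, H, W) image, coarsest first (factor 1 = full resolution)."""
    levels = []
    for f in factors:
        if f == 1:
            levels.append(image)
        else:
            levels.append(F.avg_pool2d(image.unsqueeze(0), kernel_size=f).squeeze(0))
    return levels

def coarse_to_fine_tracking(camera, gt_image, step_fn, iters_per_level=20, factors=(4, 2, 1)):
    """Run pose-only tracking from the coarsest pyramid level to full resolution.
    The smoother cost at coarse levels makes the rotation update less likely to diverge."""
    for factor, gt_level in zip(factors, image_pyramid(gt_image, factors)):
        for _ in range(iters_per_level):
            step_fn(camera, gt_level, factor)
```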

WFram commented 1 month ago

Also, I think different exposure times might impact the optimization. This isn't the case for me, since I've been using synthetic data with a fixed exposure time. In MonoGS they estimate affine brightness parameters, but these are only applied to the two frames involved in the loss. Before being projected to the image plane, the Gaussians are blended from a larger number of frames, so simply compensating brightness in the loss computation might not be enough. I think it's better to correct for exposure in a preprocessing step, or just fix the exposure when recording the image stream.
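
For clarity, by affine brightness parameters I mean a per-frame model of roughly this form (a sketch; the exact variable names in MonoGS may differ):

```python
import torch

def apply_affine_brightness(rendered: torch.Tensor,
                            exposure_a: torch.Tensor,
                            exposure_b: torch.Tensor) -> torch.Tensor:
    """Affine brightness compensation applied to the rendered image before the
    photometric loss: exp(a) keeps the gain positive, b is an additive offset."""
    return torch.exp(exposure_a) * rendered + exposure_b
```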

YufengJin commented 1 month ago

@WFram First of all, thank you for your suggestions; I will give them a try. Additionally, due to reflections and inconsistent lighting, the colors of the splats become inconsistent during training. Not using the SH color encoding to represent colors mitigates this somewhat. Regarding camera pose optimization, I have another idea: we could use multiple viewpoints to optimize the camera position.

leblond14u commented 1 month ago

@WFram Can I ask how you isolated the translation optimization from the rotational one?

I think my tests somewhat validate what you are experiencing: when the process converges (which doesn't happen often), I usually get a "rather OK" translational error and a huge rotational error. I'll run further tests to properly validate this on my side.

WFram commented 1 month ago

Can I ask how you isolated the translation optimization from the rotational one?

By setting the rotational state update, camera.cam_rot_delta.data, to zero right after pose_optimizer.step().

P.S.: If you haven't checked yet, make sure that your rotation is actually expressed in the camera frame (world-to-camera, Rcw); the rasterizer expects the pose in the camera frame. I'm not sure about 3DGS, but in GES I found that R in the Camera class ends up expressed in the world frame (camera-to-world, Rwc), while the translation T is expressed in the camera frame (world-to-camera, tcw).
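
In code, the translation-only step looks roughly like this (a sketch using the MonoGS-style per-frame deltas; update_pose is the same helper discussed above, and the import path may differ in your setup):

```python
import torch
from utils.pose_utils import update_pose  # MonoGS-style pose update; path may differ

def translation_only_step(camera, pose_optimizer, loss, converged_threshold=1e-4):
    """One tracking iteration where the rotational update is discarded, so that
    update_pose() applies only the translational correction to the camera."""
    loss.backward()
    with torch.no_grad():
        pose_optimizer.step()
        camera.cam_rot_delta.data.fill_(0)   # zero the rotation delta right after the step
        converged = update_pose(camera, converged_threshold=converged_threshold)
    pose_optimizer.zero_grad(set_to_none=True)
    return converged
```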

Il-castor commented 3 weeks ago

Hello! How do I train Gaussian splatting on my own scene? I have already written code to load my dataset and a YAML file. Which command did you use?

Thank you, any response would be very helpful.

leblond14u commented 2 weeks ago

@Il-castor You are on the wrong issue here. I can only encourage you to check out the readme and other issues on custom dataset loading :)