numerical stability problem

nerfstudio-project / gsplat

CUDA accelerated rasterization of gaussian splatting

https://docs.gsplat.studio/

Apache License 2.0

numerical stability problem #248

Open wuzirui opened 3 days ago

wuzirui commented 3 days ago

Hi! I've recently run into a numerical stability problem with the rasterization kernel. I've checked the inputs to the rasterization call, and none of the tensors contain NaN, but when I compute the gradient of render_outputs or alpha with respect to the Gaussian parameters (e.g. scales), NaN values appear.

Could you please give me some hints on which part of the code may be numerically unstable so I can come up with a workaround, or on where I should start in order to fix the problem? Thanks a lot!

maturk commented 3 days ago

Hi @wuzirui, one thing that comes to mind: are you certain that all Gaussians are visible from your current viewmat? If Gaussians are not visible (meaning they do not land within the camera height/width after projection), they are not retained in memory and no gradients flow to them.

wuzirui commented 3 days ago

Hi @maturk! Thanks for the reply, but if this were the case, wouldn't the gradients for those Gaussians be zero rather than NaN? With my data, training proceeds for several hundred steps and then crashes (all Gaussians turn into NaNs within 1~2 steps), so I guess visibility issues are unlikely to be the cause?

maturk commented 3 days ago

Hmm, yes, that is strange. What kind of dataset are you trying to train on? I am wondering if there are any problems with projection. @wuzirui, you can check info["radii"] <= 0 and see whether the Gaussians with a radius of 0 are the same ones that have NaN gradients.
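A minimal sketch of that comparison, for illustration only. The helper name `compare_radii_and_nan_grads` is hypothetical, and it assumes a single camera so that the radii flatten to one value per Gaussian, with the gradient tensor holding one row per Gaussian. The tensors below are synthetic stand-ins, not real kernel outputs.

```python
import torch


def compare_radii_and_nan_grads(radii: torch.Tensor, grad: torch.Tensor) -> dict:
    """Count how culled Gaussians (radius <= 0) overlap with NaN gradients.

    Assumption: `radii` flattens to N entries and `grad` has N rows.
    """
    # Gaussians that did not survive projection have a non-positive radius.
    culled = radii.reshape(-1) <= 0
    # A Gaussian is flagged if any component of its gradient is NaN.
    nan_grad = torch.isnan(grad).reshape(grad.shape[0], -1).any(dim=1)
    return {
        "culled": int(culled.sum()),
        "nan_grad": int(nan_grad.sum()),
        "both": int((culled & nan_grad).sum()),
        "nan_but_visible": int((~culled & nan_grad).sum()),
    }


# Synthetic example: 4 Gaussians, two culled, two with NaN gradients.
radii = torch.tensor([0, 5, 3, 0])
grad = torch.zeros(4, 3)
grad[0, 1] = float("nan")  # culled Gaussian with a NaN gradient
grad[2, 0] = float("nan")  # visible Gaussian with a NaN gradient

report = compare_radii_and_nan_grads(radii, grad)
print(report)
```

If `nan_but_visible` is non-zero, the NaNs cannot be explained by culling alone, and the projection or rasterization math for visible Gaussians is the next place to look.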

liruilong940607 commented 3 days ago

If the scales are very small, it could be numerically unstable. Maybe try adding a small eps to the scales?

scales = torch.exp(scales_crop) + 1e-4
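To illustrate what the eps does, here is a small sketch. The wrapper name `activate_scales` and the sample log-scale values are made up for the example; only the `torch.exp(...) + 1e-4` expression comes from the line above.

```python
import torch


def activate_scales(log_scales: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Map log-scales to positive scales with a floor of roughly `eps`."""
    return torch.exp(log_scales) + eps


# Hypothetical log-scales, including one that would give a near-zero scale.
scales_crop = torch.tensor([-30.0, -6.0, 0.0], requires_grad=True)

scales = activate_scales(scales_crop)
scales.sum().backward()

print("scales:", scales.detach())
print("grad wrt log-scales:", scales_crop.grad)
```

The smallest scale stays close to the eps instead of collapsing towards zero, so anything downstream that divides by a scale or inverts a covariance sees a bounded value. The gradient with respect to the log-scale is still `exp(log_scale)`, so it remains finite.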

wuzirui commented 3 days ago

> Hmm, yes, that is strange. What kind of dataset are you trying to train on? I am wondering if there are any problems with projection. @wuzirui, you can check info["radii"] <= 0 and see whether the Gaussians with a radius of 0 are the same ones that have NaN gradients.

Thanks! But all the Gaussians have positive or zero radii and none of them are negative, so I guess this is not the case?

wuzirui commented 3 days ago

> If the scales are very small, it could be numerically unstable. Maybe try adding a small eps to the scales?
> scales = torch.exp(scales_crop) + 1e-4

Hi @liruilong940607, thanks for the comment! The minimum scale input to the kernel is 0.0024 (shown below), so I guess this is also not the case? I've also pasted the ranges of the other input tensors, FYI. Thanks a lot!
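For anyone who wants to run a similar range check, here is a minimal sketch. The helper name `summarize`, the tensor names, and the values are all hypothetical. It reports finiteness as well as min/max, because a NaN-only check does not catch `inf`.

```python
import torch


def summarize(t: torch.Tensor) -> dict:
    """Return the min, max, and finiteness of a tensor."""
    t = t.detach().float()
    return {
        "min": t.min().item(),
        "max": t.max().item(),
        # isfinite is False for NaN as well as +inf / -inf.
        "all_finite": bool(torch.isfinite(t).all()),
    }


# Hypothetical kernel inputs; the inf is planted to show what isnan misses.
inputs = {
    "scales": torch.tensor([0.0024, 0.1, 0.5]),
    "opacities": torch.tensor([0.1, 0.9, float("inf")]),
}

report = {name: summarize(t) for name, t in inputs.items()}
for name, stats in report.items():
    print(name, stats)
```

Here `opacities` passes a `torch.isnan` check but fails `all_finite`, which is the kind of input that can turn into NaN once it is multiplied by zero or subtracted from another infinity in the backward pass.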

quan5609 commented 1 day ago

Hi @wuzirui, did you figure out your problem?

wuzirui commented 19 hours ago

> Hi @wuzirui, did you figure out your problem?

Not yet. I may need to look into the kernel later.