maturk / dn-splatter

DN-Splatter: Depth and Normal Priors for Gaussian Splatting and Meshing
https://maturk.github.io/dn-splatter/
Apache License 2.0
493 stars, 32 forks

How to ensure the consistency of Normal map obtained by prediction. #10

Open VictorStarkSnow opened 7 months ago

VictorStarkSnow commented 7 months ago

Hi author, thanks for your excellent work! I'm also interested in this area and have some questions about your work. I hope to get your reply, thanks a lot!

  1. How do you ensure the consistency of the predicted normal maps, given that you already align the depth maps with the SfM points?
  2. As far as I learned from 2DGS, normal map supervision seems to have more influence on the mesh than depth map supervision. Do you agree, and do you have any idea how to align normal maps?

maturk commented 7 months ago

Hi @VictorStarkSnow, could you clarify your questions a bit? Regarding normal supervision, there are two techniques that I have seen in the literature in the past few months. 1) Supervising with the gradient of depth maps. In the ideal scenario (i.e. perfect depth map renders), the gradient of the depth map (computed by e.g. finite differences, though there are other methods that may give smoother predictions), or more specifically the cross product of the gradients in the x and y directions, gives the direction of the surface normal. However, I experimented with this method in this project and noticed it can lead to quite noisy predictions, mainly because the rendered depths are noisy themselves and any derivatives amplify this.
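To make the depth-gradient idea concrete, here is a minimal NumPy sketch. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and a z-forward camera frame. The function name `normals_from_depth` is hypothetical and this is not the dn-splatter implementation. The depth map is back-projected to camera-space points, differentiated along the image x and y directions with finite differences, and the cross product of the two gradients is normalized.

```python
import numpy as np

def normals_from_depth(depth, fx, fy, cx, cy):
    """Estimate camera-frame normals from an (H, W) depth map.

    Illustrative sketch: assumes a pinhole camera with z pointing forward.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates

    # Back-project every pixel to a 3D point in the camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    points = np.stack([x, y, depth], axis=-1)  # (H, W, 3)

    # Finite differences of the point map along image x (columns) and y (rows).
    d_dx = np.gradient(points, axis=1)
    d_dy = np.gradient(points, axis=0)

    # The cross product of the two tangent vectors gives the normal direction.
    # Its sign (towards or away from the camera) is a convention choice;
    # negate the result if your pipeline expects the opposite orientation.
    n = np.cross(d_dx, d_dy)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(norm, 1e-8, None)
```

Because the result depends on differences between neighbouring depth values, per-pixel depth noise that is comparable to the spacing of the back-projected points can tilt the normals strongly, which is consistent with the noise issue described above.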

So the second method that I tried out is 2) supervision of normals with monocular normal estimation networks. The idea is very simple: use an off-the-shelf network to predict normals and supervise on these. They work great on indoor datasets, since most monocular networks are trained on large amounts of data, mainly from indoor room scans.

Regarding your second question, you are correct. Normal supervision helps in e.g. Poisson surface reconstruction, which also requires normals for meshing. I think there is definitely room for improvement in this direction as well, e.g. joint optimization of depths and normals (derived from depths), since at the end of the day the normal maps and depth maps are related to each other. As I mentioned before, in the ideal scenario you can obtain normal estimates from depth renders. I actually do this with the Replica dataset, which has "perfect" GT depth maps, and I generate normals from them. This way I could also compute metrics on how good my GS normal estimates were (with simple metrics like MSE or cosine similarity scores). This is hard to do with other datasets, since it is very difficult to get ground truth normals.
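As a rough illustration of the two metrics mentioned above, the sketch below compares a predicted normal map against a reference one. The function name `normal_metrics` and the optional `mask` argument are assumptions made for this example, not part of the dn-splatter code base. Both inputs are assumed to be `(H, W, 3)` arrays expressed in the same coordinate frame.

```python
import numpy as np

def normal_metrics(pred, gt, mask=None):
    """MSE and mean cosine similarity between two (H, W, 3) normal maps."""
    # Normalize both maps so the metrics compare directions only.
    pred = pred / np.clip(np.linalg.norm(pred, axis=-1, keepdims=True), 1e-8, None)
    gt = gt / np.clip(np.linalg.norm(gt, axis=-1, keepdims=True), 1e-8, None)

    # Optional boolean (H, W) mask, e.g. to skip pixels without valid depth.
    if mask is None:
        mask = np.ones(pred.shape[:2], dtype=bool)

    p, g = pred[mask], gt[mask]  # (N, 3) each
    mse = np.mean((p - g) ** 2)
    cosine = np.mean(np.sum(p * g, axis=-1))
    return {"mse": float(mse), "cosine_similarity": float(cosine)}
```

A cosine similarity of 1 means the directions agree everywhere, and -1 means they are flipped, which is a quick way to spot an orientation convention mismatch.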

VictorStarkSnow commented 7 months ago

Thanks for your kind reply!

  1. I observed the same phenomenon as you did (the cross product of the gradients in the x and y directions leads to noisy normal maps). So I chose DSINE to predict the normal maps, which gives better results than the gradient of the depth maps. However, when I use the "perfect" normal maps (obtained from DSINE) as supervision, the GS result becomes worse (more floaters), which I suspect is caused by inconsistency in the normal maps. So I wonder whether normal maps need to be aligned like the depth maps, beyond the coordinate system conversion?
  2. How much difference does using the aligned depth map instead of the raw depth map make to the final GS result (e.g. PSNR), given that your code provides both options?
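
For reference, the coordinate system conversion mentioned in point 1 can be sketched as follows. Monocular predictors typically output normals in the camera frame of each image, so comparing them across views requires rotating them into a shared world frame. The name `normals_cam_to_world` and the argument `R_c2w` (a 3x3 camera-to-world rotation) are assumptions for this example. Any axis flips between the predictor's camera convention and the pose convention (e.g. OpenCV vs. OpenGL) would have to be applied first and are not handled here.

```python
import numpy as np

def normals_cam_to_world(normals_cam, R_c2w):
    """Rotate (H, W, 3) camera-frame normals into the world frame.

    R_c2w is assumed to be a 3x3 camera-to-world rotation matrix that uses
    the same camera axis convention as the normal map.
    """
    # Normals are directions, so only the rotation applies (no translation).
    # For row vectors, n_world = R @ n_cam is written as n_cam @ R.T.
    return normals_cam @ R_c2w.T
```

If the same surface point is seen from two views, the rotated normals from both views should roughly agree; large disagreement is one way to check for the inconsistency suspected above.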

maturk commented 7 months ago

@VictorStarkSnow regarding monocular depth supervision, we used the ZoeDepth and Depth Anything metric depth estimators (not a part of the code base, but can be run from the official Depth Anything repo). These networks try to produce metric-scale depth estimates; however, there is often still a scale/shift difference between your metric poses (if you have real metric pose data) and the monocular depth estimates. The depth alignment tries to account for this as a per-frame alignment problem. The difference can be substantial, especially on difficult datasets. If you do not even have metric-scale poses, e.g. COLMAP poses with an arbitrary scale, then this alignment is critical for making use of metric depth estimators. However, it is also possible to use relative depth losses (not explored in this work), which do not require metrically correct depth estimates; this has been explored in prior NeRF literature and now in the 3DGS context as well.

Just as an anecdote, the general performance of metric depth estimators is surprisingly good; however, I am not up to date with the latest developments in this field. Using these priors and developments to enhance inverse-rendering problems (basically for free), as in the case of 3DGS, is what I tried to do with this project. There are still some difficulties, and my project is by no means close to a perfect solution.
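
One common way to pose the per-frame scale/shift alignment is as a small least-squares problem. The sketch below is a closed-form variant and not necessarily how dn-splatter solves it. The function name `align_depth` and its arguments are hypothetical. It assumes a sparse reference depth map (for example SfM points projected into the frame) and a boolean mask `valid` marking the pixels where that reference exists.

```python
import numpy as np

def align_depth(mono_depth, sparse_depth, valid):
    """Fit scale and shift so that scale * mono_depth + shift matches
    sparse_depth on the valid pixels, then apply them to the whole frame.
    """
    d = mono_depth[valid]    # monocular estimates at reference pixels
    z = sparse_depth[valid]  # reference depths (e.g. projected SfM points)

    # Solve min over (scale, shift) of || scale * d + shift - z ||^2.
    A = np.stack([d, np.ones_like(d)], axis=-1)  # (N, 2)
    (scale, shift), *_ = np.linalg.lstsq(A, z, rcond=None)

    return scale * mono_depth + shift, scale, shift
```

At least two valid pixels with distinct depths are needed for the fit to be well defined. With noisy SfM points, a robust variant (e.g. outlier rejection before the fit) is likely preferable to this plain least-squares version.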