nianticlabs / monodepth2

[ICCV 2019] Monocular depth estimation from a single image

Questions about the meaning of grid in the F.grid_sample function #488

Open pnpmpnp opened 10 months ago

pnpmpnp commented 10 months ago

First of all, thank you very much for sharing such high-quality research. I am very impressed with the concept and implementation details behind Monodepth2.

However, there is one instance in your code that I don't quite understand.

For backward warping, my understanding is as follows: the depth map predicted for the target view is used to project each target pixel into the coordinates of the source view, the source image is sampled at those projected coordinates, and the loss is computed between the reconstructed target view and the original target view. I understand that this is the reason for using backward warping (I also read the other comments and found them very helpful).
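To make the projection step concrete, here is a minimal pure-Python sketch for a single pixel under a pinhole camera model. It is only an illustration of my understanding, not the monodepth2 code: `project_to_source` is a hypothetical name, the intrinsics are passed as a plain `(fx, fy, cx, cy)` tuple, and the pose `(R, t)` is assumed to map points from the target camera frame to the source camera frame.

```python
def project_to_source(x, y, depth, K, R, t):
    """Map target pixel (x, y) with the given depth to source pixel (u, v).

    K = (fx, fy, cx, cy) are assumed pinhole intrinsics.
    R (3x3 nested list) and t (length 3) are assumed to take points
    from the target camera frame to the source camera frame.
    """
    fx, fy, cx, cy = K

    # 1. Back-project the target pixel to a 3D point in the target frame.
    X = (x - cx) / fx * depth
    Y = (y - cy) / fy * depth
    Z = depth

    # 2. Move the 3D point into the source camera frame.
    Xs = R[0][0] * X + R[0][1] * Y + R[0][2] * Z + t[0]
    Ys = R[1][0] * X + R[1][1] * Y + R[1][2] * Z + t[1]
    Zs = R[2][0] * X + R[2][1] * Y + R[2][2] * Z + t[2]

    # 3. Project the point onto the source image plane.
    u = fx * Xs / Zs + cx
    v = fy * Ys / Zs + cy
    return u, v


K = (100.0, 100.0, 50.0, 50.0)
identity_rotation = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# With no relative motion the pixel maps to itself.
same = project_to_source(60.0, 40.0, 2.0, K, identity_rotation, (0.0, 0.0, 0.0))

# A sideways translation shifts the pixel by roughly fx * tx / depth.
moved = project_to_source(60.0, 40.0, 2.0, K, identity_rotation, (0.1, 0.0, 0.0))
```

In this sketch the result is indexed by the target pixel but expressed in source-image coordinates: it says where in the source image to look for the content of target pixel `(x, y)`.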

What I'm wondering is what the grid passed to the F.grid_sample function means in this case. The warp assumes that each pixel of the target view appears at a particular position in the source view, and I think it is the GT pose that ensures this. Whether the warp uses a GT pose or the optimal pose obtained after training to go from target -> source, what exactly do the resulting coordinates mean? If they are plain source-image coordinates, it seems a bit awkward that passing the source image through F.grid_sample with them produces the target view.

Is it possible to understand the coordinates produced by going from target view -> source view as an offset that tells each source-view pixel where it should go in order to correspond to the reference view?
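For reference, F.grid_sample is documented as filling each output location by reading the input at the position stored in the grid for that output location, with coordinates normalized to [-1, 1]. The pure-Python sketch below imitates that behaviour on a tiny single-channel image so the question can be checked by hand. It is a stand-in, not the PyTorch implementation: `bilinear_sample`, `backward_warp` and `make_grid` are hypothetical names, and bilinear interpolation, zero padding and the `align_corners=True`-style normalization are choices made for this sketch.

```python
import math


def bilinear_sample(img, u, v):
    """Read img at continuous pixel position (u, v); zero outside the image."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(u), math.floor(v)
    total = 0.0
    for yy, xx in ((y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)):
        weight = (1 - abs(u - xx)) * (1 - abs(v - yy))
        if 0 <= xx < w and 0 <= yy < h:
            total += weight * img[yy][xx]
    return total


def backward_warp(source, grid):
    """Build the output image: output[y][x] = source sampled at grid[y][x].

    grid[y][x] = (gx, gy) is a normalized source location in [-1, 1],
    stored per OUTPUT (target) pixel.
    """
    h, w = len(source), len(source[0])
    out = []
    for row in grid:
        out_row = []
        for gx, gy in row:
            # Undo the normalization: -1 -> first pixel, +1 -> last pixel.
            u = (gx + 1) / 2 * (w - 1)
            v = (gy + 1) / 2 * (h - 1)
            out_row.append(bilinear_sample(source, u, v))
        out.append(out_row)
    return out


def make_grid(h, w, shift_x=0.0):
    """Grid saying: target pixel (x, y) reads source pixel (x + shift_x, y)."""
    return [
        [(2 * (x + shift_x) / (w - 1) - 1, 2 * y / (h - 1) - 1) for x in range(w)]
        for y in range(h)
    ]


source = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]

# Identity grid: every target pixel reads the same location in the source.
identity = backward_warp(source, make_grid(3, 3))

# Each target pixel reads one pixel to its right in the source;
# the last column falls outside the source and is padded with zeros.
shifted = backward_warp(source, make_grid(3, 3, shift_x=1.0))
```

In this sketch the grid entries are absolute sampling positions in the source image, indexed by target pixel, rather than offsets that push source pixels to new places. An offset-style reading would correspond to the difference between a grid entry and the identity grid at the same target pixel.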

Looking forward to your thoughts, and thank you.