nianticlabs / monodepth2

[ICCV 2019] Monocular depth estimation from a single image

Why learn t->t' transformation instead of t'->t? #87

Closed aj96 closed 5 years ago

aj96 commented 5 years ago

When we do the re-projection loss, why are we transforming target pixel coordinates into source pixel coordinates and then bilinearly interpolating from the source frame and comparing that to the target frame? Aren't you then comparing an image at time-step t (tgt image) to some image at time-step t' (src image)?

Wouldn't it make more sense to transform the source pixel coordinates into the target pixel coordinates and then bilinearly interpolate from the target frame? Then you can compare this predicted target frame with the actual target frame?

daniyar-niantic commented 5 years ago

The comparison is between the target image and a "reconstruction of the target image using source pixels". This makes sense if the depth is aligned to the target frame. Hence, the source is sampled with a backward warp.
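For readers following the thread, here is a minimal sketch of what "sampled with a backward warp" means in practice. It is not monodepth2's actual code; the tensor shapes and the random source coordinates are placeholder assumptions. The key point is that the output is indexed by target pixels even though its colours come from the source image.

```python
import torch
import torch.nn.functional as F

# Placeholder inputs: a source image, plus the (x, y) location in the source
# image that each target pixel reprojects to (in the real pipeline these
# coordinates come from the target-aligned depth and the relative pose).
B, C, H, W = 1, 3, 4, 5
source_image = torch.rand(B, C, H, W)
source_coords = torch.rand(B, H, W, 2) * 2 - 1      # normalised to [-1, 1]

# Backward warp: for every target pixel, LOOK UP the source colour at its
# reprojected location.  The result lives on the target pixel grid, so it can
# be compared directly with the target image.
reconstructed_target = F.grid_sample(
    source_image, source_coords, padding_mode="border", align_corners=False)

print(reconstructed_target.shape)   # torch.Size([1, 3, 4, 5]), same as the target
```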

aj96 commented 5 years ago

@daniyar-niantic I understand that, but why? Why are you comparing an image built from source pixels with an image built from target pixels? Wouldn't it make more sense to backproject the source image coordinates with the depth of the source image, sample pixels from the target image, and compare that to the target image? Doesn't it make more sense to compare an image made of target pixels with an image made of target pixels?

daniyar-niantic commented 5 years ago

@aj96 for the second option you need forward warping, which is difficult to optimize.

aj96 commented 5 years ago

@daniyar-niantic I'm not entirely familiar with the exact definitions of forward warping and inverse warping, but I found this resource: https://www.cse.huji.ac.il/course/2006/impr/lectures2006/Tirgul8_LK.pdf It seems like the only difference is which one you consider the "first" and "second" image. Going from "first" to "second" image is forward warping and going the other way around is inverse warping. It seems like it's just a matter of notation.

What exactly is the difference between forward warping and inverse warping, and why is forward warping more difficult to optimize?

mrharicot commented 5 years ago

There is no easy way to render an RGB-D image using forward mapping. The main approaches are splatting and rendering a mesh, and neither of them is easily differentiable.

https://www.cs.unm.edu/~angel/CS433/LECTURES/CS433_25.pdf
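As a toy illustration of why this is awkward (a sketch under assumed inputs, not how an actual splatting renderer works): pushing each source pixel to a rounded target location involves a hard, non-differentiable indexing step, pixels can collide, and target pixels that nothing lands on are left as holes.

```python
import torch

# Toy forward warp ("splatting"): push each source pixel to a target location.
H, W = 4, 6
source = torch.arange(H * W, dtype=torch.float32).reshape(H, W)

# Assumed correspondence: every pixel moves 1.6 pixels to the right.
xs = torch.arange(W).repeat(H, 1)
target_x = torch.clamp((xs + 1.6).round().long(), max=W - 1)   # hard rounding

target = torch.zeros(H, W)
hit = torch.zeros(H, W, dtype=torch.bool)
for y in range(H):
    for x in range(W):
        target[y, target_x[y, x]] = source[y, x]   # colliding pixels overwrite each other
        hit[y, target_x[y, x]] = True

print(hit[0])   # tensor([False, False, True, True, True, True]) -- holes on the left
```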

aj96 commented 5 years ago

@mrharicot @daniyar-niantic I still don't understand the difference between forward and inverse warping. It seems like the difference is just going from the "first" image to the "second" image, and there's no reason why one image has to be "first" and the other has to be "second". In Google's paper, Learning Depth From Videos in the Wild, their occlusion awareness depends on switching the roles of target and source.

macaodha commented 5 years ago

Perhaps this image is helpful in understanding the difference.

aj96 commented 5 years ago

@macaodha Thank you for the diagram. It was very helpful. But in the diagram you shared, it says that there can be holes when performing forward warping because we do not have the original image. But in our case, we always have the original image to interpolate what the holes should be. We have the original for all time-steps. And once again, in the paper Learning Depth From Videos in The Wild, they switch the roles of the target and source images to perform occlusion awareness.

mdfirman commented 5 years ago

1. Yes, you could interpolate what the values of the holes might be. But this is (I think) non-trivial, and I'm not sure it would actually improve scores. Try it out for yourself!

2. In 'Learning Depth from Videos in the Wild', I'm not sure that they are doing forward warping. I think they are doing backward warping in each direction between the pair of images, though this question is probably best asked of the authors of that paper (https://github.com/google-research/google-research/tree/master/depth_from_video_in_the_wild).

aj96 commented 5 years ago

@mdfirman I'll try it today. A simpler way to test this, though, would be to just compare to the original source image instead of the target image.

@daniyar-niantic said: "The comparison is between the target image and a 'reconstruction of the target image using source pixels'. This makes sense if the depth is aligned to the target frame. Hence, the source is sampled with a backward warp." My main point of confusion was: why even call it a reconstruction of the target image if you are creating it using source pixels? If it is recreated using source pixels, why not compare that reconstructed image with the actual source image when computing the loss? That would be even simpler than swapping the t->t' transformation for the t'->t transformation, but it gets at the same idea.

daniyar-niantic commented 5 years ago

@aj96 I think you are confusing the source image with the reconstruction of the target from source pixels. Assume your target image is the left view and your source image is the right view. Then the reconstruction of the left view with pixels from the right view means moving pixels from the right view to different coordinates: each right-view pixel is moved from its original coordinates to new coordinates such that the reconstruction looks like the left view. In ideal circumstances (perfect depth and no occlusion), the target and the reconstruction of the target would be identical.
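To make the left/right example concrete, here is a small sketch under an assumed constant disparity of 2 pixels (in reality the disparity comes from the predicted depth): the colours are taken from the right view, but they are placed on the left view's pixel grid, so the result is compared against the left view.

```python
import torch
import torch.nn.functional as F

B, C, H, W = 1, 3, 6, 8
right_view = torch.rand(B, C, H, W)                 # source (right view)
disparity = torch.full((B, H, W), 2.0)              # assumed constant disparity

# For each LEFT-view pixel (x, y), read the right view at (x - disparity, y).
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
sample_x = xs.unsqueeze(0) - disparity
sample_y = ys.unsqueeze(0).expand(B, H, W).float()

# Normalise the lookup coordinates to [-1, 1] for grid_sample.
grid = torch.stack([2.0 * sample_x / (W - 1) - 1.0,
                    2.0 * sample_y / (H - 1) - 1.0], dim=-1)

# The reconstruction is made of right-view pixels but lives on the left grid,
# so the photometric loss compares it with the actual left view.
left_reconstruction = F.grid_sample(
    right_view, grid, padding_mode="border", align_corners=True)
```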

aj96 commented 5 years ago

@daniyar-niantic If we are learning the transformation t->t', where t corresponds to the target (left view) and t' corresponds to the source (right view), shouldn't we be moving pixels from the left view to the right view, not from the right view to the left view? If it is indeed moving pixels from the left view to the right view, then that means we should be learning the transformation t'->t, like I suggested in the original question.

daniyar-niantic commented 5 years ago

If you don't agree with the statements above I think you should research more about different warping techniques. If you agree with the statements above, then the conclusion is that we are using backward warping and are predicting depth that is aligned to the target image.

aj96 commented 5 years ago

@daniyar-niantic

I have been using the equation in the original Zhou et al. paper as a reference the entire time:

p_t' ~ K T_{t->t'} D_t(p_t) K^{-1} p_t   (the reprojection equation from Zhou et al.)

They show it as moving target pixel locations to their corresponding source pixel locations. But you consistently speak of it as moving source pixel locations to different coordinates, which I can only assume means target pixel locations?

So the only difference between forward warping and inverse warping is which depth map you use? This cannot be correct. If you are transforming target points to source points, then you must use the depth of the target image. If you are transforming source points to target points, then you must use the depth of the source image.

And I do agree that you are using backward warping. I was just confused how you can interpolate RGB pixels from the source image and then compare those to RGB pixels from the target image. You are essentially comparing images at two different time-steps, right? As shown in the equation from the Zhou et al. paper, we are transforming target points to source points, interpolating from the source image, and comparing that to the target image. If we can't agree on that, then I'm totally lost. If we can, then hopefully you can see why it seems strange to me that you are comparing images from different time-steps.

daniyar-niantic commented 5 years ago

@aj96

There is a fundamental difference, which also determines which depth map is used for warping.

> If you are transforming target points to source points, then you must use the depth of the target image. If you are transforming source points to target points, then you must use the depth of the source image.

This is correct for computing the correspondences only. How you use these correspondences is the difference between forward and backward warping: forward warping does a "splatting" operation, while backward warping does a "lookup" operation. In monodepth the lookup is done with a differentiable sampler.

We don't compare "raw images" from different time steps, we compare image from current time step with the warped image from the other time step.

The difference between backward warping and forward warping is also relevant in optical flow. Please check out these:

http://ctim.ulpgc.es/research_works/computing_inverse_optical_flow/
https://www.cse.huji.ac.il/course/2004/impr/lectures2004/LucasKanade.pdf
https://www.cse.huji.ac.il/course/2006/impr/lectures2006/Tirgul8_LK.pdf
https://www.cs.princeton.edu/courses/archive/fall11/cos429/notes/cos429_f11_lecture08_motion.pdf
https://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect15.pdf

vineeth2309 commented 3 years ago

I had the same doubt for a long time, so I wanted to check whether my understanding of what is happening is correct. In this case we start with the depth map at time frame 0 and transform the corresponding points into time frame +1 using the pose from the model. Projecting these transformed points gives us, for every frame-0 pixel, its location in frame +1, which is then normalised to [-1, 1] to form a sampling grid. This would tell us, for example, that point (1, 1) in the image at time 0 maps to (1.2, 1.2) in the image at time +1. We then use bilinear interpolation to sample the actual image at time +1 to obtain the colour at (1.2, 1.2). Since we know that point (1, 1) at time 0 corresponds to point (1.2, 1.2) at time +1, we simply replace the colour of pixel (1, 1) of image 0 with the colour obtained by bilinear interpolation from the image at time +1. This amounts to inverse-mapping image +1 back to image 0. [image: Inverse Warping]
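That matches the backward-warping recipe discussed above. A compact sketch of those steps in PyTorch follows; it is not monodepth2's actual implementation, and the shapes, the identity intrinsics and the identity pose are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

B, H, W = 1, 6, 8
depth_0 = torch.rand(B, 1, H, W) + 0.5       # depth aligned to frame 0
image_1 = torch.rand(B, 3, H, W)             # image at frame +1
K = torch.eye(3).unsqueeze(0)                # camera intrinsics (identity here)
T_0_to_1 = torch.eye(4).unsqueeze(0)         # predicted pose, frame 0 -> frame +1

# 1) Back-project every frame-0 pixel to a 3D point using the frame-0 depth.
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")
pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
cam_points = depth_0.reshape(B, 1, -1) * (torch.inverse(K) @ pix)

# 2) Move the points into the frame +1 camera with the predicted pose.
cam_points_h = torch.cat([cam_points, torch.ones(B, 1, H * W)], dim=1)
points_1 = (T_0_to_1 @ cam_points_h)[:, :3]

# 3) Project into frame +1 and normalise the pixel coordinates to [-1, 1].
proj = K @ points_1
px = proj[:, 0] / proj[:, 2]
py = proj[:, 1] / proj[:, 2]
grid = torch.stack([2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1], dim=-1)
grid = grid.reshape(B, H, W, 2)

# 4) Bilinearly sample frame +1 at those locations.  The result sits on the
#    frame-0 pixel grid and is compared with the actual frame-0 image.
warped_to_0 = F.grid_sample(image_1, grid, padding_mode="border", align_corners=True)
```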