visinf / self-mono-sf

Self-Supervised Monocular Scene Flow Estimation (CVPR 2020)
Apache License 2.0

Code related clarifications #3

Closed poornimajd closed 4 years ago

poornimajd commented 4 years ago

Hello @hurjunhwa and team! I am using this code to train on my own dataset. In this regard I have a few questions:

1. How did you get the scaling of 0.54 in the following line? https://github.com/visinf/self-mono-sf/blob/c31a1d25cd04b3396056877ace591695002b232a/losses.py#L141

2. In the line below, is dividing the disparity by 256 specific to the KITTI dataset? https://github.com/visinf/self-mono-sf/blob/c31a1d25cd04b3396056877ace591695002b232a/datasets/common.py#L86

3. Regarding the flow, I am not sure what the numbers in the following line mean. Do they need to be changed when using a different dataset? https://github.com/visinf/self-mono-sf/blob/c31a1d25cd04b3396056877ace591695002b232a/utils/flow.py#L35

4. I was able to visualize the scene flow, but I am not sure how to validate it, because I have neither ground-truth flow nor disparity. Can you please help me out in this case? It was mentioned earlier that this line gives the scene flow: https://github.com/visinf/self-mono-sf/blob/5f4e07955351658fa0060e6ecadca6167693b09d/losses.py#L481 But I am unsure what each value in this output scene flow represents. Is it the normalized x, y and z coordinate of the motion vector? Can you please help me understand this line of code?

Any help is greatly appreciated! Thank you

hurjunhwa commented 4 years ago

Hi,

1. This is the baseline distance between the two cameras in the stereo rig: 0.54 m.

2. and 3. Yes, these are specific to the KITTI dataset, which stores disparity and flow as uint16. After loading, disparity and flow are both in pixel units; a decoding sketch follows below.

4. The output scene flow is defined in meters (m), not normalized. It's not easy to validate if you don't have any ground truth or pseudo ground truth for your dataset. As a sanity check, you may check whether the source image and the warped target image (warped using the estimated disparity and scene flow) look similar; a sketch of such a check also follows. If you know the camera intrinsics of your dataset, you may adjust the scale of the estimation accordingly by referring to KITTI's intrinsics.
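For reference, a minimal decoding sketch following the standard KITTI devkit convention that those two lines implement (the function names and file paths are illustrative, not this repo's API; zero values mark invalid pixels):

```python
import cv2
import numpy as np

def read_kitti_disp(path):
    # KITTI disparity maps are single-channel uint16 PNGs: disp_px = value / 256
    raw = cv2.imread(path, cv2.IMREAD_ANYDEPTH)
    valid = raw > 0                           # 0 encodes "no ground truth here"
    return raw.astype(np.float32) / 256.0, valid

def read_kitti_flow(path):
    # KITTI flow maps are 3-channel uint16 PNGs; OpenCV loads them as BGR,
    # so channel 2 holds u, channel 1 holds v, channel 0 the validity mask.
    raw = cv2.imread(path, cv2.IMREAD_ANYDEPTH | cv2.IMREAD_COLOR).astype(np.float32)
    flow_u = (raw[:, :, 2] - 2 ** 15) / 64.0
    flow_v = (raw[:, :, 1] - 2 ** 15) / 64.0
    valid = raw[:, :, 0] > 0
    return np.stack([flow_u, flow_v], axis=-1), valid
```

And a rough sketch of the warping sanity check mentioned in 4., assuming a pinhole camera with a 3x3 intrinsics matrix `K` (the function name, arguments, and the default baseline are assumptions for illustration, not code from this repo):

```python
import torch
import torch.nn.functional as F

def warp_with_sceneflow(img_tgt, disp, sceneflow, K, baseline=0.54):
    # Back-project each source pixel to 3D with the estimated disparity,
    # move it by the estimated scene flow, re-project it, and sample the
    # target image there. The result should resemble the source image.
    B, _, H, W = disp.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    u = u.to(disp.device).expand(B, H, W)
    v = v.to(disp.device).expand(B, H, W)

    z = fx * baseline / disp.squeeze(1).clamp(min=1e-6)   # metric depth
    x = (u - cx) / fx * z + sceneflow[:, 0]               # add 3D motion (m)
    y = (v - cy) / fy * z + sceneflow[:, 1]
    z = z + sceneflow[:, 2]

    u2 = fx * x / z.clamp(min=1e-6) + cx                  # re-project
    v2 = fy * y / z.clamp(min=1e-6) + cy
    grid = torch.stack([2.0 * u2 / (W - 1) - 1.0,         # normalize to [-1, 1]
                        2.0 * v2 / (H - 1) - 1.0], dim=-1)
    return F.grid_sample(img_tgt, grid, align_corners=True)
```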

Best, Jun

poornimajd commented 4 years ago

Thanks for the quick and detailed reply!

> The output scene flow is defined in the meter scale (m) world coordinate, not normalized.

Also, a small clarification on this. For example, if the output of `out_sceneflow` is the following,


```
tensor([[[[-0.3044, -0.3008, -0.2973,  ...,  0.4740,  0.4754,  0.4768],
          [-0.3038, -0.3004, -0.2970,  ...,  0.4728,  0.4741,  0.4753],
          [-0.3032, -0.3000, -0.2967,  ...,  0.4716,  0.4727,  0.4738],
          ...,
          [-0.1817, -0.1817, -0.1817,  ...,  0.1915,  0.1919,  0.1922],
          [-0.1817, -0.1817, -0.1818,  ...,  0.1915,  0.1919,  0.1923],
          [-0.1816, -0.1817, -0.1818,  ...,  0.1916,  0.1920,  0.1924]],

         [[-0.0584, -0.0579, -0.0574,  ..., -0.0122, -0.0131, -0.0140],
          [-0.0585, -0.0579, -0.0574,  ..., -0.0113, -0.0123, -0.0132],
          [-0.0585, -0.0579, -0.0574,  ..., -0.0104, -0.0114, -0.0124],
          ...,
          [ 0.0720,  0.0713,  0.0706,  ...,  0.0674,  0.0678,  0.0682],
          [ 0.0723,  0.0716,  0.0708,  ...,  0.0677,  0.0681,  0.0685],
          [ 0.0725,  0.0718,  0.0711,  ...,  0.0679,  0.0683,  0.0687]],

         [[-1.0730, -1.0736, -1.0742,  ..., -0.8998, -0.8987, -0.8977],
          [-1.0746, -1.0751, -1.0757,  ..., -0.9013, -0.9002, -0.8992],
          [-1.0761, -1.0766, -1.0771,  ..., -0.9027, -0.9017, -0.9007],
          ...,
          [-1.2378, -1.2389, -1.2400,  ..., -1.1398, -1.1389, -1.1380],
          [-1.2371, -1.2383, -1.2395,  ..., -1.1393, -1.1384, -1.1375],
          [-1.2364, -1.2376, -1.2389,  ..., -1.1388, -1.1379, -1.1369]]]],
       device='cuda:0')
```

with size `[1, 3, 370, 1226]`, then the x coordinate (in meters) at the first pixel is -0.3044, similarly y (in meters) is -0.0584 and z (in meters) is -1.0730, right? This means the object has moved by these values in each direction compared to the previous frame, right?

Sorry for the overly in-depth analysis. Any suggestion is appreciated!

hurjunhwa commented 4 years ago

Hi,

Yes, that's right. You can find the definition of the coordinate system and the calibration information in the KITTI paper or in the devkit on the dataset web page: https://www.mrt.kit.edu/z/publ/download/2013/GeigerAl2013IJRR.pdf (Fig. 1, the red-colored coordinate system). No worries!
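To make the indexing concrete, here is a small illustrative snippet (the name `out_sceneflow` follows this thread; the pixel location is arbitrary):

```python
# out_sceneflow has shape [1, 3, H, W]; the three channels are the (x, y, z)
# motion in meters, in the KITTI camera frame (x right, y down, z forward).
dx, dy, dz = out_sceneflow[0, :, 0, 0].tolist()   # motion of the top-left pixel
print(dx, dy, dz)   # e.g. -0.3044 -0.0584 -1.0730 in the dump above
```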

rohaldb commented 3 years ago

I am running this on a custom dataset, and I'm confused about this line: https://github.com/visinf/self-mono-sf/blob/c31a1d25cd04b3396056877ace591695002b232a/losses.py#L141

If during evaluation we supply the model with only monocular images, can we leave this value at 0.54? It doesn't really make sense to set it to the distance between two cameras when there is only one.

Thanks!

hurjunhwa commented 3 years ago

Hi, yes, you can leave it as it is when testing on a custom dataset. Then, of course, the scale of the output depth and scene flow is unknown. The value is only used for the KITTI dataset to recover the scale of depth and scene flow, given the camera intrinsics and the stereo baseline distance (see the sketch below).
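For context, a minimal sketch of how that constant recovers metric depth on KITTI (the focal length below is an illustrative KITTI-like value; read the real one from the calibration files):

```python
# On a calibrated stereo rig, disparity (in pixels) maps to metric depth:
#   depth = fx * baseline / disparity
fx = 721.5          # focal length in pixels (illustrative, from calibration)
baseline = 0.54     # stereo baseline in meters

def disp_to_depth(disp_px: float) -> float:
    return fx * baseline / max(disp_px, 1e-6)

print(disp_to_depth(30.0))   # ~12.99 m for a 30-pixel disparity
```

Without a known baseline and intrinsics, the same disparity map only gives depth up to an unknown scale.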

rohaldb commented 3 years ago

Thanks so much for the speedy reply!

Just to clarify, it would be unknown up to scale and shift, not just scale, correct? Apologies if this is a trivial question; my graphics/vision background is not so strong!

hurjunhwa commented 3 years ago

Yes, you are right :) Both scale and shift.

By the way, as you probably know, at CVPR this year there is a very nice paper that recovers scale and shift, including the focal length: "Learning to Recover 3D Scene Shape from a Single Image". It would also be an interesting read!