noahzn / Lite-Mono

[CVPR2023] Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation
MIT License

Training on Mid-Air Dataset #129

Closed Amor-ZYF closed 6 months ago

Amor-ZYF commented 7 months ago

Thanks for your work! I want to train the lite-mono-8m model on the Mid-Air dataset. In this dataset the depth maps are expressed in meters and stored as 16-bit float matrices in lossless 1-channel PNGs (an example depth map is attached). This is what I did to create a MidAirDataset class (see the two attached screenshots). I then changed min_depth to 1 and max_depth to 1250, since that is the depth range of the Mid-Air dataset, and in the function compute_depth_loss I changed the related parameters to fit my dataset. But when I checked TensorBoard I found that all the metrics went bad, and the predicted depth turned out to be 1.00000 for all pictures. Could you please help me analyze what the problem is? I'd appreciate your reply!
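Roughly, the decoding looks like this (a simplified sketch, assuming the float16 bit patterns are stored as uint16 pixels in the PNG, as the Mid-Air docs describe):

```python
import numpy as np
from PIL import Image

def get_depth(depth_path):
    # Mid-Air stores metric depth as float16 values whose raw bit
    # patterns are saved in a 16-bit single-channel PNG, so read the
    # uint16 pixels and reinterpret them as half-precision floats.
    raw = np.asarray(Image.open(depth_path), dtype=np.uint16)
    return raw.view(dtype=np.float16).astype(np.float32)
```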

noahzn commented 7 months ago

Hi, I'd like some more information.

  1. Is the intrinsics matrix correctly computed? Here is a reference.
  2. Can you try using a smaller depth range? Since the disp_to_depth function converts a sigmoid output to a depth value, the range [0, 1250] may be too large (see the sketch below).
  3. > I found that the depth that I predicted is 1.00000 for all pictures.

Do you mean that all the pixels are predicted to be 1? Could you show some predicted depth images? In your get_depth function, if you use np.log for the ground truth, it probably also needs the same operation on the prediction?
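For context, here is how disp_to_depth works in Monodepth2-style code, which Lite-Mono follows (a sketch; check the repository for the exact version):

```python
def disp_to_depth(disp, min_depth, max_depth):
    # Map the network's sigmoid output (in [0, 1]) linearly between
    # 1/max_depth and 1/min_depth in disparity space, then invert
    # to obtain depth.
    min_disp = 1 / max_depth
    max_disp = 1 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    depth = 1 / scaled_disp
    return scaled_disp, depth
```

With max_depth = 1250, all depths beyond roughly 10 m are squeezed into sigmoid values below 0.1, which is one reason a very large range can be hard to learn.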

  4. If you change the depth range, you also need to change them in options.py.

Amor-ZYF commented 7 months ago

  1. Yes, I have already computed the intrinsics matrix (see the sketch after this list). As the technical specifications of the Mid-Air dataset point out, for an image of height h and width w the matrix is given by fx = cx = w/2 and fy = cy = h/2, so I think I have the correct intrinsics matrix.
  2. In the function process_batch I saved the outputs[('disp', scale)] of one picture as txt files, and the values look weird (see the attached disp0 dump): the values of outputs[('disp', 0)] are all around 0.50000. Do you think this is related to the range [0, 1250]?
  3. As shown above, the values of outputs[('disp', 0)] are around 0.50000, and I think that leads to the bad depth prediction (see the attached depth image). In my get_depth function I followed the example script of the Mid-Air dataset to decode the ground-truth depth maps. But even though I use np.log for the ground truth, it shouldn't have any influence on outputs[('disp', scale)], should it? I think I only need to apply the same operation when I want to recover the real-world depth from outputs[('depth', 0, scale)].
  4. I have already changed the depth range in options.py.
  5. One more thing confuses me: in the images shown in TensorBoard, even though the metrics are bad, the reconstructed frame looks the same as the target frame (see the attached comparison; the first frame is the target and the second is the reconstruction). If my predicted depth is wrong, how can the reconstructed frame look correct?
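For point 1, this is essentially the normalized intrinsics matrix I end up with (written out here as a sketch, following the Monodepth2/Lite-Mono convention):

```python
import numpy as np

# Normalized 4x4 intrinsics (Monodepth2/Lite-Mono convention): the
# dataloader later multiplies row 0 by the image width and row 1 by
# the height at each pyramid scale. With fx = cx = w/2 and
# fy = cy = h/2, every normalized entry becomes 0.5.
K = np.array([[0.5, 0,   0.5, 0],
              [0,   0.5, 0.5, 0],
              [0,   0,   1,   0],
              [0,   0,   0,   1]], dtype=np.float32)
```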

Thank you for taking the time to help me!

noahzn commented 7 months ago

Hi, by the definition of the sigmoid function, if all the outputs are close to 0.5, the inputs to the sigmoid are almost all 0. Have you normalized your images before feeding them into the network? Can you check this first?
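To see this, and as a hypothetical check on your dataloader (the ("color_aug", 0, 0) key follows the Monodepth2 convention):

```python
import torch

# sigmoid(0) = 0.5, so disparity maps stuck near 0.5 mean the
# decoder's pre-activation logits are close to 0 everywhere.
print(torch.sigmoid(torch.zeros(3)))  # tensor([0.5000, 0.5000, 0.5000])

# Hypothetical sanity check on one batch: after ToTensor() the color
# inputs should lie in [0, 1]; raw 0-255 floats would point to a
# missing normalization step.
# inputs = next(iter(train_loader))
# img = inputs[("color_aug", 0, 0)]
# print(img.min().item(), img.max().item())
```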

noahzn commented 6 months ago

@Amor-ZYF Hi, have you found what the problem is?

Amor-ZYF commented 6 months ago

Not yet, haha. But I guess it is because this dataset is very different from self-driving datasets: it contains a lot of grass and leaves, all green, which may make the model predict the same depth value for a whole area. That may lead to a huge difference between the prediction and the ground truth.

noahzn commented 6 months ago

I think the problem is not from the dataset itself. I have also tried my network on real indoor drone images and high-altitude drone images and both worked.

@Amor-ZYF Also, please make sure that adjacent frames have obvious motion. You can use this option to control that. If your dataset has a high frame rate, you probably need to change it to, for example, [0, -5, 5] (see the sketch below).
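For reference, a sketch of the Monodepth2-style --frame_ids option that controls this (the actual flag lives in options.py):

```python
import argparse

# Sketch of the Monodepth2-style --frame_ids option: the relative
# frame offsets loaded for the photometric loss. For a high-fps
# dataset, wider offsets give adjacent frames a larger baseline.
parser = argparse.ArgumentParser()
parser.add_argument("--frame_ids", nargs="+", type=int,
                    default=[0, -1, 1],
                    help="relative frame offsets to load")

opts = parser.parse_args(["--frame_ids", "0", "-5", "5"])
print(opts.frame_ids)  # [0, -5, 5]
```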

noahzn commented 6 months ago

I am now closing this thread due to lack of response. You can reopen it or create a new issue if you have further questions.