nianticlabs / monodepth2

[ICCV 2019] Monocular depth estimation from a single image

Pose on identity transformation #343

Closed VladimirYugay closed 1 year ago

VladimirYugay commented 3 years ago

Hey there,

Thanks for your work!

To check whether the method is applicable to my dataset, I've intentionally overfitted the network on one specific sequence with a moving camera. For most of the sequence the inter-frame baseline is small (rotation is close to identity and translation is also small), but sometimes the camera turns by a large angle.

However, the predicted egomotion is always very close to the identity matrix for every frame.

These two images are an example of a time when camera movement is low:

[images: frames 0000 and 0001]

These two images are an example of a time when camera rotation is large:

[images: frames 0344 and 0345]

Is it even possible to train the network on this sort of data?

mdfirman commented 3 years ago

It may well be. Might be worth giving it a go!

But for two sets of images:

It might be worth looking at the KITTI sequences (https://www.youtube.com/watch?v=KXpZ6B1YB_k) to get a sense of what types of camera motion monodepth2 works well with.

VladimirYugay commented 3 years ago

Thanks for your response!

I've increased the baseline to simulate visual motion like in KITTI, sampling frame ids -5, 0, 5.

Before starting full training, I've tried to overfit on a small subset of data, taking only 5 triplets of images from the same sequence with substantial egomotion. I've trained from scratch (except the encoders) and got the following disparity maps:

[images: disparity maps for frame 0045]

The contours look very accurate, but the depth is far from the ground truth. While the real depth ranges from 0 to 80 meters, the depth obtained from the disparity map above with disp_to_depth is almost flat, nearly constant over the whole image, ranging from 1.9 to 1.92.

Have you seen similar behavior before, or do you have any hints on how to make the training more stable?
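For reference, the conversion I'm using is monodepth2's disp_to_depth from layers.py (reproduced roughly from memory below), with the default min_depth=0.1 and max_depth=100. A nearly constant sigmoid output explains the nearly constant depth I'm seeing:

    import numpy as np

    def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
        # Convert the network's sigmoid output into depth, constrained
        # to lie between min_depth and max_depth.
        min_disp = 1 / max_depth            # 0.01
        max_disp = 1 / min_depth            # 10.0
        scaled_disp = min_disp + (max_disp - min_disp) * disp
        depth = 1 / scaled_disp
        return scaled_disp, depth

    # A flat sigmoid output of ~0.05 everywhere maps to a depth of ~1.96 everywhere,
    # which is consistent with the 1.9-1.92 range above.
    _, depth = disp_to_depth(np.full((192, 640), 0.05))
    print(depth.min(), depth.max())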

mdfirman commented 3 years ago

I haven't seen things quite like this before – it looks quite strange.

Have you adjusted the intrinsics in the dataloader to reflect your dataset's intrinsics? This is important; otherwise the training isn't going to work.

Separately:

Are you able to render stereo pairs from this dataset? If so, it is much easier to debug intrinsics, dataloading, and training with a stereo dataset. Once that works, you can switch to the harder task of mono training.

VladimirYugay commented 3 years ago

Yes, sure, I've set it up in my dataloader:

        self.K = np.array([[1158, 0, 960, 0],
                           [0, 1158, 540, 0],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=np.float32)

What do you mean by stereo pairs? I don't really have any stereo pairs in my dataset; it's monocular.

mdfirman commented 3 years ago

Ah – it looks like you might be using unnormalised intrinsics. Take a look at the KITTI intrinsics we use and the comment above them:


        # NOTE: Make sure your intrinsics matrix is *normalized* by the original image size.
        # To normalize you need to scale the first row by 1 / image_width and the second row
        # by 1 / image_height. Monodepth2 assumes a principal point to be exactly centered.
        # If your principal point is far from the center you might need to disable the horizontal
        # flip augmentation.
        self.K = np.array([[0.58, 0, 0.5, 0],
                           [0, 1.92, 0.5, 0],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=np.float32)

We would expect the numbers in your normalised K to be around 1.0, rather than around 1000.
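
As a rough sketch, assuming your original images are 1920x1080 and using the intrinsics you posted, the normalization would look something like:

    import numpy as np

    width, height = 1920, 1080  # original image size (assumed)

    K = np.array([[1158, 0, 960, 0],
                  [0, 1158, 540, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=np.float32)

    K_norm = K.copy()
    K_norm[0, :] /= width    # scale the first row by 1 / image_width
    K_norm[1, :] /= height   # scale the second row by 1 / image_height
    # fx -> ~0.60, fy -> ~1.07, principal point -> (0.5, 0.5)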

On stereo pairs: that's ok – if you don't have them, that's fine. (If you did, they can be really useful for debugging.)

VladimirYugay commented 3 years ago

Thanks for the hint. I tried that, but even after scaling the intrinsics properly I get almost the same predictions and training metrics. The loss itself is low and doesn't decrease (you can ignore depth_rmse).

[screenshot: training metrics]

I've also tried training on only three triplets from the KITTI dataset with frame ids 0, -1, 1 to check whether I can overfit on them.

[screenshot: training metrics]

The results look somewhat similar to what I got with my own dataset (you can ignore depth_rmse; this is depth, not disparity).

[image: predicted depth for frame 0_000000]

mrharicot commented 3 years ago

Hi, could you try using a KITTI-pretrained model on your GTA V data? Usually, if the intrinsics are good, the model should be able to fine-tune on the new data (as long as the relative poses aren't too difficult to learn).

VladimirYugay commented 3 years ago

Yes, I tried that. More or less the same obscure depth maps. I've also tried visualizing the prediction and target in compute_loss after training for 20 epochs (warped images from the -10 and 10 frames vs the target image):

[images: warped frames 0_-10 and 0_10]

My current problem seems very similar to this issue, but there's no answer there. Also, I've seen quite a few comments of a similar nature: the loss sits around 0.13 and just doesn't converge.

daniyar-niantic commented 3 years ago

What are the intrinsics after normalization?

VladimirYugay commented 3 years ago

Originally, for images of size 1080x1920 the intrinsics are:

  self.K = np.array([
                     [1158, 0,  960, 0],
                     [0,  1158, 540, 0],
                     [0, 0, 1, 0],
                     [0, 0, 0, 1]], dtype=np.float32)

I'm resizing the images to 576x960 for training, and scale the intrinsics accordingly:

[[579.    0.      480.   0. ]
 [0.     617.6    288.   0. ]
 [  0.    0.       1.    0. ]
 [  0.    0.       0.    1. ]]

Finally, after normalizing (this one is used throughout training):

[[0.603 0.    0.5   0.   ]
 [0.    1.072 0.5   0.   ]
 [0.    0.    1.    0.   ]
 [0.    0.    0.    1.   ]]

mdfirman commented 3 years ago

Thanks for these – the intrinsics look like they're in a reasonable range, I think. (BTW, where did you get the focal lengths for this GTA data?)
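
For context, and as I recall it, the dataloader then multiplies this normalized K back up to the training resolution at each scale, roughly like MonoDataset does (the 640x192 resolution below is just the KITTI default):

    import numpy as np
    import torch

    # Roughly what MonoDataset.__getitem__ does with the normalized intrinsics:
    # rescale them to the training resolution at each pyramid scale.
    K_norm = np.array([[0.603, 0, 0.5, 0],
                       [0, 1.072, 0.5, 0],
                       [0, 0, 1, 0],
                       [0, 0, 0, 1]], dtype=np.float32)
    width, height, num_scales = 640, 192, 4
    inputs = {}
    for scale in range(num_scales):
        K = K_norm.copy()
        K[0, :] *= width // (2 ** scale)
        K[1, :] *= height // (2 ** scale)
        inputs[("K", scale)] = torch.from_numpy(K)
        inputs[("inv_K", scale)] = torch.from_numpy(np.linalg.pinv(K))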

Do the depths from the pretrained KITTI model look reasonable? If so, you can use the KITTI model to check if everything is set up correctly without needing to retrain:

If it looks bad, some things to check:

Overall, for this sort of debugging, tensorboard is your friend. Check you can get reasonable reprojected images. And you can do this without retraining.
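
As a rough sketch of that reprojection check, using BackprojectDepth and Project3D from layers.py (the dummy tensors here just stand in for the dataloader and network outputs):

    import torch
    import torch.nn.functional as F
    from layers import BackprojectDepth, Project3D

    B, H, W = 1, 192, 640
    backproject = BackprojectDepth(B, H, W)
    project = Project3D(B, H, W)

    # Dummy stand-ins; in practice take these from the dataloader and the networks.
    depth = 10.0 * torch.ones(B, 1, H, W)     # predicted depth for the target frame
    K = torch.eye(4).unsqueeze(0)             # intrinsics scaled to (W, H), as a 4x4
    K[:, 0, 0], K[:, 1, 1] = 0.603 * W, 1.072 * H
    K[:, 0, 2], K[:, 1, 2] = 0.5 * W, 0.5 * H
    inv_K = torch.inverse(K)
    T = torch.eye(4).unsqueeze(0)             # predicted target -> source pose
    source_image = torch.rand(B, 3, H, W)

    cam_points = backproject(depth, inv_K)    # lift target pixels to 3D
    pix_coords = project(cam_points, K, T)    # project them into the source view
    warped = F.grid_sample(source_image, pix_coords, padding_mode="border")
    # If depth, pose and intrinsics are consistent, `warped` should resemble the target image.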

Let us know how you get on!

VladimirYugay commented 3 years ago

Thanks for such an elaborate response! It definitely helped me a lot!

  1. Intrinsics are taken from here. They're definitely correct: I've constructed point clouds with them and they look really good.

  2. I just wiped out the whole repo and started from "scratch" again. The loss finally started to go down. I disabled flip augmentation, kept the KITTI image size, considered the KITTI and GTA datasets separately, and checked the following options:

Maybe this happens because the problem itself has infinitely many solutions.

  1. I actually have ground-truth depth maps in meters, and while the predicted depth looks fine visually, even after overfitting it is not on the same scale. For example, the predicted value is 2 while the real one is 11. In the repo you only scale the depth for the stereo setup, where the transformation between cameras is known. Is there any way to do this with the ground-truth depth maps in a monocular setup (see the sketch below)?
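
The only monocular workaround I'm aware of is per-image median scaling at evaluation time, as monodepth2's KITTI evaluation does; it aligns the scale against ground truth but doesn't give metric depth during training. Roughly:

    import numpy as np

    def median_scale(pred_depth, gt_depth, min_depth=1e-3, max_depth=80.0):
        # Align a monocular prediction to ground truth using the ratio of medians,
        # as in the standard monocular KITTI depth evaluation.
        mask = (gt_depth > min_depth) & (gt_depth < max_depth)
        ratio = np.median(gt_depth[mask]) / np.median(pred_depth[mask])
        return pred_depth * ratio, ratio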

mdfirman commented 3 years ago

Ok great – so it works when starting from KITTI, but it's the scratch training which is still the problem?

Might be worth trying the --v1_multiscale option, to see if that helps. Not quite sure why it would, but worth trying.

Do you have any more images from your dataset to share? I wonder if the types of textures in them might just be especially problematic.

VladimirYugay commented 3 years ago

It works when starting from the KITTI checkpoint: visually it looks good on GTA, but quantitatively not really.

Yes, would be great to have the ability to train it from scratch.

Well, I can't overfit even on a KITTI tracking triplet (left camera), so maybe texture is not the problem. I've just taken 000000.png, 000001.png, 000002.png from sequence 0000. The loss goes down, but the disparity doesn't look good. The only thing I've changed was get_image_path in KITTIRAWDataset:

[screenshots]
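
For reference, the change is roughly the following (sketched from memory; the flat <sequence>/<frame>.png layout is just how my local copy of the tracking data is stored):

    import os

    from datasets import KITTIRAWDataset  # monodepth2 dataset class


    class KITTITrackingDataset(KITTIRAWDataset):
        """Minimal override to read left-camera tracking frames stored as
        <data_path>/<sequence>/<frame>.png, e.g. 0000/000000.png."""

        def get_image_path(self, folder, frame_index, side):
            f_str = "{:06d}{}".format(frame_index, self.img_ext)
            return os.path.join(self.data_path, folder, f_str)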

mdfirman commented 3 years ago

Yes, I would expect overfitting to small sections of KITTI to be a problem – this type of self-supervised training really benefits from seeing lots of different images, which helps the network get out of local minima.

How many GTA images are you training from?

VladimirYugay commented 3 years ago

Ok, I see, maybe it's the same for GTA. On GTA I also tried only 3 triplets, 9 images in total. But in general, I have around a million.

What do you think might be a reasonable number of triplets to overfit on?

mdfirman commented 3 years ago

Ah ha! This could be the problem, I'd hope.

I'd take a similar number as are in the KITTI dataset (I forget exactly how many)

But maybe at least 10k triplets?

mdfirman commented 3 years ago

Please do let us know if you have any progress – we'd love to see some GTA-trained depth maps!

VladimirYugay commented 3 years ago

@mdfirman yes, of course, I'm currently training, will post as soon as ready!

VladimirYugay commented 3 years ago

@mdfirman I've finally finished my experiments. Here are some outputs in case someone is interested.

TLDR:

  1. The model fails to overfit on a small data sample and needs a certain number of triplets (8k in my case)
  2. Silhouettes in the depth maps look fine, but the scale is off
  3. The best option is to start from a checkpoint
  4. The training itself is not really stable
  5. Egomotion quality has not been checked visually yet

Some details on the dataset. It's not really a generic GTA dataset, but rather a GTA-based dataset for specific scenarios, with many sequences and really crowded scenes viewed from a pedestrian's viewpoint. The dataset should be published soon and is currently in pre-print.

The dataset contains both dynamic- and static-camera sequences and has a really small baseline. I've sampled only the dynamic sequences, taking every 5th image to form the triplets so as to simulate the motion in KITTI. I've also increased the resolution to 540x960. For a first attempt, I've selected only 8k training images and 1k validation images. The dataset also contains ground-truth depth and egomotion, so I was able to compute validation losses for them.

I tried the following things:

Method                                     Depth RMSE    Egomotion L1
8k_hr_pretrained (20 epochs)               24.49         0.02
8k_hr_pretrained_5 (20 epochs)             20.74         0.13
8k_hr (30 epochs, init with imagenet)      24.87         0.02
8k_hr_5 (30 epochs, init with imagenet)    22.62         0.11
8k_hr_5 (30 epochs, scratch)               15.6 (Dead)   0.11 (Dead)



Here, 8k is the number of images, hr stands for high resolution, and `_5` is the window from which we take the images (in this case -5, 0, 5). Obviously, the smaller the window, the smaller the egomotion error.

Although training from scratch, without ImageNet initialization, gives the smallest error, the depth was completely flat (close to 0 everywhere) even though the reconstructed images looked fine. The pre-trained model options worked best, so that setup was selected for training on the larger dataset.

I trained the model with checkpoint initialization and -5, 0, 5 frame ids on 50k training images and 12k validation images for 50 epochs. The depth silhouettes look fine, but the scale is very different: for instance, where the true depth is 17, the prediction is around 3. Moreover, there is the infinite-depth problem, but that was already discussed in the issues here, and I'm thinking of moving here for this particular issue to make use of the segmentation masks we also have.

Regarding the losses, the training doesn't look stable. I'm curious whether you also saw similar loss/metric behavior:

[screenshot: training losses]

Depth metrics:

[screenshot: depth metrics]

Egomotion metrics:

[screenshot: egomotion metrics]

daniyar-niantic commented 1 year ago

Thanks for reporting back!