qinenergy / corda

[ICCV 2021] Code for our paper Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation

How to obtain your depth datasets? #5

Closed tudragon154203 closed 3 years ago

tudragon154203 commented 3 years ago

Hi, thanks for your great work!

It would be great if you could elaborate on how you obtained the monocular depth estimates.

I understand that you've uploaded the dataset, but it would be really helpful to know exactly how you generated it.

From your paper, in the ablation study part: "We would like to highlight that for both stereo and monocular depth estimations, only stereo pairs or image sequences from the same dataset are used to train and generate the pseudo depth estimation model. As no data from external datasets is used, and stereo pairs and image sequences are relatively easy to obtain, our proposal of using self-supervised depth have the potential to be effectively realized in real-world applications."

So I imagine you obtain your monocular depth pseudo ground truth by:

  1. Downloading target-domain videos (here Cityscapes; by the way, where do you get the Cityscapes videos?)
  2. Training a Monodepth2 model on those videos (for how long?)
  3. Using the model to generate the pseudo ground truth (rough sketch below)

Then repeating the procedure for the source domain (GTA 5 or Synthia).
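For step 3, I imagine something like the following untested sketch, based on Monodepth2's `test_simple.py` (the checkpoint paths, the flat `frames/` folder, and the output format are my guesses, not your actual code):

```python
# Rough sketch of step 3 (untested): export pseudo depth labels with a
# trained Monodepth2 checkpoint. Follows the layout of Monodepth2's
# test_simple.py; all paths here are assumptions.
import os
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

import networks                   # from https://github.com/nianticlabs/monodepth2
from layers import disp_to_depth  # also from the Monodepth2 repo

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ResNet50 encoder + depth decoder, restored from a self-trained checkpoint.
encoder = networks.ResnetEncoder(50, False)
enc_dict = torch.load("models/encoder.pth", map_location=device)
feed_w, feed_h = enc_dict["width"], enc_dict["height"]
encoder.load_state_dict({k: v for k, v in enc_dict.items()
                         if k in encoder.state_dict()})
decoder = networks.DepthDecoder(encoder.num_ch_enc, scales=range(4))
decoder.load_state_dict(torch.load("models/depth.pth", map_location=device))
encoder.to(device).eval()
decoder.to(device).eval()

os.makedirs("pseudo_depth", exist_ok=True)
with torch.no_grad():
    for name in sorted(os.listdir("frames")):
        img = Image.open(os.path.join("frames", name)).convert("RGB")
        x = transforms.ToTensor()(img.resize((feed_w, feed_h)))[None].to(device)
        disp = decoder(encoder(x))[("disp", 0)]     # sigmoid disparity map
        _, depth = disp_to_depth(disp, 0.1, 100.0)  # Monodepth2's scaling
        np.save(os.path.join("pseudo_depth", name + ".npy"),
                depth.squeeze().cpu().numpy())
```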

Am I getting this right? And are there any other important points you would highlight when computing such depth labels?

Regards, Tu

lhoyer commented 3 years ago

Yes, you are right about the general procedure. You can get the Cityscapes videos from the official download page (leftImg8bit_sequence_trainvaltest.zip); you may have to apply for access to these particular files. For GTA, you can find the video sequences at https://playing-for-benchmarks.org/download/.

Here are some additional details on the depth estimation procedure.

Monocular Depth Estimation

For self-supervised monocular depth estimation from image sequences, we follow the implementation of Godard et al. [1]. In particular, we use a ResNet50 backbone with a U-Net decoder, trained for 200k iterations with a batch size of 4 and initial learning rates of 1e-5 for the encoder and 1e-4 for the decoders. After 150k iterations, the learning rates are decreased by a factor of 10. In contrast to Monodepth2, we add an ASPP module with dilation rates 3, 6, and 9 between the encoder and decoder for multi-scale context feature aggregation, use BatchNorm in the decoder for faster convergence, and apply random 512x512 cropping for data augmentation.
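To make these modifications more concrete, here is a rough, untested PyTorch sketch of the ASPP bottleneck and the two-rate optimizer schedule described above (all class and variable names are placeholders, not the actual training code):

```python
# Untested sketch of the setup described above: an ASPP bottleneck with
# dilation rates 3/6/9, per-module learning rates, and the step decay after
# 150k iterations. All names are illustrative placeholders.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling between encoder and decoder."""
    def __init__(self, in_ch, out_ch, rates=(3, 6, 9)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.project = nn.Conv2d(len(rates) * out_ch, out_ch, 1)

    def forward(self, x):
        # Concatenate the parallel dilated branches, then fuse with a 1x1 conv.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

# Stand-ins for the ResNet50 encoder and the U-Net decoder with BatchNorm.
encoder = nn.Conv2d(3, 64, 7)
decoder = nn.Conv2d(64, 1, 3)
aspp = ASPP(in_ch=2048, out_ch=256)  # ResNet50's last stage has 2048 channels

# 1e-5 for the encoder, 1e-4 for the decoders; decayed by 10x after 150k of
# the 200k training iterations (scheduler stepped once per iteration).
optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": list(decoder.parameters()) + list(aspp.parameters()), "lr": 1e-4},
])
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150_000], gamma=0.1)
```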

Stereo Depth Estimation

Depth estimates can also be generated from stereo pairs. In this work, we use the publicly available stereo estimates generated by Sakaridis et al. [2].
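In case it helps, converting stereo disparity to metric depth uses the standard relation depth = baseline * focal_length / disparity. Below is a rough sketch for the raw Cityscapes disparity PNGs; the encoding follows the Cityscapes documentation, the baseline/focal values are the nominal stock-rig calibration (the per-image camera JSONs give the exact parameters), and the refined maps from [2] may ship in a different format:

```python
# Untested sketch: decode a Cityscapes disparity PNG and convert it to metric
# depth via depth = baseline * focal / disparity. Per the Cityscapes README,
# disparity = (p - 1) / 256 for p > 0, and p == 0 marks invalid pixels.
# Baseline/focal are nominal values; see the dataset's camera/*.json files.
import numpy as np
from PIL import Image

def disparity_png_to_depth(path, baseline_m=0.209313, focal_px=2262.52):
    p = np.asarray(Image.open(path), dtype=np.float32)
    disparity = np.where(p > 0, (p - 1.0) / 256.0, 0.0)
    depth = np.zeros_like(disparity)
    valid = disparity > 0
    depth[valid] = baseline_m * focal_px / disparity[valid]
    return depth  # meters; 0 marks invalid pixels
```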

[1] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

[2] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 126(9):973–992, September 2018.

tudragon154203 commented 3 years ago

Thank you so much. That's much clearer to me now!