Closed: Yevkuzn closed this issue 3 years ago
Monodepth2 and Depth-VO-Feat are only two of our 20+ baselines; there are plenty of other methods that use high-resolution inputs and large models. For example, Depth Hints can be considered an improved version of Monodepth2.
I am afraid that is not quite what I asked. Fair comparison to Monodepth2 is important for 2 reasons: 1) It is one of the best performing methods trained on monocular videos; 2) Your model is basically Monodepth2 + proposed improvement, so fair comparison is essential to see the effect of the proposed method.
I don't want to sound rude, but from the point of view of an experienced reader, your performance gain seems to be the result of a bigger encoder. I am really curious why the reviewers at ECCV did not insist on a comparison to Monodepth2 with a ResNet50 encoder and 320×1024 input resolution.
As for Depth Hints, I don't know which variant you compare against, since the numbers in your paper differ from those reported in the original paper. They report the following numbers for the ResNet18, 320×1024, MS + post-processing model:
| Source | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
|---|---|---|---|---|---|---|---|
| Depth Hints paper | 0.098 | 0.702 | 4.398 | 0.183 | 0.887 | 0.963 | 0.983 |
| Your paper (their method, with pp) | 0.100 | 0.728 | 4.469 | 0.185 | 0.885 | 0.962 | 0.982 |
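For context, these are the standard KITTI depth-evaluation metrics. A minimal NumPy sketch of how they are typically computed is below; note this is an illustrative reimplementation, not the evaluation script of either paper, and the exact masking, depth capping, and scale alignment vary between papers:

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular depth metrics (Eigen-style evaluation).

    gt, pred: 1-D arrays of valid ground-truth and predicted depths (meters),
    already masked and (if needed) median-scaled -- steps omitted here.
    """
    # Threshold accuracies: fraction of pixels within a ratio of ground truth.
    thresh = np.maximum(gt / pred, pred / gt)
    a1 = (thresh < 1.25).mean()
    a2 = (thresh < 1.25 ** 2).mean()
    a3 = (thresh < 1.25 ** 3).mean()

    # Error metrics.
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```

The ordering of the returned tuple matches the column order of the table above.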
Looking forward to your response to the question about Monodepth2.
1. I think our ablation study can clarify this question: the base model, without any of our improvements, already achieves very good performance, much higher than Monodepth2. You can attribute this gain to the bigger backbone and higher input resolution (although we also use a different architecture for the depth decoder). It is not easy to improve further on such a strong base model, but our technique did. 2. The Depth Hints numbers in our paper come from Table 2, row 10 of their paper; the numbers you refer to are not on the KITTI dataset. You can see that Depth Hints already uses ResNet50 and 320×1024 input resolution (in Table 3, with pp), so I am afraid there is no need for extra experiments.
I hope this explanation helps; looking forward to further discussion.
Unfortunately, your explanation only raises more questions. Anyway, thank you for the effort. I do not see any point in further discussion, since our perception of what research is seems to be very different.
Also very curious about how FeatDepth performs with ResNet18 on 640×192 input :) Wondering if you could provide the weights?
Sorry, I don't maintain pretrained weights for this setting.
I wonder whether the reported performance comes from choosing the best weights among different training epochs according to the loss on the test set. As I know, Monodepth2 used the last epoch for its reported results. @sconlyshootery @Yevkuzn
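The distinction raised here can be made concrete with a small sketch. The function name and the per-epoch losses below are hypothetical placeholders; the point is only to contrast the two reporting policies:

```python
def select_checkpoint(losses, policy="last"):
    """Return the epoch index whose weights would be reported.

    policy="last": report the final epoch regardless of loss
                   (the protocol Monodepth2 is said to use above).
    policy="best": report the epoch with the lowest held-out loss;
                   if that held-out set is the *test* set, the
                   reported numbers are optimistically biased.
    """
    if policy == "last":
        return len(losses) - 1
    return min(range(len(losses)), key=lambda i: losses[i])

# Hypothetical per-epoch losses on some held-out split.
losses = [0.31, 0.24, 0.26, 0.22, 0.25]
print(select_checkpoint(losses, "last"))  # final epoch: index 4
print(select_checkpoint(losses, "best"))  # lowest loss: index 3
```

Under "best"-style selection against the test set, a method gets credit for its luckiest epoch, which is why the question of which protocol was used matters for comparability.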
Hi, you got some great numbers, but I have several questions regarding the comparison to Monodepth2 and Depth-VO-Feat (Zhan et al).
First of all, your depth and feature encoders are ResNet50 with input resolution of 320×1024, right? Why do you then compare to Monodepth2 with ResNet18 encoder in Table 2? Table 6 of Monodepth2 shows that changing encoder from ResNet18 to ResNet50 reduces RMSE by more than 0.22 even with lower resolution input. Is there any performance gain when compared to Monodepth2 in a fair way (same encoder, same resolution)?
Depth-VO-Feat introduces a very similar idea, but back then it was implemented on a significantly weaker baseline. How does FeatDepth compare to Depth-VO-Feat combined with a better reconstruction loss (e.g. the one from Monodepth2)?