sconlyshootery / FeatDepth

This is the official code for the method described in the paper "Feature-metric Loss for Self-supervised Learning of Depth and Egomotion".
MIT License

Evaluation and performance #8

Closed Yevkuzn closed 3 years ago

Yevkuzn commented 3 years ago

Hi, you got some great numbers, but I have several questions regarding the comparison to Monodepth2 and Depth-VO-Feat (Zhan et al).

First of all, your depth and feature encoders are ResNet50 with an input resolution of 320×1024, right? Why do you then compare to Monodepth2 with a ResNet18 encoder in Table 2? Table 6 of Monodepth2 shows that changing the encoder from ResNet18 to ResNet50 reduces RMSE by more than 0.22 even with a lower-resolution input. Is there any performance gain when compared to Monodepth2 in a fair way (same encoder, same resolution)?

Depth-VO-Feat introduces a very similar idea, but back then it was implemented on top of a significantly weaker baseline. How does FeatDepth compare to Depth-VO-Feat equipped with a better reconstruction loss (e.g. the one from Monodepth2)?

sconlyshootery commented 3 years ago

Monodepth2 and Depth-VO-Feat are only 2 of our 20+ baselines; there are plenty of other methods that use high-resolution inputs and big models. For example, Depth Hints can be considered an improved version of Monodepth2.

Yevkuzn commented 3 years ago

I am afraid that is not quite what I asked. A fair comparison to Monodepth2 is important for two reasons: 1) it is one of the best-performing methods trained on monocular videos; 2) your model is basically Monodepth2 plus the proposed improvement, so a fair comparison is essential to see the effect of the proposed method.

I don't want to sound rude, but from the point of view of an experienced reader, your performance gain seems to be the result of a bigger encoder. I am really curious why the reviewers at ECCV did not insist on a comparison to Monodepth2 with a ResNet50 encoder and 320x1024 input resolution.

As for Depth-Hints, I don't know which variant of Depth-Hints you compare to, since the numbers in your paper differ from those reported in the original paper. They report the following numbers for the ResNet18, 320x1024, MS + post-processing model:

| | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25^2 | δ<1.25^3 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Depth-Hints paper | 0.098 | 0.702 | 4.398 | 0.183 | 0.887 | 0.963 | 0.983 |
| Your paper (their method with pp) | 0.100 | 0.728 | 4.469 | 0.185 | 0.885 | 0.962 | 0.982 |
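For reference, a minimal sketch of how these seven metrics are conventionally computed in this line of work (the function below is illustrative, not the repository's evaluation code), assuming NumPy arrays of valid ground-truth and predicted depths:

```python
import numpy as np

def compute_depth_metrics(gt, pred):
    """gt, pred: 1-D arrays of valid ground-truth and predicted depths (metres)."""
    # Threshold accuracies: fraction of pixels whose ratio error is below 1.25^k
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = (thresh < 1.25).mean()
    d2 = (thresh < 1.25 ** 2).mean()
    d3 = (thresh < 1.25 ** 3).mean()

    # Error metrics
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3
```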

Looking forward to your response to the question about Monodepth2.
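As an aside on terminology: "pp" above refers to test-time post-processing, which in this line of work usually means running the network on both the image and its horizontal mirror and blending the two disparity maps. A minimal sketch of a simple averaged variant (`predict_disp` is a hypothetical callable; real implementations may blend with edge-aware weights rather than a plain average):

```python
import numpy as np

def post_process(image, predict_disp):
    """Flip-based test-time post-processing ("pp"), simple averaged variant.

    image: H x W x 3 array; predict_disp returns an H x W disparity map.
    """
    disp = predict_disp(image)                   # prediction on the original image
    disp_flipped = predict_disp(image[:, ::-1])  # prediction on the mirrored image
    disp_flipped = disp_flipped[:, ::-1]         # un-flip so it aligns with the original
    return 0.5 * (disp + disp_flipped)           # blend the two estimates
```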

sconlyshootery commented 3 years ago

1. I think our ablation study can clarify this question: the base model without any of the proposed improvements already achieves very good performance, much higher than Monodepth2. You can consider this gain to come from a bigger backbone and higher input resolution (although we use a different architecture for the depth decoder). It is not easy to make further improvements on such a high-performing base model, but our technique really did.
2. The Depth Hints numbers in our paper are from Table 2, row 10, of their paper; the numbers you refer to are not obtained using the KITTI dataset. You can see that Depth Hints already uses ResNet50 and a 320x1024 input resolution (in Table 3, with pp), so I am afraid there is no need for extra experiments.

I hope this explanation can help, looking forward to further discussion.

Yevkuzn commented 3 years ago

Unfortunately, your explanation only raises more questions. Anyway, thank you for the effort. I do not see any point in further discussion, since our perception of what research is seems to be very different.

HanzhiC commented 3 years ago

Also very curious about how FeatDepth performs with ResNet18 on 640x192 input :) I wonder if you could provide the weights?

sconlyshootery commented 3 years ago

Sorry, I don't maintain pretrained weights for this setting.

haoweiz23 commented 2 years ago

I wonder whether the reported performance was obtained by choosing the best weights from different training epochs according to the loss on the test set? As far as I know, Monodepth2 used the last epoch for its reported results. @sconlyshootery @Yevkuzn