shvdiwnkozbw / SMTC

Code for Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

About the label propagation result #6

Open arunos728 opened 4 months ago

arunos728 commented 4 months ago

Following the updated source code, I've tested both the ImageNet-pretrained ViT (DINO) and the uploaded pretrained SMTC model. The pretrained SMTC model reaches 65.2 J&F, but the ImageNet-pretrained ViT reaches 65.9 J&F, outperforming the SMTC model.

I've also tried to reproduce the result by training the architecture on YouTube-VOS (start_twostage.sh, 7K iterations), but it only reaches 63.5 J&F, which is lower than 65.2 J&F. The weird thing is that when I trained the architecture for 15K iterations, the result dropped to 55.4 J&F even though the loss was decreasing well.

Considering that the ImageNet-pretrained ViT reaches 65.9 J&F, there is no improvement in label propagation in any case. Could you explain why this happens? Is it still related to the batch size? I've used 64 x 2 (total 128) for the batch size.

shvdiwnkozbw commented 4 months ago

Thanks for your feedback. We rechecked it and found two reasons for this phenomenon.

(1) We are sorry that there is a bug in evaluating the ViT label propagation results. Among the compared methods, the models with a ResNet backbone have a downsample stride of 8, but the ViT backbone has a stride of 16. We mistakenly evaluated ViT-S/16 using groundtruth masks downsampled by a factor of 8. This is why the ImageNet-pretrained ViT reaches 65.9 J&F, surpassing the performance reported in the original DINO paper.

(2) The relaxed valid-sample filtering standard can let some low-quality instances into training, which hurts dense correspondence learning. To address this limitation, we have updated the code to freeze the ViT encoder of the teacher model to stabilize training. We observe that in this setting, with a total of 10K training iterations, we achieve stable label propagation performance and surprisingly high object discovery performance on DAVIS-2017-UVOS: 44.8 J&F.
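To make the stride mismatch in (1) concrete, here is a minimal sketch (the helper name is hypothetical, not the repo's actual evaluation code) of why the mask downsample rate must match the backbone stride:

```python
def feature_grid_size(height, width, stride):
    """Spatial size of the backbone feature map for a given input.

    The compared ResNet backbones have a downsample stride of 8,
    while ViT-S/16 has a stride of 16.  Label-propagation masks must
    be downsampled by the *same* stride as the feature map, or
    predictions and groundtruth live on different grids.
    """
    return height // stride, width // stride

# A 480x864 frame yields a 60x108 grid at stride 8,
# but only a 30x54 grid at stride 16 -- evaluating ViT-S/16
# against stride-8 masks compares mismatched resolutions.
vit_grid = feature_grid_size(480, 864, stride=16)     # (30, 54)
resnet_grid = feature_grid_size(480, 864, stride=8)   # (60, 108)
```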

arunos728 commented 4 months ago

Thanks for the answer. Following answer (1), I've tested both the pretrained ViT and the pretrained SMTC after changing the downsample stride from 8 to 16 (args.mapscale (8,8) -> (16,16)). The ViT result was 59.2 J&F and the SMTC result was 60.8 J&F. Both results seem lower than expected (the original ViT result was 61.8 J&F); could you update the evaluation code properly? In addition, it would be much better if you could upload the pretrained weights for the new setting described in answer (2).

shvdiwnkozbw commented 4 months ago

Hi, thanks for your report on the problems in label propagation. We carefully rechecked the evaluation code and fixed the bugs that led to the performance gap.

(1) The originally uploaded evaluation code did not check the height/width of the input images. Some images in DAVIS-2017 have dimensions that are not multiples of the patch size, so there is information loss in the patch embedding process (the convolution stride of the ViT patch embedding equals the patch size, and it uses no padding). It is necessary to resize the images so that their height/width are multiples of the patch size, and this operation significantly improves the performance of ViT models. The reproduced result for the DINO-pretrained ViT-S/16 is much higher than reported in the original paper.

(2) For the ViT-S/16 results reported in our paper, we performed label propagation at a downsample ratio of 8 by mistake. We have uploaded a new version to arXiv that reports results under both the downsample-ratio-8 and downsample-ratio-16 settings, along with the newly reproduced DINO results. In the new version, the gap between DINO and our tuned model becomes negligible. We believe there is much potential to further improve the correspondence capacity of ViT models; for example, under the revised evaluation setting (adding a resize operation to prevent information loss in patch embedding), DINO ViT-S/8 is comparable to state-of-the-art CNN models with downsample ratio 8.
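As an illustration of point (1), a minimal sketch (the helper name is an assumption, not the repo's API) of snapping image dimensions to the nearest multiple of the patch size before patch embedding:

```python
import math

def snap_to_patch_multiple(height, width, patch_size=16):
    """Round spatial dims up to the nearest multiple of patch_size.

    The ViT patch embedding is a convolution with stride == patch_size
    and no padding, so any remainder pixels (height % patch_size,
    width % patch_size) are silently discarded.  Resizing the image to
    these dimensions first avoids that information loss.
    """
    new_h = math.ceil(height / patch_size) * patch_size
    new_w = math.ceil(width / patch_size) * patch_size
    return new_h, new_w

# DAVIS-2017 frames are typically 480x854: 854 is not a multiple
# of 16, so 6 columns of pixels would be dropped without resizing.
target = snap_to_patch_multiple(480, 854)  # -> (480, 864)
```

In practice the returned dimensions would be passed to an interpolation call (e.g. a bilinear resize) before feeding the frame to the ViT.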

The new model weights are available at https://drive.google.com/file/d/162dtjPXQ2r4lghg6W5Vu8x2lRj0EJmtU/view?usp=drive_link, with 64.0/67.6 J&F score on DAVIS-2017 semi-supervised, and 44.8 J&F on DAVIS-2017 unsupervised.