Questions about paper COCO detection numbers

microsoft / esvit

EsViT: Efficient self-supervised Vision Transformers

MIT License


Issue #4, closed by gabrielhuang 3 years ago

gabrielhuang commented 3 years ago

Hi all,

In Table 4 of the arXiv preprint https://arxiv.org/pdf/2106.09785.pdf, the reported AP^bb for the supervised baseline is 46.0. Why is this number lower than the ones reported in the Swin paper?

See Table 2(b) of https://arxiv.org/pdf/2103.14030.pdf:
Swin-S AP^box = 51.8

Also, which object detection method are you using? Is it Mask R-CNN or Cascade Mask R-CNN? The paper does not mention the detection method used.

Thanks!

ChunyuanLI commented 3 years ago

Thanks for the question. I used Mask R-CNN as the object detection method when reporting the results in Table 4.

I think Cascade Mask R-CNN was used in Table 2(b) of the Swin paper. You may want to look at the numbers reported at the following link, where the results match.

https://github.com/SwinTransformer/Transformer-SSL