hustvl / YOLOS

[NeurIPS 2021] You Only Look at One Sequence
https://arxiv.org/abs/2106.00666
MIT License

Object Detection LB #1

Closed jaideep11061982 closed 3 years ago

jaideep11061982 commented 3 years ago

❔Question

Congratulations on publishing good work. How does YOLOS perform compared to YOLOv5 and the other YOLO-series detectors, and where does it stand on the object detection leaderboard?


Yuxin-CV commented 3 years ago

Hi @jaideep11061982, thanks for your interest in our work!

As mentioned in our paper, YOLOS is not designed to be a sophisticated high-performance object detector. On the contrary, we purposefully make as few modifications as possible to a given pre-trained ViT / DeiT, in order to precisely unveil the versatility and transferability of the Transformer from image recognition to object detection. So in a sense, our paper is more about the Transformer than about object detection.
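To make the "as few modifications as possible" point concrete, here is a shape-level sketch of the YOLOS adaptation described in the paper: replace ViT's single [CLS] token with 100 learnable [DET] tokens appended to the patch sequence, keep the encoder unchanged, and attach small class/box heads to the [DET] outputs. The dimensions below are illustrative, and the backbone and heads are stubbed with random weights; this is not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the paper's exact configs): hidden dim 192
# (DeiT-tiny-like), 100 detection tokens, 91 COCO classes.
hidden_dim, num_det_tokens, num_classes = 192, 100, 91
num_patches = 14 * 14  # e.g. a 224x224 image with 16x16 patches

# 1. Patch embeddings would come from the pre-trained ViT backbone
#    (stubbed here with random values).
patch_tokens = rng.standard_normal((num_patches, hidden_dim))

# 2. YOLOS appends 100 learnable [DET] tokens to the patch sequence,
#    replacing the single [CLS] token used for image classification.
det_tokens = rng.standard_normal((num_det_tokens, hidden_dim))
sequence = np.concatenate([patch_tokens, det_tokens], axis=0)
assert sequence.shape == (num_patches + num_det_tokens, hidden_dim)

# 3. The unmodified Transformer encoder processes the whole sequence
#    (identity stub here; the real model runs self-attention blocks).
encoded = sequence

# 4. Only the [DET] token outputs feed the detection heads: class
#    logits (+1 for "no object") and normalized boxes (cx, cy, w, h).
det_out = encoded[-num_det_tokens:]
W_cls = rng.standard_normal((hidden_dim, num_classes + 1))
W_box = rng.standard_normal((hidden_dim, 4))
class_logits = det_out @ W_cls
boxes = 1 / (1 + np.exp(-(det_out @ W_box)))  # sigmoid -> [0, 1] coords

print(class_logits.shape, boxes.shape)  # (100, 92) (100, 4)
```

Each [DET] token ends up binding to at most one object, which is what lets a plain encoder-only ViT emit a set of detections without region proposals or dense anchors.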

We argue that 2D object detection is quite a hard task for a naive Transformer, since ViT always does seq2seq modeling, which means ViT tries to perceive a higher-dimensional visual signal from a lower-dimensional (sequence) perspective. Nevertheless, we observe that ViT can accomplish this task.
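The "higher-dimensional signal seen from a sequence perspective" can be illustrated with the standard ViT patchify step: a 2D image is cut into non-overlapping patches and flattened into a 1D token sequence, discarding the explicit 2D grid structure. A minimal NumPy sketch (ViT-style 224x224 input with 16x16 patches; the sizes are assumptions for illustration):

```python
import numpy as np

# A toy "image": 3 channels, 224x224, split into 16x16 patches.
C, H, W, P = 3, 224, 224, 16
image = np.arange(C * H * W, dtype=np.float32).reshape(C, H, W)

# (C, H, W) -> (H/P * W/P, C*P*P): each 16x16 patch becomes one flat
# vector, and the explicit 2D layout is gone -- the Transformer sees
# only a 1D sequence of 196 patch tokens.
patches = (
    image.reshape(C, H // P, P, W // P, P)
         .transpose(1, 3, 0, 2, 4)
         .reshape((H // P) * (W // P), C * P * P)
)
print(patches.shape)  # (196, 768)
```

Any notion of 2D locality the detector needs must then be recovered by self-attention (plus positional embeddings) over this flat sequence, which is why we consider the task hard for a naive Transformer.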

Transformers are known to benefit from very large models and very large-scale pre-training. In our paper, we only use the mid-sized ImageNet-1k as the pre-training dataset, and the largest model we study has 128M parameters. Whether object detection results can benefit from the excellent scalability of Transformers is an interesting open question.

Yuxin-CV commented 3 years ago

We believe we have answered your question, so I'm closing this issue, but let us know if you have further questions.