hgaiser opened this issue 1 year ago
This is actually cool. Given that SwinTransformer and ViT are already offered as stable models, both could serve as backbones for ViTDet.
The major challenge would be reproducing the reported metrics with the torchvision reference scripts. Porting weights is not ideal (we did that for SwinTransformer, I believe, but it's better if we can train ourselves).
Bandwidth might be something @NicolasHug can answer :smile:
@fmassa is there any interest for this?
🚀 The feature
ViTDet achieves very strong results on COCO and, given that ViT is already implemented, it seems relatively straightforward to implement in torchvision.
Motivation, pitch
The best-performing object detection network currently in torchvision is FasterRCNN with a resnet50 backbone (46.7 mAP). ViTDet reports 51.6 mAP with a ViT-B backbone, 55.6 with ViT-L, and an impressive 56.7 with ViT-H. Similarly impressive results are reported for instance segmentation.
Alternatives
Detectron2 implements ViTDet. It could be decided that torchvision will not provide its own implementation and instead redirect users who want ViTDet to Detectron2.
Additional context
Implementing ViTDet opens the door to follow-up implementations such as EVA-02, which reports even better results than ViTDet.
I have previously implemented RetinaNet for torchvision (later merged in https://github.com/pytorch/vision/pull/2784). I might be interested in implementing ViTDet, but I would first like to know whether the maintainers are interested.