microsoft / GLIP

Grounded Language-Image Pre-training

Some questions about this paper? #53

Open Zhangwenyao1 opened 2 years ago

Zhangwenyao1 commented 2 years ago

Thanks for your awesome work! I want to know how these regions are produced. Do you use an RPN or some other method?

Haotian-Zhang commented 2 years ago

Hi @Zhangwenyao1, thank you for your interest in the GLIP work. We don't need to produce these regions separately. The major difference is that traditional BUTD VL models may require two-stage pre-training: first pre-train a detection module, then pre-train an alignment module. GLIP instead unifies detection and grounding into a single "grounded VL understanding" pre-training task.
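As a rough illustration of what the unified grounding head does (this is a minimal sketch, not the authors' code; the names and shapes below are made up for the example), the detector's fixed-class logits are replaced by region-word alignment scores between the visual features `O` and the text token features `P`:

```python
import torch

# Illustrative sizes only: N anchors/regions, M prompt tokens, shared embedding dim d.
N, M, d = 1000, 256, 768
O = torch.randn(N, d)   # per-anchor visual features from the image side
P = torch.randn(M, d)   # contextual token features from the text side

# Instead of fixed-class logits, the head scores how well each region
# aligns with each token of the text prompt; these alignment logits are
# trained with the usual detection losses.
S_ground = O @ P.t()    # shape (N, M)
print(S_ground.shape)
```

In this view, "classification" becomes matching each candidate region to the words that describe it, which is why detection and phrase grounding can share one pre-training task.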

Zhangwenyao1 commented 2 years ago

I mean: how does GLIP V1 get region features? In my understanding, the visual encoder (CNN or Transformer) encodes the image into embeddings, but in the paper the output O of the visual encoder still looks like per-patch or per-region features.

Haotian-Zhang commented 2 years ago

To get region features, we follow DyHead (dynamic head), which is a one-stage object detector. The features are extracted from the 1x1 anchor boxes, and positive and negative anchors are defined by following ATSS.
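For intuition, here is a minimal sketch (assuming generic FPN-style dense features; the real DyHead applies its dynamic head blocks before this step) of how a one-stage detector's per-location features can be read out as per-anchor region features O:

```python
import torch

# Two toy FPN levels with 256 channels each; in a one-stage detector every
# spatial location carries an anchor, so the 256-d vector at that location
# serves as the region feature for its anchor.
feature_maps = [torch.randn(1, 256, 100, 136), torch.randn(1, 256, 50, 68)]

per_level = []
for fmap in feature_maps:
    b, c, h, w = fmap.shape
    # (B, C, H, W) -> (B, H*W, C): one feature vector per anchor location.
    per_level.append(fmap.permute(0, 2, 3, 1).reshape(b, h * w, c))

O = torch.cat(per_level, dim=1)  # (B, total_num_anchors, C) region features
print(O.shape)                   # torch.Size([1, 17000, 256])
```

So no RPN or second-stage RoI pooling is needed: each dense anchor location already has a feature vector, and ATSS decides which of those anchors count as positives or negatives during training.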