microsoft / FIBER

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
MIT License

Question of Classification Result #8

Closed chaos1992 closed 1 year ago

chaos1992 commented 1 year ago

Thanks for your great work! How can I reproduce the result of Table 18 in the paper?

Looking forward to your reply!

zdou0830 commented 1 year ago

Hello, you can use the Swin Transformer code (https://github.com/microsoft/Swin-Transformer) with our pretrained weights.
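To use the FIBER checkpoint with the standalone Swin-Transformer repo, the vision-backbone weights typically need to be pulled out of the full checkpoint and their key prefix stripped so the names match a plain Swin state dict. A minimal sketch below, assuming a hypothetical `vision_model.` prefix (the actual key layout in the FIBER checkpoint may differ):

```python
# Hypothetical sketch: extract the Swin backbone weights from a full
# FIBER checkpoint so they can be loaded into a standalone Swin model.
# The "vision_model." prefix is an assumption, not FIBER's real key layout.

def extract_backbone(state_dict, prefix="vision_model."):
    """Keep only keys under `prefix`, stripping the prefix so the
    remaining names match a plain Swin Transformer state dict."""
    return {k[len(prefix):]: v
            for k, v in state_dict.items()
            if k.startswith(prefix)}

# Toy checkpoint with placeholder values standing in for tensors:
ckpt = {
    "vision_model.patch_embed.proj.weight": 0.1,
    "vision_model.layers.0.blocks.0.attn.qkv.weight": 0.2,
    "text_model.embeddings.word_embeddings.weight": 0.3,
}
backbone = extract_backbone(ckpt)
print(sorted(backbone))  # only the vision keys remain, prefix removed
```

The filtered dict would then be passed to the Swin model's `load_state_dict` in the usual PyTorch way.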

chaos1992 commented 1 year ago

@zdou0830 Sorry, one more question. What is the language input of Fig. 3(c) at inference time? That is, does FIBER need a language input when we run it on an image captioning task?

zdou0830 commented 1 year ago

For image captioning, FIBER generates tokens autoregressively as in standard seq2seq models.
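In other words, at inference time there is no fixed language input: decoding starts from a begin-of-sequence token, and each step conditions on the image plus the tokens generated so far. A minimal greedy-decoding sketch of that loop, with a toy stand-in scorer in place of FIBER's real decoder:

```python
# Minimal sketch of autoregressive greedy decoding, as in standard
# seq2seq captioning: each step scores the next token given the image
# and the tokens generated so far. `toy_next_token_scores` is a toy
# stand-in for FIBER's decoder, not its real API.

BOS, EOS = 0, 1

def toy_next_token_scores(image_features, tokens):
    # Toy "model": emits tokens 2, 3, 4 in order, then EOS.
    step = len(tokens) - 1  # tokens generated so far (excluding BOS)
    scores = [0.0] * 6
    scores[(2 + step) if step < 3 else EOS] = 1.0
    return scores

def greedy_caption(image_features, max_len=10):
    tokens = [BOS]
    for _ in range(max_len):
        scores = toy_next_token_scores(image_features, tokens)
        next_tok = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_tok)
        if next_tok == EOS:  # stop once the model emits end-of-sequence
            break
    return tokens

print(greedy_caption(None))  # [0, 2, 3, 4, 1]
```

A real captioner would replace the toy scorer with the model's forward pass and could swap greedy argmax for beam search; the conditioning structure is the same.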

chaos1992 commented 1 year ago

Got it, thanks