Closed: chaos1992 closed this issue 1 year ago
Hello, you can use the Swin Transformer code (https://github.com/microsoft/Swin-Transformer) with our pretrained weights.
@zdou0830 Sorry for another question. What is the language input in Fig. 3(c) when running inference with FIBER? That is, does FIBER require a language input when performing the image captioning task?
For image captioning, FIBER generates tokens autoregressively as in standard seq2seq models.
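The seq2seq-style generation mentioned above can be sketched as a greedy autoregressive loop. This is a minimal illustration, not FIBER's actual API: the `step` function is a hypothetical stand-in for a real decoder that scores the next token given the image features and the tokens generated so far.

```python
# Hypothetical token IDs for illustration; a real tokenizer defines these.
BOS, EOS = 0, 5

def step(image_features, tokens):
    # Dummy stand-in for a decoder forward pass: a real model would
    # return the argmax over the vocabulary. Here we just count up.
    return tokens[-1] + 1

def generate_caption(image_features, max_len=20):
    # Standard greedy autoregressive decoding: start from BOS and
    # feed each generated token back in until EOS (or max length).
    tokens = [BOS]
    for _ in range(max_len):
        next_token = step(image_features, tokens)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

print(generate_caption(None))  # [0, 1, 2, 3, 4, 5]
```

In practice, beam search or sampling can replace the greedy `step`, but the feedback loop stays the same.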
Got it, thanks
Thanks for your great work! How can I reproduce the results in Table 18 of the paper?
Looking forward to your reply!