Closed gourango01 closed 3 years ago
Encoder only, and we use a very simple method: [MASK] all tokens, then predict them all in one pass. You could easily beat the existing result by adding a proper decoder.
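The "[MASK] all then predict" scheme described above could be sketched roughly as follows. This is a toy illustration only: the `encoder` stub, vocabulary, and shapes are hypothetical stand-ins, not the actual Kaleido-BERT implementation.

```python
# Sketch of "[MASK] all then predict" captioning with an encoder-only
# model. The encoder here is a random stub; a real setup would use the
# trained Kaleido-BERT encoder attending jointly over image patches
# and the masked text tokens.
import numpy as np

VOCAB = ["[PAD]", "[MASK]", "a", "red", "dress", "with", "lace"]
MASK_ID = VOCAB.index("[MASK]")

def encoder(image_patches, token_ids, rng):
    # Stand-in for the trained encoder: returns per-position logits
    # over the vocabulary.
    return rng.normal(size=(len(token_ids), len(VOCAB)))

def caption_by_mask_all(image_patches, max_len=5, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialise the whole caption as [MASK] tokens.
    token_ids = [MASK_ID] * max_len
    # Step 2: a single forward pass predicts every position at once
    # (no autoregressive decoder involved).
    logits = encoder(image_patches, token_ids, rng)
    pred_ids = logits.argmax(axis=-1)
    return [VOCAB[i] for i in pred_ids]

patches = np.zeros((16, 8))  # dummy image patches
print(caption_by_mask_all(patches))
```

The key point is that all caption positions are filled in simultaneously from the masked input, rather than generated left-to-right by a decoder.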
During fine-tuning on the image captioning task, did you use any pre-training task (e.g., AKPM, TIM, or AMLM) alongside the fashion captioning objective, i.e., given an image (a sequence of image patches generated by the "Kaleido Patch Generator"), predict the corresponding caption?
Thanks for sharing this interesting work. Could you please share how "Kaleido-BERT" was fine-tuned on the captioning task? Did you use a separate decoder for generation, or the "Kaleido-BERT" encoder only?