Hi, I have gone through your code. Very interesting work. Could you please explain what input format is expected when computing the MLM logits for caption generation? I have tried the following formats:

1. image_feature, [SEP], [MASK], [PAD] ... [PAD]
2. image_feature, [CLS], [MASK], [PAD] ... [PAD]
3. [CLS], [MASK], [PAD] ... [PAD], [SEP], image_feature; this will be run in a loop.

Which one is the correct format?
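To make the three layouts concrete, here is a minimal sketch that writes them out as token lists. It is only an illustration of the question, not the repository's actual input code: `IMG`, `build_layouts`, `num_img`, and `text_len` are hypothetical names, and the image features are shown as string placeholders rather than real feature vectors.

```python
# Hypothetical placeholder for one image-feature slot; a real model would
# use feature vectors here, not strings.
IMG = "<img>"


def build_layouts(num_img=2, text_len=4):
    """Return the three candidate layouts as lists of token strings.

    num_img and text_len are arbitrary illustrative sizes.
    """
    img = [IMG] * num_img
    # One [MASK] at the position to predict, then padding up to text_len.
    pad = ["[PAD]"] * (text_len - 1)
    return {
        1: img + ["[SEP]", "[MASK]"] + pad,
        2: img + ["[CLS]", "[MASK]"] + pad,
        3: ["[CLS]", "[MASK]"] + pad + ["[SEP]"] + img,
    }


layouts = build_layouts()
for number, tokens in layouts.items():
    print(number, tokens)
```

The three layouts differ only in which special token separates the segments and in whether the image features come before or after the text.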
Thanks!