Hi. I would like to ask about the att_masks for Image Captioning:
In the data loading, you already prepare the att_masks : https://github.com/microsoft/Oscar/blob/master/oscar/run_captioning.py#L324
During inference, you re-process the att_masks here: https://github.com/microsoft/Oscar/blob/master/oscar/modeling/modeling_bert.py#L658
If I understand correctly:
1) At the first timestep (when predicting the first word), you cut off the portion of the mask where the caption is (except for the first [MASK], which will generate the first word).
2) At the next timestep, you re-order the inputs and masks to [od_labels, img, caption].
3) At the following timesteps, you cut off this re-ordered mask according to the current length.
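To make sure I am reading the code right, here is a minimal NumPy sketch of the behavior I think steps 2) and 3) describe. The layout, function names, and the exact mask pattern (context fully visible, caption causal) are my own assumptions for illustration, not the actual Oscar implementation:

```python
import numpy as np

def build_reordered_mask(ctx_len, cap_len):
    """Hypothetical full mask for the reordered layout
    [od_labels + img (ctx_len positions), caption (cap_len positions)]."""
    L = ctx_len + cap_len
    mask = np.zeros((L, L), dtype=int)
    mask[:ctx_len, :ctx_len] = 1  # context tokens attend to all context tokens
    mask[ctx_len:, :ctx_len] = 1  # caption tokens attend to all context tokens
    # caption tokens attend causally to earlier caption tokens (and themselves)
    mask[ctx_len:, ctx_len:] = np.tril(np.ones((cap_len, cap_len), dtype=int))
    return mask

def step_mask(mask, ctx_len, t):
    """Step 3): keep only the context plus the first t caption slots
    (the last of which is the [MASK] being predicted this step)."""
    keep = ctx_len + t
    return mask[:keep, :keep]
```

For example, with 4 context positions and a caption budget of 3, `step_mask(build_reordered_mask(4, 3), 4, 1)` would yield the 5x5 sub-mask used when predicting the first word, assuming the slicing in steps 1) and 3) works the way I described.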
Is my understanding correct? And if so, is there any reason why you don't re-order first, then start predicting the first word, and so on?
Thanks!