Hi. I would like to ask about the att_masks for Image Captioning:
In the data loading, you already prepare the att_masks : https://github.com/microsoft/Oscar/blob/master/oscar/run_captioning.py#L324
During inference, you re-process the att_masks here: https://github.com/microsoft/Oscar/blob/master/oscar/modeling/modeling_bert.py#L658
If I understand correctly:
1) At the first timestep (when predicting the first word), you cut off the portion of the mask where the caption is (except for the first [MASK], which will generate the first word).
2) At the next timestep, you re-order the inputs and masks to [od_labels, img, caption].
3) At the following timesteps, you cut off this re-ordered mask according to the current length.
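To make sure I am reading the code right, here is a minimal NumPy sketch of the behavior I think steps 2) and 3) describe. The layout, function names, and the exact mask pattern (context fully visible, caption causal) are my own assumptions for illustration, not the actual Oscar implementation:

```python
import numpy as np

def build_reordered_mask(ctx_len, cap_len):
    """Hypothetical full mask for the reordered layout
    [od_labels + img (ctx_len positions), caption (cap_len positions)]."""
    L = ctx_len + cap_len
    mask = np.zeros((L, L), dtype=int)
    mask[:ctx_len, :ctx_len] = 1  # context tokens attend to all context tokens
    mask[ctx_len:, :ctx_len] = 1  # caption tokens attend to all context tokens
    # caption tokens attend causally to earlier caption tokens (and themselves)
    mask[ctx_len:, ctx_len:] = np.tril(np.ones((cap_len, cap_len), dtype=int))
    return mask

def step_mask(mask, ctx_len, t):
    """Step 3): keep only the context plus the first t caption slots
    (the last of which is the [MASK] being predicted this step)."""
    keep = ctx_len + t
    return mask[:keep, :keep]
```

For example, with 4 context positions and a caption budget of 3, `step_mask(build_reordered_mask(4, 3), 4, 1)` would yield the 5x5 sub-mask used when predicting the first word, assuming the slicing in steps 1) and 3) works the way I described.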
Is my understanding correct? And if so, is there any reason why you don't re-order first, then start predicting the first word, and so on?
Thanks!