According to Figure 1 in the paper, VL-BERT uses the features of bounding-box proposals for the Visual Feature Embedding of image regions. It seems you concatenate the whole image with the other bounding boxes in every dataset.py by setting add_image_as_a_box=True (e.g. for VQA). In my understanding, your code uses this first box (the whole image) for the Visual Feature Embedding of the words other than the image regions (the ones carrying the [IMG] token). However, it seems your code doesn't remove this full-image box afterwards, so there is one additional (full-image) box at the front of boxes?
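To make sure I'm reading it right, here is a minimal sketch of the behavior I think is happening; the function name and signature below are my own for illustration, not from your repo:

```python
import torch

def prepend_whole_image_box(boxes, image_width, image_height):
    """Prepend a box covering the full image to the region proposals.

    `boxes` is assumed to be a (num_regions, 4) tensor of
    (x1, y1, x2, y2) coordinates.
    """
    whole_image_box = boxes.new_tensor(
        [[0.0, 0.0, image_width - 1.0, image_height - 1.0]])
    # The full-image box ends up at index 0, in front of the proposal boxes,
    # and (as far as I can tell) is never removed afterwards.
    return torch.cat([whole_image_box, boxes], dim=0)

# Example: 3 region proposals in a 640x480 image -> 4 boxes after prepending.
proposals = torch.tensor([[10.0, 20.0, 100.0, 200.0],
                          [50.0, 60.0, 300.0, 400.0],
                          [5.0, 5.0, 600.0, 450.0]])
all_boxes = prepend_whole_image_box(proposals, 640, 480)
print(all_boxes.shape)  # torch.Size([4, 4])
```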
I also can't see why it makes sense to train a new joint (Token + Visual Feature) embedding for the final word [END] (e.g. in class VisualLinguisticBert). It neither uses a predefined [END] token, like [SEP] and [CLS], nor uses the full-image box as its Visual Feature Embedding. Did I miss anything here?
We do use the whole image as the first box in the implementation; it is just a design choice.
The [END] token is just a special token that marks the end of the sequence, like [SEP] in BERT. You can refer to this paper for the effect of such special tokens. And whether or not a visual embedding is added to it would not, I think, make much difference.
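For illustration, a minimal sketch of the two options being compared (a learned visual embedding for [END] vs. reusing the whole-image feature); the class, names, and sizes below are placeholders, not our actual implementation:

```python
import torch
import torch.nn as nn

class SpecialTokenVisualEmbedding(nn.Module):
    """Sketch: combine a special-token word embedding with a visual part."""

    def __init__(self, hidden_size=768, visual_size=2048, use_learned_visual=True):
        super().__init__()
        self.end_token_embedding = nn.Parameter(torch.randn(hidden_size) * 0.02)
        self.visual_proj = nn.Linear(visual_size, hidden_size)
        # Option A: a learned visual embedding for [END] (what the issue observes).
        self.learned_visual = nn.Parameter(torch.randn(hidden_size) * 0.02)
        self.use_learned_visual = use_learned_visual

    def forward(self, whole_image_feature):
        # Option B: reuse the whole-image feature as the visual part of [END].
        visual_part = (self.learned_visual if self.use_learned_visual
                       else self.visual_proj(whole_image_feature))
        return self.end_token_embedding + visual_part

emb = SpecialTokenVisualEmbedding()
whole_image_feature = torch.randn(2048)
print(emb(whole_image_feature).shape)  # torch.Size([768])
```

Either choice yields one extra input embedding for the final position, which is why we expect it to have little effect on the results.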
Thanks!
@jackroos Thanks for your prompt reply! I see. I am convinced by the first point, but for the second point I think the figure is a bit misleading :-) Maybe you would like to clarify it in the paper.
Thank you for your brilliant work.