According to Figure 1 in the paper, VL-BERT uses the features of bounding-box proposals for the Visual Feature Embedding of image regions. It seems you concatenate the whole image with the other bounding boxes in every dataset.py by setting add_image_as_a_box=True (e.g. for VQA). In my understanding, your code uses this first box (the whole image) for the Visual Feature Embedding of the words other than the image regions (the ones carrying the [IMG] token). However, it seems your code doesn't remove this full-image box afterwards, so there is one additional (full-image) box at the front of boxes?
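To make sure I'm reading it right, here is a minimal sketch of the behavior I think is happening; the function name and signature below are my own for illustration, not from your repo:

```python
import torch

def prepend_whole_image_box(boxes, image_width, image_height):
    """Prepend a box covering the full image to the region proposals.

    `boxes` is assumed to be a (num_regions, 4) tensor of
    (x1, y1, x2, y2) coordinates.
    """
    whole_image_box = boxes.new_tensor(
        [[0.0, 0.0, image_width - 1.0, image_height - 1.0]])
    # The full-image box ends up at index 0, in front of the proposal boxes,
    # and (as far as I can tell) is never removed afterwards.
    return torch.cat([whole_image_box, boxes], dim=0)

# Example: 3 region proposals in a 640x480 image -> 4 boxes after prepending.
proposals = torch.tensor([[10.0, 20.0, 100.0, 200.0],
                          [50.0, 60.0, 300.0, 400.0],
                          [5.0, 5.0, 600.0, 450.0]])
all_boxes = prepend_whole_image_box(proposals, 640, 480)
print(all_boxes.shape)  # torch.Size([4, 4])
```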
I also can't see why it makes sense to train a new joint (Token + Visual Feature) embedding for the final word [END] (e.g. in class VisualLinguisticBert). It neither uses a predefined [END] token, like [SEP] and [CLS], nor uses the full-image box as its Visual Feature Embedding. Did I miss anything here?
We do use the whole image as the first box in the implementation; it is just a design choice.
The [END] token is just a special token that marks the end of the sequence, like [SEP] in BERT. You can refer to this paper for the effect of such special tokens. And whether or not a visual embedding is added to it would not, I think, make much difference.
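For illustration, a minimal sketch of the two options being compared (a learned visual embedding for [END] vs. reusing the whole-image feature); the class, names, and sizes below are placeholders, not our actual implementation:

```python
import torch
import torch.nn as nn

class SpecialTokenVisualEmbedding(nn.Module):
    """Sketch: combine a special-token word embedding with a visual part."""

    def __init__(self, hidden_size=768, visual_size=2048, use_learned_visual=True):
        super().__init__()
        self.end_token_embedding = nn.Parameter(torch.randn(hidden_size) * 0.02)
        self.visual_proj = nn.Linear(visual_size, hidden_size)
        # Option A: a learned visual embedding for [END] (what the issue observes).
        self.learned_visual = nn.Parameter(torch.randn(hidden_size) * 0.02)
        self.use_learned_visual = use_learned_visual

    def forward(self, whole_image_feature):
        # Option B: reuse the whole-image feature as the visual part of [END].
        visual_part = (self.learned_visual if self.use_learned_visual
                       else self.visual_proj(whole_image_feature))
        return self.end_token_embedding + visual_part

emb = SpecialTokenVisualEmbedding()
whole_image_feature = torch.randn(2048)
print(emb(whole_image_feature).shape)  # torch.Size([768])
```

Either choice yields one extra input embedding for the final position, which is why we expect it to have little effect on the results.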
Thanks!
@jackroos Thanks for your prompt reply! I see. I am convinced by the first point, but for the second point I think the figure is a bit misleading :-) Maybe you would like to clarify it in the paper.
Thank you for your brilliant work.