jackroos / VL-BERT

Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".
MIT License

Missing the [SEP] token when truncating over-long sequences #28

Closed weiyx16 closed 4 years ago

weiyx16 commented 4 years ago

I found a small mistake when loading data from the multimodal datasets, for example Conceptual Captions (CC). I'm not sure whether it makes a big difference. Referring to your paper, the pre-training input has the format [CLS] + caption tokens + [SEP] + ROIs + [END]. I also noticed that when the input sequence is too long, the code truncates it to max_len, e.g. 64, here. My question is: when the code performs the text truncation, like

# 'text' already includes [CLS] at the start and [SEP] at the end,
# so slicing off the tail can drop the trailing [SEP] token.
text = text[:text_len_keep]
mlm_labels = mlm_labels[:text_len_keep]

At this point [CLS] and [SEP] have already been added at the beginning and end of the text, so this truncation will definitely drop the [SEP] token. I don't know whether this small mistake matters, but to preserve the input format it would be better to truncate the raw caption tokens first, then add [CLS] and [SEP], and then convert to ids. Thank you in advance for your reply.
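For reference, here is a minimal sketch of the truncation order I mean. The function and variable names (build_text_input, max_text_len, the -1 "not an MLM target" label) are only illustrative assumptions, not the actual identifiers in this repo:

```python
# Hypothetical sketch: truncate the raw caption tokens first, then add the
# special tokens, so [CLS] and [SEP] always survive truncation.
def build_text_input(caption_tokens, mlm_labels, max_text_len, tokenizer):
    # Reserve two positions for [CLS] and [SEP].
    keep = max_text_len - 2
    caption_tokens = caption_tokens[:keep]
    mlm_labels = mlm_labels[:keep]

    tokens = ['[CLS]'] + caption_tokens + ['[SEP]']
    labels = [-1] + mlm_labels + [-1]  # -1 = position is not an MLM target (assumed convention)

    ids = tokenizer.convert_tokens_to_ids(tokens)
    return ids, labels
```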

jackroos commented 4 years ago

@weiyx16 Thank you for pointing out this small bug. Actually, I had also found it but forgot to update the repo. Sorry, I will update the code soon. It shouldn't make much difference in pre-training, since the sequence length is less than 64 in most cases.

jackroos commented 4 years ago

I have updated the code. Thanks again!

weiyx16 commented 4 years ago

Thank you for your reply!