Closed weiyx16 closed 4 years ago
@weiyx16 Thank you for pointing out this small bug. Yeah, I had actually also found this bug but forgot to update the repo. Sorry, I will update the code soon. It won't make much difference in pre-training since the sequence length is less than 64 in most cases.
I have updated the code. Thanks again!
Thank you for your reply!
I found a small mistake when loading data from the multimodal datasets, for example CC. I'm not sure whether it will make a big difference. Referring to your paper, the pre-training input has the format [CLS] + Caption Tokens + [SEQ] + ROIs + [END], and I also noticed that when the input sequence is too long, the code truncates the sequence to max_len, e.g. 64, here. My question is about how the code does this text truncation.
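The pattern is roughly like this (just a sketch of what I mean, with illustrative names, not the exact code in the repo):

```python
# current behavior (sketch): the special tokens are attached before truncation
tokens = ['[CLS]'] + caption_tokens + ['[SEQ]']
ids = tokenizer.convert_tokens_to_ids(tokens)
ids = ids[:max_len]  # if the caption is long, this drops the trailing '[SEQ]'
```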
The [CLS] & [SEQ] tokens have already been added at the beginning and end of the text, so this truncation will definitely cut off the [SEQ] token. I don't know whether this small mistake matters, but to keep the input format intact, it would be better to truncate the input caption tokens first, then add [CLS] & [SEQ], and then convert to ids, something like the sketch below. Thank you in advance for your reply.
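As a rough sketch of what I mean (variable names are illustrative):

```python
# proposed order (sketch): truncate the caption first, then attach the special tokens
caption_tokens = caption_tokens[:max_len - 2]  # reserve two slots for '[CLS]' and '[SEQ]'
tokens = ['[CLS]'] + caption_tokens + ['[SEQ]']
ids = tokenizer.convert_tokens_to_ids(tokens)  # now '[SEQ]' always survives truncation
```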