salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

Question about VQA answer tokenizer #91

Closed katrina433 closed 2 years ago

katrina433 commented 2 years ago

I noticed that an eos token ([SEP]) is appended to each answer when the VQA dataset is created here. Since BertTokenizer already appends a [SEP] token to the end of each input text, is there a reason an additional eos token is added to each answer? The tokenized input_ids of each answer end with two 102s (the sep_token_id).

LiJunnan1992 commented 2 years ago

Hi, we have customized BertTokenizer so that it does not automatically add [SEP] after the text: https://github.com/salesforce/ALBEF/blob/main/models/tokenization_bert.py
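To illustrate the difference, here is a minimal sketch in plain Python. It does not call the real tokenizer: `build_inputs` and `encode_answer` are hypothetical stand-ins for the special-token step, the answer id `2748` is a placeholder, and `101` is assumed as the [CLS] id (as in bert-base-uncased). Only `102` as the sep_token_id comes from the question above. The sketch appends the eos token at the id level, whereas the dataset code appends it to the answer before tokenization.

```python
CLS_ID = 101  # assumed [CLS] id (bert-base-uncased)
SEP_ID = 102  # [SEP] id, the "102" mentioned in the question


def build_inputs(token_ids, append_sep):
    """Hypothetical stand-in for the tokenizer's special-token step."""
    ids = [CLS_ID] + list(token_ids)
    if append_sep:
        # Stock behaviour described in the question: [SEP] is added automatically.
        ids.append(SEP_ID)
    return ids


def encode_answer(answer_ids, append_sep):
    """Mimic the dataset adding an explicit eos ([SEP]) to each answer."""
    return build_inputs(list(answer_ids) + [SEP_ID], append_sep)


answer_ids = [2748]  # placeholder id for a one-word answer

stock = encode_answer(answer_ids, append_sep=True)
custom = encode_answer(answer_ids, append_sep=False)

print(stock)   # [101, 2748, 102, 102]  (two trailing 102s)
print(custom)  # [101, 2748, 102]       (single eos from the dataset)
```

With automatic [SEP] insertion disabled, the eos token added by the dataset is the only one, so the duplicate seen in the question should not occur when the repository's own tokenizer is used.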

katrina433 commented 2 years ago

I see, thanks for the clarification!