The type_vocab_size of the pretrained model should be 2, right? But it shows 3. To my understanding, there are only two types in pretraining: one for text and one for images. Am I missing something?
In some tasks the text contains two sentences, for example the question and answer in VQA and VCR, so we use different segment embeddings for them, following BERT. That gives three segment types in total: two for the text segments and one for the image.
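A minimal sketch of how the three token types could be laid out, assuming ids 0 and 1 for the two text segments and 2 for the image regions (the exact index assignment in the actual codebase may differ; the lengths below are made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sequence lengths, purely for illustration.
question_len, answer_len, num_regions = 6, 4, 10

# Assumed layout: 0 = first sentence (question), 1 = second sentence (answer),
# 2 = image regions. The real model may order these differently.
token_type_ids = torch.cat([
    torch.full((question_len,), 0, dtype=torch.long),
    torch.full((answer_len,), 1, dtype=torch.long),
    torch.full((num_regions,), 2, dtype=torch.long),
])

# type_vocab_size == 3 because the embedding table needs one row per type.
token_type_embeddings = nn.Embedding(num_embeddings=3, embedding_dim=768)
segment_emb = token_type_embeddings(token_type_ids)
print(segment_emb.shape)  # torch.Size([20, 768])
```

For single-sentence tasks only two of the three rows would be used, but the table is still sized for the two-sentence case, which is why the config reports 3.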