thuiar / MMSA

MMSA is a unified framework for Multimodal Sentiment Analysis.

About the feature extraction of the datasets #111

Closed drewjin closed 2 months ago

drewjin commented 2 months ago

While using the BERT-processed datasets provided by MMSA, for example CMU-MOSI, I found something confusing. The data is composed of ['train', 'valid', 'test'] splits, so let's take data['train'] as an example:

train = data['train']
train['text_bert'][0]
>> array([[ 101, 1037, 2843, 1997, 6517, 3033,  102,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0],
       [   1,    1,    1,    1,    1,    1,    1,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0]])
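
For reference, here is a minimal sketch of how to unpack and decode this array with Hugging Face transformers. It assumes the three rows of text_bert follow the standard BERT input stacking of input_ids, attention_mask, and token_type_ids (which matches the values shown above: token ids, then 1s over real tokens, then all-zero segment ids), and assumes the bert-base-uncased tokenizer:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text_bert = train['text_bert'][0]  # shape (3, 50)
input_ids, attention_mask, token_type_ids = text_bert.astype(int)

# Keep only the non-padding positions before decoding back to text.
real_ids = input_ids[attention_mask == 1]
print(tokenizer.decode(real_ids.tolist()))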

As we all know, token_id 0 in the BERT tokenizer refers to [PAD], the padding token of the sequence. Thus, every padding position is supposed to map to the exact same semantic vector. However, when I print the last few vectors of the text embedding sequence, it turns out like this:

train['text'][0][-1]
>>> array([-2.08179697e-01,  1.68636113e-01,  9.52486321e-02,  1.25335917e-01,
       -1.33985206e-01,  1.50015324e-01, -3.54664534e-01,  3.44460368e-01,
        1.07902482e-01, -2.58357879e-02, -7.40882829e-02,  5.81327416e-02,...])
train['text'][0][-2]
>>> array([-1.52151436e-01,  3.07559490e-01,  1.41099229e-01,  8.55545700e-02,
       -1.30305454e-01,  1.83018059e-01, -4.52749103e-01,  4.19770420e-01,
        4.92582396e-02, -1.23697348e-01,...])

Due to the length of the vectors, I only show the first few values, but you can see they are not the same. I don't know whether I misunderstand BERT or whether the data was preprocessed in some other way. I look forward to your reply.

drewjin commented 2 months ago

I have found the answer, thanks.
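
For future readers: the attention mask only prevents the real tokens from attending to [PAD]; each padding position still receives its own position embedding and still attends to the real tokens, so the contextualized output at each [PAD] position ends up a different vector. A minimal sketch reproducing this with a stock bert-base-uncased from Hugging Face transformers (not necessarily the exact extraction script MMSA ran):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

inputs = tokenizer("a lot of", padding="max_length", max_length=10,
                   return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (10, 768)

# The last two positions are both [PAD], yet their output vectors differ,
# matching what the pickled train['text'] features show.
print(torch.allclose(hidden[-1], hidden[-2]))  # False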