ttlmh / Bridge-Prompt

[CVPR2022] Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos

How does the split token [SEP] work? #21

Open Lycus99 opened 1 year ago

Lycus99 commented 1 year ago

Thanks for your work. While reading the code, I was wondering how the split token [SEP] works.

```python
self.cnt_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
self.sep_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
...
sep_token = self.sep_token.expand(nvids, -1, -1)
x = torch.cat((cnt_token, ord_emb, sep_token, x), dim=1)
```

From the code above, the [SEP] token simply computes attention with all of the other tokens; there is no additional "split" mechanism. It is also just a learnable parameter.

ttlmh commented 1 year ago

Thank you for your interest in our work!

The code initializes two learnable parameters, self.cnt_token and self.sep_token, both of shape (1, 1, embed_dim). self.sep_token is then expanded along the batch dimension to match the input x (nvids examples) and concatenated with cnt_token, ord_emb, and x along the second (sequence) dimension (dim=1).

By doing so, the [SEP] token acts as a delimiter between the ord_emb and x parts of the input sequence. The concatenated tensor is then passed through the transformer-based model, which can learn to attend to the different parts of the input based on the presence of the [SEP] token.
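
For readers following the thread, here is a minimal, self-contained sketch of the concatenation described above. The shapes, the batch size, and the single-token ord_emb are assumptions made for illustration, not values taken from the Bridge-Prompt repository:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, only for illustration
nvids, n_frames, embed_dim = 4, 16, 512
x = torch.randn(nvids, n_frames, embed_dim)   # frame features
ord_emb = torch.randn(nvids, 1, embed_dim)    # ordinal prompt embedding (assumed to be 1 token here)

# Two learnable tokens, shared across the batch
cnt_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
sep_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

# Broadcast the single tokens over the nvids videos
cnt = cnt_token.expand(nvids, -1, -1)         # (nvids, 1, embed_dim)
sep = sep_token.expand(nvids, -1, -1)         # (nvids, 1, embed_dim)

# Sequence layout: [CNT] | ordinal embedding | [SEP] | frame features
fused = torch.cat((cnt, ord_emb, sep, x), dim=1)
print(fused.shape)                            # torch.Size([4, 19, 512])
```

The resulting sequence is what the transformer sees; the [SEP] token is just one more position in it, with its own learnable embedding marking the boundary.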

Lycus99 commented 1 year ago


Thanks for your reply! I understand the description in your reply, but I wonder how you made sure the [SEP] token works the way you intended. From the perspective of the attention mechanism, each token is computed against all other tokens to obtain attention scores. Is it really necessary to set this [SEP] token?
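
For reference, a small sketch of the point being raised here: with plain multi-head self-attention and no attention mask, every token (including [SEP]) attends to every other token, so any "splitting" behaviour would have to be learned rather than enforced. Sizes below are made up:

```python
import torch
import torch.nn as nn

embed_dim, n_tokens = 512, 19                 # e.g. [CNT] + ord_emb + [SEP] + 16 frames
tokens = torch.randn(1, n_tokens, embed_dim)

attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
out, weights = attn(tokens, tokens, tokens)   # no attn_mask is passed

print(weights.shape)                          # (1, 19, 19): full all-to-all attention
```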

clearlove7-star commented 1 year ago


Excuse me, could you describe the fusion process in detail? I'm sorry, I don't understand why the initial value is added back here: `x = x.type(x_original.dtype) + x_original`
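
One common reading of that line, offered here only as a hedged sketch and not as the authors' confirmed explanation, is that it implements a residual connection: the fusion transformer's output is cast back to the dtype of the original features (e.g. fp16 backbone features) and added to them, so the fusion module learns an additive refinement of x_original rather than replacing it. The module and sizes below are stand-ins for illustration:

```python
import torch
import torch.nn as nn

embed_dim, n_frames = 512, 16                 # hypothetical sizes
x_original = torch.randn(2, n_frames, embed_dim).half()  # e.g. fp16 backbone frame features

# Stand-in for the fusion transformer (not the actual Bridge-Prompt module)
fusion = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=1,
)

x = fusion(x_original.float())                # run the fusion in fp32
x = x.type(x_original.dtype) + x_original     # cast back, then add the residual
```

The cast keeps the dtypes consistent before the addition, and the residual means the original features are preserved while the fusion output acts as an additive correction.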