Open Lycus99 opened 1 year ago
Thank you for your interest in our work!
The code initializes two learnable parameters, self.cnt_token and self.sep_token, both of shape (1, 1, embed_dim). The self.sep_token parameter is then expanded along the batch dimension to match the nvids examples in the input x, and concatenated with self.cnt_token, ord_emb, and x along the token dimension (dim=1).
By doing so, the [SEP] token acts as a delimiter between the ord_emb and x parts of the input feature vector. The resulting concatenated tensor can then be passed through the transformer-based model, which can learn to attend to the different parts of the input based on the presence of the [SEP] token.
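For concreteness, the expansion and concatenation described above can be sketched as follows (a minimal, self-contained example; nvids, embed_dim, and the sequence lengths are placeholder values, not taken from the repository):

```python
import torch
import torch.nn as nn

embed_dim, nvids, seq_len = 8, 2, 5

# Learnable [CNT] and [SEP] tokens, each of shape (1, 1, embed_dim)
cnt_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
sep_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

# Dummy ordinal embedding and input features (shapes assumed for illustration)
ord_emb = torch.randn(nvids, 3, embed_dim)
x = torch.randn(nvids, seq_len, embed_dim)

# Expand the single-example tokens across the batch, then concatenate
# along the token dimension (dim=1): [CNT] | ord_emb | [SEP] | x
cnt = cnt_token.expand(nvids, -1, -1)
sep = sep_token.expand(nvids, -1, -1)
out = torch.cat((cnt, ord_emb, sep, x), dim=1)

print(out.shape)  # (nvids, 1 + 3 + 1 + seq_len, embed_dim)
```

The [SEP] token then sits at a fixed position between ord_emb and x in every example of the batch.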
Thanks for your reply! I understand your description, but I wonder how you made sure the [SEP] token works the way you intended. From the perspective of the attention mechanism, each token attends to all other tokens to obtain attention scores. Is it necessary to add this [SEP] token at all?
Excuse me, can you describe the fusion process in detail?
I'm sorry, I don't understand why the initial value is added back here:
```python
x = x.type(x_original.dtype) + x_original
```
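This line looks like a standard residual connection, with a dtype cast so the branch output matches the saved input (useful when the branch runs in a different precision, e.g. under mixed precision). A minimal sketch of that pattern, with hypothetical names not taken from the repository:

```python
import torch

def block_with_residual(x, layer):
    """Apply `layer`, then add the input back (residual connection).

    Casting the branch output to the input's dtype keeps the sum in the
    original precision even if `layer` computes in a different dtype.
    """
    x_original = x
    x = layer(x)
    return x.type(x_original.dtype) + x_original

layer = torch.nn.Linear(4, 4)
out = block_with_residual(torch.randn(2, 4), layer)
print(out.shape)  # same shape as the input
```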
Thanks for your work. When I read the code, I wondered how the split token [SEP] works.
```python
self.cnt_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
self.sep_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
...
sep_token = self.sep_token.expand(nvids, -1, -1)
x = torch.cat((cnt_token, ord_emb, sep_token, x), dim=1)
```
From the code above, the [SEP] token computes attention with all other tokens, without any additional "split" mechanism, and it is also a learnable parameter.
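To see that a learnable [SEP] token indeed participates in self-attention like any other position, here is a small sketch using nn.MultiheadAttention with placeholder shapes (this is an illustration, not the repository's actual model):

```python
import torch
import torch.nn as nn

embed_dim, nvids = 8, 2
sep_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

# Hypothetical ordinal embeddings (3 tokens) and features (5 tokens)
ord_emb = torch.randn(nvids, 3, embed_dim)
x = torch.randn(nvids, 5, embed_dim)
sep = sep_token.expand(nvids, -1, -1)

# [SEP] ends up at index 3, between ord_emb and x
seq = torch.cat((ord_emb, sep, x), dim=1)  # (nvids, 9, embed_dim)

attn = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)
out, weights = attn(seq, seq, seq)

# weights has shape (nvids, 9, 9): every token, including [SEP],
# attends to every position; nothing in the math treats it specially.
sep_attention = weights[:, 3]  # attention distribution of the [SEP] token
print(sep_attention.shape)
```

So the "split" is not enforced by the attention computation itself; rather, the token is a learned marker whose embedding the model can use as a boundary signal during training.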