Thanks a lot for your paper and code!
In your implementation, you don't set an attention mask for the text sequence, either in the textTransformer layers or in the LinearTemporalCrossAttention layers. Why doesn't this affect the results? The related code is below, followed by a sketch of the kind of padding mask I had in mind.
```python
def encode_text(self, text, device):
    with torch.no_grad():
        text = clip.tokenize(text, truncate=True).to(device)
        x = self.clip.token_embedding(text).type(self.clip.dtype)  # [batch_size, n_ctx, latent_dim]
        x = x + self.clip.positional_embedding.type(self.clip.dtype)
        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.clip.transformer(x)
        x = self.clip.ln_final(x).type(self.clip.dtype)
```
```python
    # T, B, D

class LinearTemporalCrossAttention(nn.Module):
    ...  # (rest of the class omitted in this excerpt)
```
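To clarify what I mean by an attention mask for the text sequence, here is a minimal sketch. It assumes a key padding mask built from the zero entries that `clip.tokenize` pads with, passed to a plain `nn.MultiheadAttention` cross-attention; `make_text_padding_mask` and the toy shapes are hypothetical and not taken from your code.

```python
import torch
import torch.nn as nn
import clip

# Hypothetical helper (not from the repository): mark the zero-padded
# positions that clip.tokenize leaves after the end-of-text token.
def make_text_padding_mask(tokens: torch.Tensor) -> torch.Tensor:
    # tokens: [batch_size, n_ctx]; True = padding position to be ignored.
    return tokens == 0


if __name__ == "__main__":
    device = "cpu"
    tokens = clip.tokenize(["a person walks forward", "jump"], truncate=True).to(device)  # [B, n_ctx]
    key_padding_mask = make_text_padding_mask(tokens)  # [B, n_ctx], bool

    # Toy cross-attention: motion features (queries) attend to text features (keys/values).
    latent_dim, n_heads = 512, 8
    attn = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)
    motion_feat = torch.randn(2, 196, latent_dim)            # [B, T_motion, D]
    text_feat = torch.randn(2, tokens.shape[1], latent_dim)  # [B, n_ctx, D]

    # With key_padding_mask, attention weights on padded text tokens are zeroed out.
    out, _ = attn(motion_feat, text_feat, text_feat, key_padding_mask=key_padding_mask)
    print(out.shape)  # torch.Size([2, 196, 512])
```

My question is essentially whether skipping such a mask (so the padded positions can still be attended to) is intentional, or whether it simply doesn't matter for the results in practice.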