Closed: xuezhongcailian closed this issue 1 year ago.
**xuezhongcailian:** Hello author, how should I understand the Causal Attention Module? I read its code, but I don't quite follow these lines:

```python
x_masked = x_unfold * self.mask.to(x_unfold.device)
attn = (q @ k.transpose(-2, -1))  # BHW, num_heads, PP, PP
```

Is this consistent with the figure below from the paper?
![image](https://user-images.githubusercontent.com/31470730/185176784-29d53376-fbff-4ab5-b33c-930d56bc6233.png)

**Reply (repo author):** Sorry for the late reply. The CAM version in this repo is slightly different from the paper: we use the "unfold" operation to first get BxB (5x5) blocks, and then the attention map is computed within each block. For causality, the 0/1 mask is multiplied on the input instead of on the attention map.
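To make the described mechanism concrete, below is a minimal sketch (not the repository's actual module) of block-wise attention where a 0/1 mask is multiplied on the input rather than on the attention logits. The per-pixel 5x5 window (stride-1 unfold with padding), the raster-order mask, and the layer sizes (`dim`, `num_heads`) are illustrative assumptions inferred from the shapes in the quoted comment (`BHW, num_heads, PP, PP`).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BlockCausalAttention(nn.Module):
    """Sketch only: self-attention inside unfolded BxB windows, with causality
    enforced by a 0/1 mask multiplied on the INPUT (not on the attention map)."""

    def __init__(self, dim=64, num_heads=4, block=5):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.block = block
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # 0/1 mask over the P = block*block positions of a window: 1 for the
        # centre pixel and everything before it in raster order, 0 for "future" pixels.
        P = block * block
        mask = torch.zeros(P)
        mask[: P // 2 + 1] = 1.0
        self.register_buffer("mask", mask.view(1, P, 1))      # broadcasts over (N*H*W, P, C)

    def forward(self, x):                                      # x: (N, C, H, W)
        N, C, H, W = x.shape
        B = self.block
        P = B * B
        # one BxB window per pixel -> (N, C*P, H*W), then (N*H*W, P, C)
        x_unfold = F.unfold(x, kernel_size=B, padding=B // 2)
        x_unfold = x_unfold.view(N, C, P, H * W).permute(0, 3, 2, 1).reshape(N * H * W, P, C)
        # causality: zero out "future" positions of the input itself
        x_masked = x_unfold * self.mask
        qkv = self.qkv(x_masked).reshape(N * H * W, P, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                   # each: (N*H*W, heads, P, C/heads)
        attn = (q @ k.transpose(-2, -1)) * self.scale          # (N*H*W, heads, P, P)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(N * H * W, P, C)
        out = self.proj(out[:, P // 2])                        # keep the centre position of each window
        return out.view(N, H, W, C).permute(0, 3, 1, 2)        # back to (N, C, H, W)


if __name__ == "__main__":
    # Quick shape check with hypothetical sizes.
    cam = BlockCausalAttention(dim=64, num_heads=4, block=5)
    y = cam(torch.randn(2, 64, 16, 16))
    print(y.shape)  # torch.Size([2, 64, 16, 16])
```

One consequence of masking the input rather than the logits is that "future" positions still receive softmax weight, but their keys and values are (near-)zero, so they contribute nothing to the output; whether this matches the repo exactly depends on details not quoted in the thread.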