labmlai / annotated_deep_learning_paper_implementations

🧑‍🏫 60+ Implementations/tutorials of deep learning papers with side-by-side notes 📝; including transformers (original, xl, switch, feedback, vit, ...), optimizers (adam, adabelief, sophia, ...), gans(cyclegan, stylegan2, ...), 🎮 reinforcement learning (ppo, dqn), capsnet, distillation, ... 🧠
https://nn.labml.ai
MIT License

mha.py array shapes #262

Open erlebach opened 4 months ago

erlebach commented 4 months ago

I wonder why the array shapes in mha.py are (C, B, D) rather than (B, C, D). I thought the convention was for the batch to be the first dimension. Specifically, here are the first few lines of the forward method of class MultiHeadAttention:

    def forward(self, *,
                query: torch.Tensor,
                key: torch.Tensor,
                value: torch.Tensor,
                mask: Optional[torch.Tensor] = None):
        """
        `query`, `key` and `value` are the tensors that store
        collection of *query*, *key* and *value* vectors.
        They have shape `[seq_len, batch_size, d_model]`.      <<<<<<<<

        `mask` has shape `[seq_len, seq_len, batch_size]` and
        `mask[i, j, b]` indicates whether for batch `b`,
        query at position `i` has access to key-value at position `j`.
        """

Thanks.
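
For reference, a minimal sketch of moving between the two layouts, assuming a batch-first input. The tensor names and sizes here are made up for illustration and are not taken from mha.py. PyTorch's own `torch.nn.MultiheadAttention` also defaults to the sequence-first layout (`batch_first=False`), so `[seq_len, batch_size, d_model]` is a common convention rather than an unusual one.

```python
import torch

# Illustrative sizes only (hypothetical, not from mha.py)
seq_len, batch_size, d_model = 5, 2, 8

# Batch-first input with shape [batch_size, seq_len, d_model]
x_batch_first = torch.randn(batch_size, seq_len, d_model)

# Swap the first two axes to get [seq_len, batch_size, d_model]
x_seq_first = x_batch_first.transpose(0, 1)
print(x_seq_first.shape)

# The built-in PyTorch module expects sequence-first input by default
builtin_mha = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=2)
out, _ = builtin_mha(x_seq_first, x_seq_first, x_seq_first)
print(out.shape)  # same layout as the input: [seq_len, batch_size, d_model]

# Swap back if the rest of the model is batch-first
out_batch_first = out.transpose(0, 1)
```

The transpose returns a view, so converting at the module boundary does not copy the data.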

dingyue772 commented 1 week ago

Same question about the original code in the MultiHeadAttention class in mha.py. With the following logic, it looks like the softmax operates across the batch dimension, which I don't understand. Any help would be appreciated.

    # the definition of softmax
    self.softmax = nn.Softmax(dim=1)
    # the usage of softmax
    attn = self.softmax(scores) # here scores have a shape of [seq_q, seq_k, heads, d_k]
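
One way to check which axis `nn.Softmax(dim=1)` normalizes over is a small standalone experiment. The sketch below assumes the scores come from an einsum of the form `'ibhd,jbhd->ijbh'`, which would give shape `[seq_q, seq_k, batch_size, heads]` rather than `[seq_q, seq_k, heads, d_k]`. That is an assumption about mha.py, not something the snippet above confirms, and all tensors and sizes are hypothetical. Under that assumption, `dim=1` is the key-position axis, so each batch element is normalized independently.

```python
import torch
from torch import nn

# Illustrative sizes only (hypothetical)
seq_q, seq_k, batch_size, heads, d_k = 3, 4, 2, 2, 8

# Stand-ins for the per-head projections, shape [seq_len, batch_size, heads, d_k]
query = torch.randn(seq_q, batch_size, heads, d_k)
key = torch.randn(seq_k, batch_size, heads, d_k)

# Assumed score computation: dot product over d_k for every (i, j, b, h)
scores = torch.einsum('ibhd,jbhd->ijbh', query, key)
print(scores.shape)  # [seq_q, seq_k, batch_size, heads]

softmax = nn.Softmax(dim=1)
attn = softmax(scores)

# Summing over dim 1 (key positions) gives 1 for each query, batch and head
row_sums = attn.sum(dim=1)
print(torch.allclose(row_sums, torch.ones(seq_q, batch_size, heads)))

# Changing the scores of batch element 1 leaves batch element 0 untouched
scores_changed = scores.clone()
scores_changed[:, :, 1, :] += 5.0
attn_changed = softmax(scores_changed)
print(torch.allclose(attn_changed[:, :, 0, :], attn[:, :, 0, :]))
```

If the scores really had shape `[seq_q, seq_k, heads, d_k]`, `dim=1` would still be the second axis, `seq_k`, so in neither case would the softmax mix values across the batch.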