phlippe / uvadlc_notebooks

Repository of Jupyter notebook tutorials for teaching the Deep Learning Course at the University of Amsterdam (MSc AI), Fall 2023
https://uvadlc-notebooks.readthedocs.io/en/latest/
MIT License

Specifying the mask in Tutorial 6 (MHA) #46

Closed · StolkArjen closed this 2 years ago

StolkArjen commented 2 years ago

Tutorial: 6

Describe the bug: This is more of a clarification question than a bug. First of all, thanks for the excellent tutorial documentation. It's been very clear overall.

The reason I'm reaching out is to ask whether a little more explanation could be provided on how and where to build and apply the key padding mask to the attention weights. Specifically, I have a tensor of the form [True True True False False] for every sequence in the batch ([Batch, SeqLen]), with False marking padding tokens.

However, the scaled_dot_product function shown below expects the mask to have dimensions [Batch, Head, SeqLen, SeqLen]. To this end, I have simply expanded the key padding mask along the row dimension (using key_padding_mask.view(bsz, 1, 1, seqlen).expand(-1, num_heads, seqlen, -1)), which yields the following square [SeqLen, SeqLen] mask for each sequence:

[[True True True False False], [True True True False False], [True True True False False], [True True True False False], [True True True False False]]

I do this somewhere upstream, in the forward definition of TransformerPredictor. The same mask is then fed all the way down to scaled_dot_product, where it is used to mask out padding tokens by setting attn_logits to -9e15 wherever the mask is False. However, in contrast to a previous attempt using length-normalized sequences, the model does not manage to learn. This makes me wonder whether the above implementation is really how the masking was meant to be used. Am I missing anything important here?

def scaled_dot_product(q, k, v, mask=None):
    d_k = q.size()[-1]
    attn_logits = torch.matmul(q, k.transpose(-2, -1))
    attn_logits = attn_logits / math.sqrt(d_k)
    if mask is not None:
        attn_logits = attn_logits.masked_fill(mask == 0, -9e15)
    attention = F.softmax(attn_logits, dim=-1)
    values = torch.matmul(attention, v)
    return values, attention
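
For reference, here is a minimal, self-contained version of how I build and apply the expanded mask (the sizes are made up for illustration):

import math
import torch
import torch.nn.functional as F

# Toy sizes, just for illustration
bsz, num_heads, seqlen, d_k = 2, 4, 5, 8

# [Batch, SeqLen] key padding mask: True = real token, False = padding
key_padding_mask = torch.tensor([[True, True, True, False, False],
                                 [True, True, True, True,  False]])

# Expand to [Batch, Head, SeqLen, SeqLen] by repeating along the head and row dimensions
mask = key_padding_mask.view(bsz, 1, 1, seqlen).expand(-1, num_heads, seqlen, -1)

q = torch.randn(bsz, num_heads, seqlen, d_k)
k = torch.randn(bsz, num_heads, seqlen, d_k)
v = torch.randn(bsz, num_heads, seqlen, d_k)

values, attention = scaled_dot_product(q, k, v, mask=mask)
print(attention[0, 0])  # columns of padded positions receive (near-)zero attention weight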

phlippe commented 2 years ago

Hi @StolkArjen, you are using the mask for padding as intended. It is surprising to me that this prevents your model from learning; in a quick check a few minutes ago, the code seemed to work as intended with padding. What task are you trying to solve? A classification task or a sequence-to-sequence prediction task?

StolkArjen commented 2 years ago

Hi @phlippe, thanks for the swift response and confirming my approach. This is indeed a classification task, where the model is trying to classify a sequence of movements on a 2D digital game board based on structure within the sequence.

Your response motivated me to try a number of different representations of the data. It turns out the model learns fine, with the help of pad masking, when the variable-length sequences consist of continuous timepoints that are regularly spaced in time (scenario 1).

scenario 1: [[x, y (0 ms)], [x, y (10 ms)], [x, y (20 ms)], ...], where x and y can have the same value across multiple timepoints (spaced 10 ms apart, though the model doesn't know that)

However, if the sequences are organized into more discrete timestamps, with every timestamp representing a single movement and accompanied by a time feature (scenario 2), the model cannot learn the structure.

scenario 2: [[400, x, y], [250, x, y], [300, x, y], ...], where x and y change with every movement and the first feature indicates the time spent at the corresponding location on the game board
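
For concreteness, toy versions of the two representations (with made-up values) would look like this:

import torch

# Scenario 1: regularly spaced timepoints; time is implicit in the position
# within the sequence (10 ms per step), features are just (x, y)
seq_scenario1 = torch.tensor([[3., 5.],    # 0 ms
                              [3., 5.],    # 10 ms (same board position)
                              [4., 5.]])   # 20 ms

# Scenario 2: one row per movement; the first feature is the time spent at
# that board position, followed by (x, y)
seq_scenario2 = torch.tensor([[400., 3., 5.],
                              [250., 4., 5.],
                              [300., 4., 6.]])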

Both data representation types require padding-based masking, leaving the continuous vs. discrete representation as the only difference between the two. I also tried turning off positional encoding for scenario #2, but to no avail.

In sum, your implementation works fine (sorry for any possible confusion), and our implementation might need some timeseries-specific modification to work in scenario 2. If you happen to have any intuition on this, I'd be happy to try things out.

StolkArjen commented 2 years ago

I'll leave these code snippets here in case anyone is interested in using the mask:

in the Dataset class:

def _pad_sequence(self):
    """
    Pads the sequences to maximum length and creates a Boolean mask with False in place of padding tokens.
    The False tokens will be suppressed in scaled_dot_product 
    """
    lengths        = [len(row) for row in self.data]
    varlen         = np.std(lengths) > 0 # True or False
    pad_idx        = 99999
    if varlen:
        maxlen         = max(lengths)
        nfeats         = self.data[0].size(1)
        print("Padding sequences to a length of " + str(maxlen) + " tokens")
        for t, d in enumerate(self.data):
            numpads        = maxlen - len(d)
            padding        = torch.Tensor([pad_idx]).repeat(numpads, nfeats)
            self.data[t]   = torch.cat((d, padding))
        self.key_padding_mask = torch.tensor(([[s[-1] != pad_idx for s in seq] for seq in self.data])) # [Batch, SeqLen]

Before calling the trainer:

if hasattr(dataset, 'key_padding_mask'):
    mask = dataset.key_padding_mask.view(len(train_set.indices), 1, 1, maxlen).expand(-1, nheads, maxlen, -1) # [Batch, Head, SeqLen, SeqLen]
else:
    mask = None
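
For anyone trying this out, here is a self-contained toy run of the same two steps (padding plus mask expansion), with made-up sizes:

import torch

# Toy data: two sequences of different length, each [SeqLen, nfeats]
data    = [torch.randn(3, 2), torch.randn(5, 2)]
pad_idx = 99999
nheads  = 4
maxlen  = max(len(d) for d in data)

# Same padding steps as in _pad_sequence above
for t, d in enumerate(data):
    numpads = maxlen - len(d)
    padding = torch.Tensor([pad_idx]).repeat(numpads, d.size(1))
    data[t] = torch.cat((d, padding))
key_padding_mask = torch.tensor([[s[-1] != pad_idx for s in seq] for seq in data])  # [Batch, SeqLen]

# Same expansion as above, yielding [Batch, Head, SeqLen, SeqLen]
mask = key_padding_mask.view(len(data), 1, 1, maxlen).expand(-1, nheads, maxlen, -1)
print(key_padding_mask)  # tensor([[ True,  True,  True, False, False],
                         #         [ True,  True,  True,  True,  True]])
print(mask.shape)        # torch.Size([2, 4, 5, 5])
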
phlippe commented 2 years ago

Hi @StolkArjen, thanks for sharing more details and your code for padding the sequences! In your scenario 2, how do you put the time as a feature into the model? Not sure if you have already done it, but remember to normalize each feature (mean 0 and std 1), since the model's initialization expects all features to be in that range. Huge values like 400 would potentially create issues in the attention blocks and overpower any signal coming from $x$ and $y$. Alternatively, you could try to change the position encoding to encode the overall time step in the time series instead of the element position in the sequence. Depending on your maximum length, you might need to finetune the position encoding formula (the div_term in the module PositionalEncoding).
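
As a rough sketch of what I mean by normalizing (assuming your sequences are stacked into a single [Batch, SeqLen, 3] training tensor with time, x, y as features, and that padding is added afterwards):

import torch

# Stand-in for the real training data: [Batch, SeqLen, 3] with (time, x, y)
train_data = torch.rand(32, 5, 3) * 400

# Per-feature mean and std over all tokens in the training set
flat      = train_data.reshape(-1, train_data.size(-1))  # [Batch*SeqLen, 3]
feat_mean = flat.mean(dim=0)                             # [3]
feat_std  = flat.std(dim=0)                              # [3]

# Standardize to mean 0, std 1; reuse the same statistics for validation/test data
train_data = (train_data - feat_mean) / (feat_std + 1e-8)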

StolkArjen commented 2 years ago

Awesome, normalizing the time features solved it! Thanks once more for your support, @phlippe. The model is behaving as expected.

StolkArjen commented 1 year ago

Hi @phlippe,

I hope this message finds you well. Your support in the past has allowed us to successfully train sequence classifiers for a while now. Given this, I wondered if we could probe your thoughts on a new challenge we are facing.

Specifically, our movement sequences are part of a larger interaction sequence. We have been treating each movement sequence as independent, but now recognize that what happens in one movement sequence may be dependent on what occurred in preceding sequences. We are therefore exploring the possibility of training our model on the entire interaction sequence. The main challenge, however, is how to organize the labels for this purpose.

An interaction sequence consists of 80 movement sequences of variable length, each with a corresponding label. One possibility we are considering is to organize these labels for sequence-to-sequence prediction. This approach involves creating a single, interpolated label vector with the same length as the interaction sequence. Another approach would entail classifying the movement sequences in the interaction sequence using multiple, parallel loss functions.

As you can see, we are unsure of how to proceed and would greatly appreciate any thoughts or advice you might have on this matter.

Best regards, Arjen

StolkArjen commented 1 year ago

Hi @phlippe,

Just checking in to see if you've had a moment to look at this. No worries if not. We would much appreciate your thoughts.

Best, Arjen

phlippe commented 1 year ago

Hi @StolkArjen, sorry for the delay, I didn't see your first post and just got to it now. Let me first verify that I fully understand your setup. Each interaction sequence is a sequence of 80 movement sequences with variable length. In other words, you have a sequence-of-sequences. As labels, you have one label per movement sequence. Thus, you have 80 labels per interaction sequence. You want information from movement sequences to be used for movement sequences that come later in the interaction sequence. Is that correct?

In this case, I would first consider the dependency you expect between the movement sequences:

  1. Do you expect that some global information from the first movement sequence is useful for classifying the remaining sequences? In that case, putting everything together into one sequence seems quite expensive, since transformers scale quadratically with the sequence length. I would instead follow previous approaches to document classification, which use two levels of attention: one within each individual movement sequence, and one across the movement sequences, e.g. between the classification tokens of the different movement sequences (if needed, you can use a causal mask there). This way, you keep the computational cost manageable while still allowing information exchange between movement sequences (see the rough sketch after this list).
  2. Or do you instead expect that an individual time step in the first movement sequence is relevant to the second or later sequences? If so, putting everything together into one sequence seems the most reasonable. You can then use a sequence-to-sequence loss on the classification tokens.
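
To make option 1 a bit more concrete, here is a rough sketch of the two-level structure (the class and argument names are made up, and I use PyTorch's built-in nn.TransformerEncoder for brevity; the same idea works with the tutorial's modules):

import torch
import torch.nn as nn

class TwoLevelClassifier(nn.Module):
    """Illustrative two-level attention model: attention within each movement
    sequence, then attention across the per-sequence summaries."""

    def __init__(self, input_dim, model_dim, num_classes, num_heads=4, num_layers=2):
        super().__init__()
        self.input_net = nn.Linear(input_dim, model_dim)
        # Level 1: attention within a single movement sequence
        within_layer = nn.TransformerEncoderLayer(model_dim, num_heads, batch_first=True)
        self.within_encoder = nn.TransformerEncoder(within_layer, num_layers)
        # Level 2: attention across the movement-sequence summaries
        across_layer = nn.TransformerEncoderLayer(model_dim, num_heads, batch_first=True)
        self.across_encoder = nn.TransformerEncoder(across_layer, num_layers)
        self.classifier = nn.Linear(model_dim, num_classes)

    def forward(self, x, key_padding_mask=None):
        # x: [Batch, NumMoveSeqs, SeqLen, input_dim]
        # key_padding_mask: [Batch, NumMoveSeqs, SeqLen], True = real token
        B, M, L, _ = x.shape
        h = self.input_net(x).reshape(B * M, L, -1)
        # PyTorch expects True = position to ignore, hence the inversion
        pad = None if key_padding_mask is None else ~key_padding_mask.reshape(B * M, L)
        h = self.within_encoder(h, src_key_padding_mask=pad)
        # Summarize each movement sequence (plain mean for simplicity; a masked
        # mean or a dedicated classification token per sequence would be cleaner)
        summary = h.mean(dim=1).reshape(B, M, -1)   # [Batch, NumMoveSeqs, model_dim]
        # Information exchange across movement sequences (add a causal mask here if needed)
        summary = self.across_encoder(summary)
        return self.classifier(summary)             # one prediction per movement sequence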

Let me know if you have any follow-up questions. Hope that helps a bit!

Best, Phillip

ssadhukha commented 1 year ago

Hi @phlippe, I'm @StolkArjen 's PhD student, working on this project. Thanks for your response and for sharing your ideas re: our modeling strategy. First, re: your first suggestion, yes we do expect to see that some global information of say, the first movement sequence is useful to classify the remaining sequences. We expect these dependencies to build over time. We’ll certainly take a look at how document classification, using two levels of attention is implemented and how it can be adapted for our purposes. So far, we have been treating each trial as our unit of analysis, so there isn’t any cross-trial dependencies in our modeling strategy (yet).

Re: your second suggestion, yes, we do expect that the particular patterns of movement revealed by the individual time steps as a whole (as in, across the sequence) might be relevant for later sequences, but this is something we have yet to explore.

Ultimately, because we anticipate that behavior in future trials is highly dependent on past ones, we’re interested in modeling the unique dynamics in the game and how the trial-by-trial dependencies change over time.

Thanks again for your input!