Open raresionut1 opened 1 year ago
Hi there @raresionut1, sorry to bother you in this issue. I am also trying to use this code for classification. May I ask how you adapted the code for classification? I haven't managed to reach good accuracy with it. Thank you!
Hello, first of all congratulations on your paper; amazing work! I am currently trying to adapt your masked multimodal attention module for multimodal dialog act classification, and I have a few questions about your architecture.
As written in Section 4.2 of the paper, the sequence lengths of the audio features are not the same as those of the text features. I assume this is because BERT also takes punctuation marks and other special tokens as inputs in the sequence. Analyzing your implementation of the masked multimodal attention (found in `BertFinetun` in your code), I see that you use a padding mask; backtracking it, I found that it covers only the text data and has the shape (batch_size, 1, max_seq_length).
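Just to make sure I understand the mask correctly: with that (batch_size, 1, max_seq_length) shape, the singleton middle dimension broadcasts over the query axis, so every query ignores the padded key positions. A minimal sketch of how I read it (the function name and toy shapes here are my own, not from your code):

```python
import torch

def apply_text_padding_mask(scores, text_mask):
    """Mask padded key positions before the attention softmax.

    scores:    (batch, q_len, k_len) raw attention logits
    text_mask: (batch, 1, k_len), 1 = real token, 0 = padding
    """
    # Broadcasting over the query axis: every query row ignores padded keys.
    scores = scores.masked_fill(text_mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1)

scores = torch.zeros(2, 4, 4)            # uniform logits for illustration
text_mask = torch.tensor([[[1, 1, 1, 0]],
                          [[1, 1, 0, 0]]])
attn = apply_text_padding_mask(scores, text_mask)
# masked positions get zero attention weight; each row still sums to 1
```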
Now my question is, since the audio features have smaller sequence lengths and were padded with additional zero vectors to match the sequence lengths of the text features, why don't you use a separate padding mask for the audio features?
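To make the question concrete: assuming the audio frames are padded with all-zero vectors (as I understood from the code), a separate audio mask could be derived directly from the features, like this rough sketch (names and shapes are my own assumptions, not from your code):

```python
import torch

def audio_padding_mask(audio_feats):
    """Derive a (batch, 1, seq_len) padding mask from zero-padded audio features.

    audio_feats: (batch, seq_len, feat_dim), where padded frames are
    all-zero vectors. Caveat: a genuinely all-zero acoustic frame would
    be masked as well.
    """
    keep = (audio_feats.abs().sum(dim=-1) > 0).long()  # (batch, seq_len)
    return keep.unsqueeze(1)                           # (batch, 1, seq_len)

feats = torch.zeros(1, 5, 3)
feats[0, :3] = 1.0            # first 3 frames are real, last 2 are padding
mask = audio_padding_mask(feats)
# mask -> [[[1, 1, 1, 0, 0]]]
```

Is skipping a mask like this intentional, i.e. do the zero vectors contribute little enough to the attention scores in practice, or did I miss where the audio padding is handled?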