thuiar / Cross-Modal-BERT

CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis (MM 2020)

Question regarding sequence lengths and padding masks #19

Open raresionut1 opened 1 year ago

raresionut1 commented 1 year ago

Hello, first of all congratulations on your paper; amazing work! I am currently trying to adapt your masked multimodal attention module for multimodal dialog act classification, and I have a few questions regarding your architecture.

As written in Section 4.2 of the paper, the sequence lengths of the audio features are not the same as those of the text features. I assume this is because BERT also takes punctuation marks and other special tokens as inputs in the sequence. Analyzing your implementation of the masked multimodal attention (in BertFinetun in your code), I see that you use a padding mask; tracing it back, I found that it covers only the text data and has the shape (batch_size, 1, max_seq_length).
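Just to make sure I read the code correctly, here is a minimal sketch of how I understand that text padding mask. The function name and exact details are mine, not necessarily what BertFinetun does internally; the assumption is that the mask comes from BERT's attention_mask and is added to the attention scores before the softmax:

```python
import torch

def make_text_padding_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: attention_mask is (batch_size, max_seq_length),
    with 1 for real tokens and 0 for padding. Returns an additive mask of
    shape (batch_size, 1, max_seq_length) that is 0 at valid positions and
    a large negative value at padded positions."""
    mask = attention_mask[:, None, :].to(torch.float32)   # (B, 1, L)
    return (1.0 - mask) * -10000.0

# Assumed usage inside the attention (shapes are illustrative):
# scores = scores + make_text_padding_mask(attention_mask)  # scores: (B, L, L)
# weights = torch.softmax(scores, dim=-1)
```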

Now my question is: since the audio features have shorter sequence lengths and were padded with additional zero vectors to match the sequence length of the text features, why don't you use a separate padding mask for the audio features, along the lines of the sketch below?
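For concreteness, this is the kind of audio padding mask I have in mind. It is not part of the released code; it assumes the true (pre-padding) audio lengths are available and builds an additive mask in the same form as the text one, so queries cannot attend to the zero-padded audio frames:

```python
import torch

def make_audio_padding_mask(audio_lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Hypothetical helper: audio_lengths is (batch_size,), the number of real
    audio frames per sample before zero-padding. Returns an additive mask of
    shape (batch_size, 1, max_len): 0 at real frames, -10000 at padded frames."""
    positions = torch.arange(max_len, device=audio_lengths.device)[None, :]  # (1, T)
    valid = (positions < audio_lengths[:, None]).to(torch.float32)           # (B, T)
    return (1.0 - valid)[:, None, :] * -10000.0                              # (B, 1, T)

# Assumed usage: add it to the attention scores over the audio keys
# scores_av = scores_av + make_audio_padding_mask(audio_lengths, scores_av.size(-1))
```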

paraanggotaforum commented 1 year ago

Hi there @raresionut1, sorry to bother you in this issue. I am also trying to use this code for classification. May I ask how you adapted the code for your classification task? I don't seem to be reaching a good accuracy with it. Thank you!