Open raresionut1 opened 1 year ago
Hi there @raresionut1, sorry to bother you in this issue. I am also trying to use this code for classification. May I ask how you adapted the code for classification? I haven't managed to reach good accuracy with it. Thank you!
Hello, first of all congratulations on your paper; amazing work! I am currently trying to adapt your masked multimodal attention module for multimodal dialog act classification, and I have a few questions about your architecture.
As written in Section 4.2 of the paper, the sequence lengths of the audio features are not the same as those of the text features. I assume this is because BERT also takes punctuation marks and other special tokens as inputs in the sequence. Analyzing your implementation of the masked multimodal attention (found in `BertFinetun` in your code), I see that you use a padding mask; backtracking it, I found that it covers only the text data and has the shape (batch_size, 1, max_seq_length).
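Just to make sure I understand the mask correctly: with that (batch_size, 1, max_seq_length) shape, the singleton middle dimension broadcasts over the query axis, so every query ignores the padded key positions. A minimal sketch of how I read it (the function name and toy shapes here are my own, not from your code):

```python
import torch

def apply_text_padding_mask(scores, text_mask):
    """Mask padded key positions before the attention softmax.

    scores:    (batch, q_len, k_len) raw attention logits
    text_mask: (batch, 1, k_len), 1 = real token, 0 = padding
    """
    # Broadcasting over the query axis: every query row ignores padded keys.
    scores = scores.masked_fill(text_mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1)

scores = torch.zeros(2, 4, 4)            # uniform logits for illustration
text_mask = torch.tensor([[[1, 1, 1, 0]],
                          [[1, 1, 0, 0]]])
attn = apply_text_padding_mask(scores, text_mask)
# masked positions get zero attention weight; each row still sums to 1
```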
Now my question is, since the audio features have smaller sequence lengths and were padded with additional zero vectors to match the sequence lengths of the text features, why don't you use a separate padding mask for the audio features?
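To make the question concrete: assuming the audio frames are padded with all-zero vectors (as I understood from the code), a separate audio mask could be derived directly from the features, like this rough sketch (names and shapes are my own assumptions, not from your code):

```python
import torch

def audio_padding_mask(audio_feats):
    """Derive a (batch, 1, seq_len) padding mask from zero-padded audio features.

    audio_feats: (batch, seq_len, feat_dim), where padded frames are
    all-zero vectors. Caveat: a genuinely all-zero acoustic frame would
    be masked as well.
    """
    keep = (audio_feats.abs().sum(dim=-1) > 0).long()  # (batch, seq_len)
    return keep.unsqueeze(1)                           # (batch, 1, seq_len)

feats = torch.zeros(1, 5, 3)
feats[0, :3] = 1.0            # first 3 frames are real, last 2 are padding
mask = audio_padding_mask(feats)
# mask -> [[[1, 1, 1, 0, 0]]]
```

Is skipping a mask like this intentional, i.e. do the zero vectors contribute little enough to the attention scores in practice, or did I miss where the audio padding is handled?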