visipedia / ssw60

Sapsucker Woods 60 Audiovisual Dataset
MIT License

Inquiry about the audio augmentation and evaluation settings for audio-visual mid-fusion #1

Open · Rick-Xu315 opened this issue 2 years ago

Rick-Xu315 commented 2 years ago

Hi, thanks for your work on the AV FGC task. I'd like to ask about some experimental details in your paper:

  1. In Section 4.1 (Audio Modality) you average logits over multiple spectrograms at evaluation time, but in the Audio-Visual Fusion part you probably cannot use the same averaging strategy for mid-fusion, since the two settings conflict. How do you process the audio in the mid-fusion experiments? Do you use only one spectrogram there? If so, are the uni-modal and multi-modal results comparable?
  2. For mid-fusion you use MBT as a SOTA fusion method. Did you ever try simpler mid-fusion methods such as concatenation, summation, or gating?
  3. For the audio augmentation, could you include the relevant code when you release your experiment pipeline? I would appreciate that.
rui1996 commented 2 years ago

Thank you for your interest in our work! Regarding your questions:

  1. It is true that in multimodal fusion, one video input can only interact with one audio spectrogram input. However, we can take multiple views of the video and multiple views of the audio spectrogram, fuse one pair at a time, and finally average the logits. This is what we do, and we consider the results comparable (see the sketch after the masking code below).
  2. Since we use a transformer backbone, MBT or token concatenation would be the most straightforward choices. The MBT paper shows that concatenation is not as good as MBT, so we simply adopted MBT.
  3. Yes, we will release the code. For your question, the augmentation code looks like this:
import numpy as np

# SpecAugment-style masking on a (frequency x time) spectrogram array.
# These are methods of our augmentation class, hence the `self` argument.

def freq_masking(self, img, freq_factor=1.0, mask_len=15):
    # With probability `freq_factor`, zero out a random band of frequency bins
    # whose width is drawn uniformly from [0, mask_len).
    factor = np.random.RandomState().rand()
    freq_len = img.shape[0]
    if factor <= freq_factor:
        start = np.random.randint(0, freq_len - mask_len)
        interval = np.random.randint(0, mask_len)
        img[start : start + interval, :] = 0
    return img

def time_masking(self, img, time_factor=1.0, mask_len=15):
    # With probability `time_factor`, zero out a random span of time frames
    # whose length is drawn uniformly from [0, mask_len).
    factor = np.random.RandomState().rand()
    time_len = img.shape[1]
    if factor <= time_factor:
        start = np.random.randint(0, time_len - mask_len)
        interval = np.random.randint(0, mask_len)
        img[:, start : start + interval] = 0
    return img
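
To make point 1 concrete, here is a minimal sketch of the multi-view evaluation (this is not the released code; `model`, `video_views`, and `audio_specs` are placeholder names). During training, each spectrogram view would first go through freq_masking / time_masking above.

import numpy as np

def multiview_fused_logits(model, video_views, audio_specs):
    # Pair the i-th video view with the i-th audio spectrogram view, score each
    # pair with the fused (mid-fusion) model, and average the per-pair logits.
    logits = [model(v, a) for v, a in zip(video_views, audio_specs)]
    return np.mean(np.stack(logits, axis=0), axis=0)

# Final prediction for one clip:
# prediction = int(np.argmax(multiview_fused_logits(model, video_views, audio_specs)))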
Rick-Xu315 commented 2 years ago

Thanks for your reply! I would also like to ask another question: in Table 3 of your paper, the audio ResNet-18 gets lower performance after fine-tuning on video-audio. I see a similar result after fine-tuning a concatenation-based AV model built from pretrained uni-modal models and then linear-probing the audio backbone. I would like to hear your thoughts on why the audio backbone gets worse after fine-tuning. Many thanks!
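
For clarity, the linear-probe step I am describing is roughly the sketch below; `extract_features` stands in for a forward pass through the frozen audio backbone, and the scikit-learn classifier is just a placeholder choice for the linear head.

import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_audio(extract_features, train_set, val_set):
    # Freeze the fine-tuned audio backbone; `extract_features(x)` returns its
    # pooled feature vector for one audio input. Only the linear classifier
    # below is trained, then evaluated on the held-out split.
    X_tr = np.stack([extract_features(x) for x, y in train_set])
    y_tr = np.array([y for x, y in train_set])
    X_va = np.stack([extract_features(x) for x, y in val_set])
    y_va = np.array([y for x, y in val_set])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr, y_tr)
    return clf.score(X_va, y_va)  # linear-probe accuracy of the audio backbone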