roger-tseng / av-superb

A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models (ICASSP 2024)
https://av.superbbenchmark.org/
44 stars 4 forks

How to finetune an audio-video model like Mavil on a custom audio-only dataset? #1

Closed: hellowangqian closed this issue 3 months ago

hellowangqian commented 3 months ago

Dear authors,

Thanks for sharing such great work, from which I've learned a lot. I have a question about finetuning: is it possible to finetune the AV model (particularly MAViL, as it performs the best) on an audio-only dataset? If so, could you please give a few instructions for doing this? Looking forward to hearing from you.

Best,
Q

roger-tseng commented 3 months ago

Hello,

You can consider only finetuning the audio encoder by running run_downstream.py with the -s audio_feats and --upstream_trainable options.

To use a custom dataset, you can add a folder under the downstream_tasks directory with your own code. (see this directory for a simple example)

To use data with audio only, you can just use zero tensors for video, as they will not affect the audio encoder anyway in your case. (as done in this file)
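For concreteness, here is a minimal sketch of what such an audio-only dataset could look like. The (audio, video, label) return convention, sample rate, and dummy video shape are assumptions modeled on the Kinetics-Sounds example; check the linked files for the exact interface expected by run_downstream.py:

```python
import torch
import torchaudio
from torch.utils.data import Dataset

class AudioOnlyDataset(Dataset):
    def __init__(self, filepaths, labels, sample_rate=16000, num_video_frames=16):
        self.filepaths = filepaths
        self.labels = labels
        self.sample_rate = sample_rate
        self.num_video_frames = num_video_frames

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        wav, sr = torchaudio.load(self.filepaths[idx])
        if sr != self.sample_rate:
            wav = torchaudio.functional.resample(wav, sr, self.sample_rate)
        audio = wav.mean(dim=0)  # collapse to mono, shape (num_samples,)
        # Zero tensor stands in for the missing video stream; it never
        # reaches the audio encoder, so its contents are irrelevant here.
        video = torch.zeros(self.num_video_frames, 3, 224, 224)
        return audio, video, self.labels[idx]
```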

Alternatively, if you prefer using your own pipeline, you can use something like this to add the model into your pipeline. Remember to change model.eval() to model.train() for finetuning.
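As a rough sketch of that route (the hub entrypoint name "mavil", the forward signature, and the feature dimension below are guesses on my part; check hubconf.py and extract_feats.py for the real interface):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

num_classes = 10  # example value; set to your dataset's label count

# Hypothetical entrypoint name; see hubconf.py for the actual one.
model = torch.hub.load("roger-tseng/av-superb", "mavil")
model.train()  # switch from model.eval() when finetuning

head = nn.Linear(768, num_classes)  # 768 is an assumed feature dimension
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(head.parameters()), lr=1e-5
)

dataloader = DataLoader(dataset, batch_size=8)  # e.g. the Dataset sketched above
for audio, video, labels in dataloader:
    feats = model(audio, video)  # forward signature assumed; see extract_feats.py
    loss = nn.functional.cross_entropy(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```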

Lastly, the finetuned MAViL checkpoint we used in our paper is available here. It is finetuned with audio+video data on the entirety of AudioSet.

Let me know if you run into any issues! There are some details that aren't documented yet, such as the -s audio_seq_feats option for frame-level MAViL features.

hellowangqian commented 3 months ago

Thank you very much for your quick reply. The instructions you provided are very helpful; I will follow them and try them out. Just one quick question for now: what's the difference between the checkpoint you mentioned here and the one I can download from the Google Drive link hardcoded in the hubconf.py file under MAViL? Which one is better, in general, for finetuning on a custom audio-only dataset? Thank you in advance.


roger-tseng commented 3 months ago

The one in hubconf.py is self-supervised pretraining only, whereas the one I linked here is self-supervised pretraining + supervised finetuning.

We found that after supervised finetuning, results improved a lot for general audio tasks, but worsened a little for speech tasks (Table 2 of our paper).

hellowangqian commented 3 months ago

Got it, many thanks and have a nice day!


hellowangqian commented 3 months ago

Hello, one more question about the results on AS-20K. In your paper (Table 2), the mAP values for audio, video, and fusion are 28.3, 20.9, and 39.1 respectively, but the original MAViL paper (https://arxiv.org/pdf/2212.08071) reports 41.8, 24.8, and 44.4. Why is there such a significant gap between them (particularly for the audio-only case)? Thanks.

roger-tseng commented 3 months ago

The large gap is mostly due to our simplified finetuning pipeline, which is designed for representation benchmarking rather than state-of-the-art performance: we only finetune the parameters of a small prediction head, whereas the original MAViL finetunes the whole model. This can be remedied by enabling the --upstream_trainable option in run_downstream.py.

However, the original MAViL paper also applies many finetuning tricks to boost performance, such as SpecAugment, mixup, balanced sampling, etc., which are not implemented here. If your goal is to boost performance on your custom dataset, you may consider starting from the finetuned MAViL checkpoint I linked above and using it as a PyTorch model (as shown in this file) in your own training pipeline with those finetuning tricks applied.
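If it helps, here is a minimal sketch of two of those tricks (SpecAugment-style masking and mixup) applied to a batch of log-mel spectrograms; the mask sizes and beta parameter are illustrative, not the values used by MAViL:

```python
import torch
import torchaudio.transforms as T

freq_mask = T.FrequencyMasking(freq_mask_param=24)
time_mask = T.TimeMasking(time_mask_param=96)

def augment(spec: torch.Tensor, targets: torch.Tensor, alpha: float = 0.5):
    """spec: (batch, n_mels, time); targets: (batch, n_classes) multi-hot."""
    spec = time_mask(freq_mask(spec))  # SpecAugment-style frequency/time masking
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(spec.size(0))
    spec = lam * spec + (1 - lam) * spec[perm]            # mixup on inputs...
    targets = lam * targets + (1 - lam) * targets[perm]   # ...and on targets
    return spec, targets
```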

hellowangqian commented 3 months ago

Thank you very much for your constructive replies.