nttcslab / byol-a

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation
https://arxiv.org/abs/2103.06695

Finetuning of BYOL-A #4

Closed mschiwek closed 3 years ago

mschiwek commented 3 years ago

Hi,

your paper is super interesting. I have a question regarding the downstream tasks. If I understand the paper correctly, you used a single linear layer for the downstream tasks, whose only input is the sum of the mean and the max of the representation over time.
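For reference, the pooling scheme described above can be sketched in a few lines of NumPy. The shapes and sizes here (96 time frames, 2048-dim embeddings, 10 classes) are illustrative assumptions, not the repository's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame-level embeddings from the pretrained encoder:
# shape (time_frames, embedding_dim).
frames = rng.standard_normal((96, 2048))

# Pool over time: sum of mean and max, as described above.
pooled = frames.mean(axis=0) + frames.max(axis=0)   # shape (2048,)

# A single linear layer on top of the pooled vector
# (the frozen-encoder / linear-evaluation setting).
n_classes = 10
W = rng.standard_normal((n_classes, pooled.shape[0])) * 0.01
b = np.zeros(n_classes)
logits = W @ pooled + b
print(logits.shape)  # (10,)
```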

Did you try to finetune BYOL-A end-to-end on the downstream tasks after pretraining? In the case of TRILL, they were able to improve performance even further by finetuning the whole model end-to-end. Is there a specific reason why this is not possible with BYOL-A?

daisukelab commented 3 years ago

Hi @mschiwek, thank you for enjoying our paper!

Yes, I made a few attempts to finetune the BYOL-A model end-to-end on the downstream tasks, but I didn't continue at the time. There's no technical obstacle to finetuning; I just didn't pursue it further.
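Just to show what I mean by "no technical obstacle", end-to-end finetuning only requires passing the encoder's parameters to the optimizer instead of freezing them. This is a minimal PyTorch sketch; `Encoder` is a stand-in with assumed shapes, not the repository's actual model class:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained encoder mapping (batch, time, dim_in)
# spectrogram frames to a (batch, dim) embedding via mean+max pooling.
class Encoder(nn.Module):
    def __init__(self, dim_in=64, dim=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim), nn.ReLU())

    def forward(self, x):            # x: (batch, time, dim_in)
        h = self.net(x)              # (batch, time, dim)
        return h.mean(dim=1) + h.max(dim=1).values

encoder = Encoder()                  # pretrained weights would be loaded here
head = nn.Linear(2048, 10)

# End-to-end: optimize *both* encoder and head parameters,
# instead of freezing the encoder as in linear evaluation.
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

x = torch.randn(8, 96, 64)           # dummy batch of spectrograms
y = torch.randint(0, 10, (8,))       # dummy labels
loss = nn.functional.cross_entropy(head(encoder(x)), y)
loss.backward()                      # gradients flow into the encoder too
opt.step()
```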

The primary reason I didn't summarize all the results into nice tables is that I couldn't settle on the proper problem setting for "finetuning." Previous papers (including supervised-pretraining work) each use it in different ways: one uses data augmentations, another uses an MLP as the head for downstream tasks, and so on.

Since releasing BYOL-A to the public, I've been working on an analysis of BYOL-A that includes finetuning, and we're beginning to understand more, especially why ours works effectively.

I hope we can publish a follow-up paper soon with our newer findings; hopefully it will answer your question more fully.