nttcslab / byol-a

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation
https://arxiv.org/abs/2103.06695

Performing evaluation with only a small part of the spectrogram #17

Closed yasar-rehman closed 1 year ago

yasar-rehman commented 1 year ago

Hi

Thank you for your contribution. It's really interesting work. However, I have one question regarding the downstream evaluation. In the paper, you mention that "A segment of shape FxT was randomly cropped from each audio clip and encoded for linear evaluation in the downstream tasks."

However, as far as I know, this procedure was not adopted in the previous works. Have you tried the experiment where the complete log-mel spectrogram (without random cropping) is fed to the network during the evaluation stage? Is there any performance difference?

Thanks

daisukelab commented 1 year ago

Hi @yasar-rehman, thanks for your question.

Have you tried the experiment where the complete log-mel spectrogram (without random cropping) is fed to the network during the evaluation stage? Is there any performance difference?

Yes, I have tried that, and there was only a tiny performance drop, as far as I can remember. And since using a fixed duration allows batch encoding of the samples, the evaluation throughput became much faster. So I decided to crop clips to the dataset's average duration for further downstream task testing.

yasar-rehman commented 1 year ago

@daisukelab Thank you for the prompt response.

What do you mean by batch encoding? Does it mean that you cropped the audio waveform too?

daisukelab commented 1 year ago

Hi @yasar-rehman,

We use the dataset's average duration as the unit duration and randomly crop fixed-length waveforms of that duration. The batch encoding I mentioned above means encoding a batch of these cropped waveforms at once: a batch of fixed-duration waveforms has shape (B, L), where B is the batch size and L is the number of samples in the unit duration, which the model can encode in a single pass. If waveforms kept their individual durations, we would have to encode them one by one, which we wanted to avoid. P.S. Of course, the batch of waveforms is batch-preprocessed into log-mel spectrograms before finally being fed to the model.
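As a rough illustration of the cropping-and-stacking step described above, here is a minimal NumPy sketch. The function name, the zero-padding policy for clips shorter than the unit duration, and the exact crop logic are my assumptions for illustration, not code from the byol-a repository; the log-mel conversion and encoder call are omitted.

```python
import numpy as np

def random_crop_batch(waveforms, unit_len, rng=None):
    """Randomly crop each waveform to a fixed length and stack into (B, L).

    Assumption (not from the repo): waveforms shorter than unit_len are
    zero-padded at the end before cropping.
    """
    if rng is None:
        rng = np.random.default_rng()
    batch = []
    for w in waveforms:
        if len(w) < unit_len:
            w = np.pad(w, (0, unit_len - len(w)))
        start = rng.integers(0, len(w) - unit_len + 1)
        batch.append(w[start:start + unit_len])
    # Fixed-length crops stack into a single (B, L) array,
    # which a model can encode in one forward pass.
    return np.stack(batch)

# Three clips of different lengths become one (3, 16000) batch.
clips = [np.random.randn(n) for n in (16000, 22050, 8000)]
batch = random_crop_batch(clips, unit_len=16000)
print(batch.shape)  # (3, 16000)
```

In the actual pipeline, this (B, L) batch would then be batch-converted to log-mel spectrograms and fed to the encoder.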

yasar-rehman commented 1 year ago

@daisukelab Thank you for clarifying the batch encoding.

However, I am still not sure about using random cropping of the spectrogram during the evaluation phase. What is the motivation behind using small crops of the spectrogram for evaluation? Why not use the full spectrogram without any cropping, as other audio-understanding methods do?

Thanks

daisukelab commented 1 year ago

@yasar-rehman The reason is faster evaluation, which lets us run more experiments. And as I clarified above, I have confirmed that we get almost the same results. As you can see in Table IV of our paper below, we evaluated various previous models, so we needed a faster evaluation procedure. P.S. Please note that we had to run multiple evaluations to obtain statistical results; some models take a day or more for a single run.

https://ieeexplore.ieee.org/document/9944865

daisukelab commented 1 year ago

Let me close this for now. Please feel free to re-open with more questions.