nttcslab / byol-a

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation
https://arxiv.org/abs/2103.06695

Question about comments in the train.py #12

Closed ChenyangLEI closed 2 years ago

ChenyangLEI commented 3 years ago

https://github.com/nttcslab/byol-a/blob/master/train.py

At line 67, there are comments describing the shape of the input.

        # in fact, it should be (B, 1, F, T), e.g. (256, 1, 64, 96) where 64 is the number of mel bins
        paired_inputs = torch.cat(paired_inputs) # [(B,1,T,F), (B,1,T,F)] -> (2*B,1,T,F)


However, this differs from the description in the config.yaml file:

# Shape of log-mel spectrogram [F, T].
shape: [64, 96]

Sreyan88 commented 3 years ago

Hi @ChenyangLEI / @daisukelab ,

I have a similar question. The norm in the audio world is to use [T, F], but BYOL-A uses [F, T]. Is there a specific reason?

daisukelab commented 3 years ago

Hi @ChenyangLEI, thank you for reporting this issue; it's my fault. As you may have guessed, the comment in train.py is wrong, and config.yaml is correct.
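
For anyone checking the shapes, here is a minimal sketch of what the corrected comment describes. It uses NumPy arrays as stand-ins for the torch tensors in train.py, which is an assumption made only so the snippet runs without a training setup. The batch size B = 4 is arbitrary; F = 64 and T = 96 come from `shape: [64, 96]` in config.yaml.

```python
import numpy as np

# B is an arbitrary illustration value; F and T follow `shape: [64, 96]`.
B, F, T = 4, 64, 96

# Two augmented views of the same batch, each shaped (B, 1, F, T).
view_a = np.zeros((B, 1, F, T), dtype=np.float32)
view_b = np.zeros((B, 1, F, T), dtype=np.float32)

# Concatenating along axis 0 mirrors torch.cat's default dim=0:
# [(B,1,F,T), (B,1,F,T)] -> (2*B,1,F,T)
paired_inputs = np.concatenate([view_a, view_b], axis=0)
print(paired_inputs.shape)
```

With these values the concatenated batch has twice the batch dimension, and the mel-bin axis (F) comes before the time axis (T).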

Hi @Sreyan88, I wasn't aware that the [F, T] order goes against the convention. It simply follows the output feature shape of byol_a.dataset.MelSpectrogramLibrosa. Thanks for sharing this with me. I'd like to switch to [T, F] in the future. :)
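
Until then, code that expects the [T, F] order can swap the last two axes. The sketch below is a hypothetical illustration, not part of the repository. It again uses NumPy arrays in place of the real features, and the variable names are made up for this example.

```python
import numpy as np

# A single log-mel spectrogram in the [F, T] order used by BYOL-A.
lms_ft = np.zeros((64, 96), dtype=np.float32)

# Transposing gives the [T, F] order common in other audio code.
lms_tf = lms_ft.T

# For a batch shaped (B, 1, F, T), swap only the last two axes.
batch_ft = np.zeros((4, 1, 64, 96), dtype=np.float32)
batch_tf = np.swapaxes(batch_ft, -1, -2)

print(lms_tf.shape, batch_tf.shape)
```

With torch tensors, `x.transpose(-1, -2)` performs the same axis swap.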