nttcslab / byol-a

BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation
https://arxiv.org/abs/2103.06695

Doubt in paper #10

Closed Sreyan88 closed 2 years ago

Sreyan88 commented 2 years ago

Hi there,

Section 4, subsection A, part 1 from your paper says:

 The number of frames, T, in one segment was 96 in pretraining, which corresponds to 1,014ms. 

However, the previous line says the hop size used was 10 ms. By that reckoning, wouldn't 96 frames correspond to 960 ms?

Am I understanding something wrong here?

Thank You in advance!

daisukelab commented 2 years ago

Hi @Sreyan88, thanks again for your question. We use a hop size of 10 ms and a window size of 64 ms. T = 96 overlapping time frames are separated by 95 hops, and the last frame still spans a full window, so the segment covers 95 * 10 ms + 64 ms = 1,014 ms. That is the detail behind the figure in the paper. I hope this helps.
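
To make the arithmetic above concrete, here is a minimal sketch of the calculation. The function name `segment_duration_ms` and its parameters are hypothetical and are not taken from the repository; the sketch also assumes no padding is added at the segment edges.

```python
def segment_duration_ms(num_frames: int, hop_ms: float, window_ms: float) -> float:
    """Time span covered by `num_frames` overlapping analysis windows.

    Hypothetical helper for illustration only (not part of byol-a).
    Assumes no edge padding: the first window starts at 0 ms and each
    later window starts one hop after the previous one.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    # (num_frames - 1) hops separate the first and last window starts,
    # and the last window still extends a full window length.
    return (num_frames - 1) * hop_ms + window_ms


# Values quoted in this thread: T = 96 frames, 10 ms hop, 64 ms window.
print(segment_duration_ms(96, hop_ms=10, window_ms=64))  # 1014
```

The naive estimate of 96 * 10 ms = 960 ms counts only the hops and leaves out the part of the final window that extends past the last hop.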

daisukelab commented 2 years ago

Hi, I think my response answered your question, so I am closing this issue. Feel free to re-open it whenever needed.