minzwon / sota-music-tagging-models

input length of short-chunk CNN #13

Closed da-hon5 closed 2 years ago

da-hon5 commented 2 years ago

hi, why is the input length of the short-chunk CNN exactly 59049 samples or 3.69 seconds? thanks in advance, hannes

minzwon commented 2 years ago

Hi,

In most music tagging research, training on short audio excerpts is known to perform better than training on the entire sequence (30 seconds in MTAT or MSD). When we train a model at the instance level, we end up with more training examples (about 10 times as many if we use 3-sec excerpts from 30-sec audio). The task may become more difficult because the model is given less information. In practice, this sometimes yields a more robust music tagging model, probably due to the higher stochasticity.
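
To make the excerpt arithmetic concrete, here is a minimal sketch in plain Python. It is not code from this repository: the 16 kHz sample rate is an assumption (it is consistent with 59049 samples being about 3.69 seconds), and the helper names `split_into_chunks` and `random_chunk` are hypothetical.

```python
import random

SAMPLE_RATE = 16_000    # assumed sample rate in Hz; not stated in this thread
CLIP_SECONDS = 30       # clip length mentioned above for MTAT / MSD
CHUNK_SAMPLES = 59_049  # short-chunk CNN input length from the question


def split_into_chunks(waveform, chunk_samples):
    """Cut a waveform into non-overlapping chunks, dropping the remainder."""
    n_chunks = len(waveform) // chunk_samples
    return [
        waveform[i * chunk_samples:(i + 1) * chunk_samples]
        for i in range(n_chunks)
    ]


def random_chunk(waveform, chunk_samples, rng):
    """Pick one chunk at a random offset, as one might do per training step."""
    start = rng.randint(0, len(waveform) - chunk_samples)  # inclusive bounds
    return waveform[start:start + chunk_samples]


# A silent 30-second placeholder clip stands in for real audio.
clip = [0.0] * (SAMPLE_RATE * CLIP_SECONDS)

# 3-second excerpts give ten examples per 30-second clip.
three_sec_chunks = split_into_chunks(clip, 3 * SAMPLE_RATE)
print(len(three_sec_chunks))  # 10

# 59049-sample excerpts give eight non-overlapping examples per clip.
short_chunks = split_into_chunks(clip, CHUNK_SAMPLES)
print(len(short_chunks))  # 8

# Duration of one 59049-sample chunk under the assumed sample rate.
print(round(CHUNK_SAMPLES / SAMPLE_RATE, 2))  # 3.69

# Random offsets are one plausible source of the stochasticity mentioned above.
excerpt = random_chunk(clip, CHUNK_SAMPLES, random.Random(0))
print(len(excerpt))  # 59049
```

Drawing the offset at random means the model rarely sees exactly the same excerpt twice, which is one way to read the "higher stochasticity" remark. Whether the repository samples chunks this way is not confirmed in this thread.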