input length of short-chunk CNN

Hi,

In most music tagging research, it is known that training on short audio excerpts performs better than training on the entire sequence (30 seconds in MTAT or MSD). When we train a model at the instance (excerpt) level, we end up with more training examples: cutting 30-second audio into non-overlapping 3-second excerpts gives 10x as many. The task may become more difficult because the model is given less information per example. In practice, this sometimes yields a more robust music tagging model, probably due to the higher stochasticity.
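To make the arithmetic concrete, here is a minimal sketch of how a 30-second clip maps to 3-second excerpts. The 16 kHz sample rate and the helper names `chunk_bounds` and `random_chunk_bounds` are assumptions for illustration only, not code from the repository. The random variant shows one common way to draw a different excerpt each time a clip is visited, which is one possible source of the extra stochasticity.

```python
import random

def chunk_bounds(num_samples, sample_rate, chunk_seconds):
    """Return (start, end) sample indices of non-overlapping chunks.

    Trailing samples that do not fill a whole chunk are dropped.
    """
    chunk_len = int(round(sample_rate * chunk_seconds))
    n_chunks = num_samples // chunk_len
    return [(i * chunk_len, (i + 1) * chunk_len) for i in range(n_chunks)]

def random_chunk_bounds(num_samples, sample_rate, chunk_seconds, rng):
    """Return (start, end) of one chunk at a uniformly random offset."""
    chunk_len = int(round(sample_rate * chunk_seconds))
    if chunk_len > num_samples:
        raise ValueError("chunk is longer than the clip")
    # randint is inclusive on both ends, so the last valid start is allowed.
    start = rng.randint(0, num_samples - chunk_len)
    return start, start + chunk_len

sample_rate = 16000               # assumed sample rate (Hz)
num_samples = 30 * sample_rate    # one 30-second clip

bounds = chunk_bounds(num_samples, sample_rate, 3.0)
print(len(bounds))                # 10 excerpts per clip

rng = random.Random(0)
start, end = random_chunk_bounds(num_samples, sample_rate, 3.0, rng)
print(end - start)                # 48000 samples, i.e. 3 seconds
```

The returned index pairs can be used to slice whatever array type holds the waveform, e.g. `waveform[start:end]`.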

