Closed by da-hon5 2 years ago
Hi,
In music tagging research, it is commonly observed that training on short audio excerpts performs better than training on the entire sequence (30 seconds in MTAT or MSD). When we train a model at the instance (excerpt) level, we end up with more training examples: 3-second excerpts from 30-second audio give 10 times as many. The task may become more difficult because the model is given less information per example. In practice, this sometimes yields a more robust music tagging model, probably due to the higher stochasticity.
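To make the excerpt counting concrete, here is a minimal sketch in NumPy. It assumes a 16 kHz sampling rate (this is what 59049 samples for 3.69 seconds implies, but it is not stated in this thread), and the helper names `split_into_excerpts` and `random_excerpt` are hypothetical, not taken from any repository.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumption: implied by 59049 samples ~ 3.69 s, not stated in the thread


def split_into_excerpts(audio, excerpt_len):
    """Split a 1-D signal into non-overlapping excerpts, dropping any remainder."""
    n_excerpts = len(audio) // excerpt_len
    return audio[: n_excerpts * excerpt_len].reshape(n_excerpts, excerpt_len)


def random_excerpt(audio, excerpt_len, rng):
    """Draw one excerpt at a random offset (one source of stochasticity in training)."""
    start = rng.integers(0, len(audio) - excerpt_len + 1)
    return audio[start : start + excerpt_len]


# Stand-in for a 30-second clip (silence, just to count shapes).
clip = np.zeros(30 * SAMPLE_RATE, dtype=np.float32)

# 3-second excerpts: ten training examples from a single clip.
excerpts = split_into_excerpts(clip, 3 * SAMPLE_RATE)
print(excerpts.shape)  # (10, 48000)

# One random crop of 59049 samples, the input length asked about in this issue.
rng = np.random.default_rng(0)
crop = random_excerpt(clip, 59049, rng)
print(crop.shape, 59049 / SAMPLE_RATE)  # about 3.69 seconds under the 16 kHz assumption
```

Whether excerpts are cut on a fixed grid or sampled at random offsets is a design choice; random offsets give a different view of each clip in every epoch, which is one plausible reading of the "higher stochasticity" mentioned above.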
Hi, why is the input length of the short-chunk CNN exactly 59049 samples (3.69 seconds)? Thanks in advance, Hannes