VishwaasHegde opened this issue 4 years ago
I don't have access to the dataset at the moment, but it was not the RWC dataset itself; the vocal tracks were re-synthesized as described in the pYIN paper, in a manner similar to the MDB-stem-synth dataset. We obtained the re-synthesized files from the authors, and their labels contained continuous frequency annotations.
Thanks! Also, since you take just one pitch output per frame, why do you use a 'sigmoid' activation in Dense(360, activation='sigmoid', name="classifier") for the output layer? Would 'softmax' be a better option? I believe sigmoid is usually used for multi-label classification, whereas this is multi-class classification.
It's one of the tricks used in this approach, and it's not quite orthodox for classification tasks in ML: it also uses binary cross-entropy with soft labels, whereas labels are usually one-hot in classification models.
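As a numpy sketch of what binary cross-entropy with soft targets computes (just the loss term, not the author's training code; in practice this would be Keras's binary_crossentropy applied to the 360-dimensional output):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Element-wise BCE summed over the output bins.

    Unlike categorical cross-entropy, each bin is treated as an
    independent binary target, so y_true can be a soft
    (e.g. Gaussian-blurred) vector rather than one-hot.
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred) +
                   (1 - y_true) * np.log(1 - y_pred))
```

Because each of the 360 sigmoid outputs gets its own binary loss, bins near the true pitch can carry partial credit instead of the all-or-nothing target a softmax/one-hot setup implies.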
We found that this combination (binary cross-entropy with soft labels) worked more robustly for pitch estimation, combined with the decoding heuristic of taking the weighted average of activations near the argmax.
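A minimal sketch of that decoding heuristic, assuming a CREPE-style grid of 360 bins spaced 20 cents apart on a cents scale relative to 10 Hz (the offset constant and the ±4-bin window here are illustrative assumptions, not read from the repo):

```python
import numpy as np

CENTS_PER_BIN = 20
N_BINS = 360
# Assumed bin centers in cents relative to a 10 Hz reference.
bin_cents = 1997.38 + CENTS_PER_BIN * np.arange(N_BINS)

def decode_pitch(activations, window=4):
    """Weighted average of activations in a small window around the argmax.

    Averaging over neighboring bins yields sub-bin (finer than 20-cent)
    resolution, rather than snapping to the argmax bin center.
    """
    center = int(np.argmax(activations))
    lo = max(0, center - window)
    hi = min(N_BINS, center + window + 1)
    weights = activations[lo:hi]
    cents = np.sum(weights * bin_cents[lo:hi]) / np.sum(weights)
    return 10 * 2 ** (cents / 1200)  # cents -> Hz
```

This local averaging is why the predicted frequency is continuous even though the classifier itself only has 20-cent bins.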
Thanks for the info. May I ask how you obtained the soft labels? Was the data labelled that way? I have a similar dataset with hard pitch-frequency labels. The only way I can think of producing soft labels is to place a Gaussian around each pitch frequency with a standard deviation of 5-10 cents.
The labels I had contained Hz values (which don't necessarily align with semitone intervals), from which I calculated the soft labels using a Gaussian-shaped curve with a standard deviation of 25 cents. You can find example code in the comments of this issue.
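As a sketch of that labeling scheme (again assuming a 360-bin grid spaced 20 cents apart; the reference constants are illustrative assumptions):

```python
import numpy as np

CENTS_PER_BIN = 20
# Assumed bin centers in cents relative to a 10 Hz reference.
bin_cents = 1997.38 + CENTS_PER_BIN * np.arange(360)

def hz_to_cents(freq_hz):
    # Cents relative to the 10 Hz reference.
    return 1200 * np.log2(freq_hz / 10.0)

def soft_label(freq_hz, std_cents=25.0):
    """Gaussian-blurred target vector centered on the true pitch.

    Each bin's target is the Gaussian evaluated at the distance (in
    cents) between that bin's center and the annotated frequency.
    """
    target_cents = hz_to_cents(freq_hz)
    return np.exp(-((bin_cents - target_cents) ** 2) /
                  (2 * std_cents ** 2))
```

Since the annotated Hz value rarely falls exactly on a bin center, the Gaussian spreads the target over the few nearest bins, which pairs naturally with the binary cross-entropy loss discussed above.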
Hello! First of all, thanks for the amazing paper and the repo! I have a basic doubt: the RWC dataset documentation says the annotations are at semitone intervals, i.e., 100 cents. How is CREPE able to predict at 10- or 20-cent resolution?