This looks entirely correct. Don't forget that you can use the optional `--csv` flag to get per-file results and the `--verbose` flag, which will write out a comparison file for each track detailing what's happening in each time segment (much like the tables you posted above).
Thank you very much!
Hi, @jpauwels.
I've just read the MIREX ACE evaluation details (on the MIREX web page) and your paper on evaluating automatically estimated chord sequences. I'd like to use this issue to check two things: that I correctly understand how model evaluation works for the SeventhsBass (SB) vocab, and that I've used MusOOEvaluator correctly to evaluate the predictions output by my neural net.
I'll use a fictional toy example to check that I correctly understand the former:
Say we have two songs S1 and S2. S1 lasts for 8 seconds and S2 lasts for 6 seconds, and each song has five chords (and so five annotated segments). The chord start and end times (in seconds), ground truth (GT) labels and labels estimated by our ACE system are shown below for each song.
Let's say our ACE system is a neural net whose output layer has 217 neurons, corresponding to the 217 possible chord labels (including 'no chord') that can be assigned under the SB vocab. Let's also say that each output neuron gives the probability that the chord label for the current input is the label corresponding to that neuron. Then I think it's correct to say that the neural net can only predict chord labels that are part of the SB vocab. (I hope that makes sense. Please let me know if anything doesn't make sense or if you think that statement is incorrect!)
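To sanity-check the count of 217 output neurons, here is a minimal Python sketch that enumerates the SB vocab as 'no chord' plus 12 roots × 18 chord types (maj, min, 7, maj7 and min7, each in root position and with its inversions), and decodes an output vector by picking its most probable neuron. The label spelling (sharps, Harte-style `root:quality/bass`) and the neuron-to-label ordering are assumptions for illustration, not MusOOEvaluator's internal representation.

```python
# Hypothetical enumeration of the SeventhsBass (SB) vocab; spelling and order are illustrative.
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Each quality mapped to its allowed bass notes: "" is root position, the rest are inversions.
QUALITIES = {
    "maj": ["", "/3", "/5"],
    "min": ["", "/b3", "/5"],
    "7": ["", "/3", "/5", "/b7"],
    "maj7": ["", "/3", "/5", "/7"],
    "min7": ["", "/b3", "/5", "/b7"],
}


def sb_vocabulary():
    """Return the list of SB labels; index i is the label of output neuron i (assumed order)."""
    labels = ["N"]  # 'no chord'
    for root in ROOTS:
        for quality, basses in QUALITIES.items():
            for bass in basses:
                labels.append(f"{root}:{quality}{bass}")
    return labels


VOCAB = sb_vocabulary()


def decode(probabilities):
    """Map one output vector (one probability per neuron) to the label of its most probable neuron."""
    if len(probabilities) != len(VOCAB):
        raise ValueError(f"expected {len(VOCAB)} outputs, got {len(probabilities)}")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return VOCAB[best]


print(len(VOCAB))  # 217
```

Because `decode` can only ever return an entry of `VOCAB`, a network decoded this way can only predict SB labels.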
So, since both the GT and the estimated labels conform to the SB vocab, all we need to do to score each prediction in the tables above is to assign a 1 if the estimated label exactly matches the GT label, and a 0 otherwise. Is that correct?
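The exact-match rule can be sketched in Python as below, under the assumption stated above that reference and estimate both use only SB labels. Segment boundaries need not coincide: each stretch where a reference segment and an estimated segment overlap counts toward the correct duration only if the two labels are identical. The function name `csr`, the `(start, end, label)` segment format and the example segments are made up for illustration; they are not MusOOEvaluator's file format or code.

```python
def csr(reference, estimate):
    """Chord symbol recall for one song.

    reference, estimate: lists of (start, end, label) segments, times in seconds.
    Overlapping time scores 1 if the labels match exactly and 0 otherwise.
    """
    correct = 0.0
    for r_start, r_end, r_label in reference:
        for e_start, e_end, e_label in estimate:
            overlap = min(r_end, e_end) - max(r_start, e_start)
            if overlap > 0 and r_label == e_label:
                correct += overlap
    total = sum(end - start for start, end, _ in reference)
    return correct / total


# Made-up 8-second song with five segments; only the first 2 s are labelled correctly.
reference = [(0, 2, "C:maj"), (2, 4, "A:min"), (4, 5, "F:maj"), (5, 6, "G:7"), (6, 8, "C:maj")]
estimate = [(0, 2, "C:maj"), (2, 4, "A:min7"), (4, 5, "D:min"), (5, 6, "G:maj"), (6, 8, "A:min")]
print(csr(reference, estimate))  # 0.25
```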
Assuming that is correct, the chord symbol recall (CSR) scores for S1 and S2 are 2/8 (25%) and 4/6 (67%), respectively (total duration of correctly estimated segments divided by total duration of annotated segments). The weighted chord symbol recall (WCSR) for this simple example is then calculated as follows:
WCSR = [(8 × 2/8) + (6 × 4/6)] / (8 + 6) = 6/14 = 3/7 ≈ 42.9%,
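where WCSR is given by the following formula (taken from Junqi Deng's thesis, page 60), with L_i the duration of song i and CSR_i its chord symbol recall:

$$\mathrm{WCSR} = \frac{\sum_i L_i \cdot \mathrm{CSR}_i}{\sum_i L_i}$$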
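The same arithmetic in Python, as a minimal sketch that uses exact fractions so the 4/6 term is not rounded to 0.66666. `wcsr` is a hypothetical helper taking `(duration, csr)` pairs, fed with the toy numbers from this example.

```python
from fractions import Fraction


def wcsr(songs):
    """Weighted chord symbol recall: per-song CSRs weighted by song duration.

    songs: list of (duration, csr) pairs.
    """
    total = sum(duration for duration, _ in songs)
    return sum(duration * score for duration, score in songs) / total


# Toy example: S1 lasts 8 s with CSR 2/8, S2 lasts 6 s with CSR 4/6.
result = wcsr([(8, Fraction(2, 8)), (6, Fraction(4, 6))])
print(result)  # 3/7
```

Since each product duration × CSR is just the correctly estimated duration of that song, WCSR is equivalently the total correct duration over the total duration of all songs, here (2 + 4)/(8 + 6).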
Have I correctly calculated the scores and are my assumptions correct, or is there anything I'm not understanding correctly?
Secondly, I've just used MusOOEvaluator to evaluate the predictions that my own neural net (a CNN) has output for a single song. I ran the command
`MusOOEvaluator.exe --reffile C:\Users\caleb\Desktop\University\CS344\anjing.txt --testfile C:\Users\caleb\Desktop\University\CS344\anjing_predictions.txt --chords MirexSeventhsBass --output C:\Users\caleb\Desktop\University\CS344\output_anjing_WCSR.txt`
and inspected the output file. Everything seems fine, and my understanding is that the command I ran has calculated the WCSR of my predictions under the SB vocabulary. Is that correct?
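To evaluate further songs the same way, the command can be scripted. The Python sketch below only rebuilds the argument list used in the command above (`--reffile`, `--testfile`, `--chords MirexSeventhsBass`, `--output`) and passes it to `subprocess.run`; the helper names `musoo_command` and `run_musoo` are hypothetical, and it assumes `MusOOEvaluator.exe` is on the PATH or passed as a full path via `exe`.

```python
import subprocess


def musoo_command(ref_file, test_file, output_file,
                  chords="MirexSeventhsBass", exe="MusOOEvaluator.exe"):
    """Argument list for one MusOOEvaluator run, mirroring the command above."""
    return [exe,
            "--reffile", str(ref_file),
            "--testfile", str(test_file),
            "--chords", chords,
            "--output", str(output_file)]


def run_musoo(ref_file, test_file, output_file, exe="MusOOEvaluator.exe"):
    """Run the evaluator once; check=True raises CalledProcessError on a non-zero exit status."""
    subprocess.run(musoo_command(ref_file, test_file, output_file, exe=exe), check=True)
```

Calling `run_musoo` with the three paths from the command above should reproduce that single run; looping over pairs of reference and prediction files would repeat it per song.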
Thanks in advance for your help!