spotify / basic-pitch

A lightweight yet powerful audio-to-MIDI converter with pitch bend detection
https://basicpitch.io
Apache License 2.0

Question regarding evaluation #96

Open xinzuan opened 1 year ago

xinzuan commented 1 year ago

Hi, I ran basic-pitch/basic_pitch/experiments/run_evaluation.py from the wip-training branch with the MAESTRO dataset and the model checkpoint from basic-pitch/saved_models/icassp_2022.

I expected the results to be similar to those reported in the paper. However, I got the following:

```
{"Precision": 0.0, "Recall": 0.0, "F-measure": 0.0, "Average_Overlap_Ratio": 0.0, "Precision_no_offset": 0.04398411727609082, "Recall_no_offset": 0.029748905165349712, "F-measure_no_offset": 0.03468172982454684, "Average_Overlap_Ratio_no_offset": 0.5793096961557063, "Onset_Precision": 0.631602431674569, "Onset_Recall": 0.4181107759888922, "Onset_F-measure": 0.4925505866527016, "Offset_Precision": 0.7521021756258168, "Offset_Recall": 0.5273589516900296, "Offset_F-measure": 0.6072445448462509}
```

Based on my understanding of mir_eval's definitions, the paper's F should correspond to F-measure and Fno to F-measure_no_offset (I cannot find a mir_eval counterpart for Acc). However, as you can see from the result above, the values are really far from what is reported in the paper.

Could anyone please tell me which mir_eval metric corresponds to each metric in the paper?
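
For context, all of the keys in the result dict above come from a single call to `mir_eval.transcription.evaluate`, which computes the with-offset and no-offset variants together. A minimal sketch (the note data below is made up, not from MAESTRO):

```python
import numpy as np
import mir_eval

# Hypothetical reference and estimated notes: intervals are
# (onset_sec, offset_sec) rows, pitches are in Hz.
ref_intervals = np.array([[0.50, 1.00], [1.00, 1.50]])
ref_pitches = np.array([440.00, 523.25])
est_intervals = np.array([[0.52, 1.01], [1.02, 1.40]])
est_pitches = np.array([440.00, 523.25])

scores = mir_eval.transcription.evaluate(
    ref_intervals, ref_pitches, est_intervals, est_pitches
)
# "F-measure" requires onset, pitch AND offset to match, so it should be
# the paper's F; "F-measure_no_offset" ignores offsets, i.e. Fno.
print(scores["F-measure"], scores["F-measure_no_offset"])
```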

xinzuan commented 1 year ago

When I checked the values of ref_intervals and est_intervals, they looked very different:

```
ref_interval: [[  0.98046875   1.08723958]
 [  0.99739583   1.25260417]
 [  1.09375      1.16536458]
 ...
 [384.79557292 388.55338542]
 [384.79817708 388.61067708]
 [384.80989583 388.52864583]]
est_interval: [[387.07030113 387.36055057]
 [387.07030113 387.5475941 ]
 [387.07030113 387.5475941 ]
 ...
 [146.40265896 146.64646848]
 [362.68555193 362.89453152]
 [307.87372971 308.01433333]]
```
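
One quick way to spot this kind of mismatch before evaluating (a sketch; `describe_intervals` is a hypothetical helper, not part of the repo):

```python
import numpy as np

def describe_intervals(name, intervals):
    """Print the time range and onset ordering of an (N, 2) interval array."""
    intervals = np.asarray(intervals)
    onsets = intervals[:, 0]
    print(f"{name}: min={intervals.min():.2f}s max={intervals.max():.2f}s "
          f"sorted_by_onset={bool(np.all(np.diff(onsets) >= 0))}")

describe_intervals("ref", ref_intervals)
describe_intervals("est", est_intervals)
```

In the dump above, the reference onsets are sorted while the estimates are not, which already hints that the two arrays were produced by different code paths.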

I think this is one of the reasons the previous result is so far from what is reported in the paper. I made the following modifications to basic-pitch/basic_pitch/experiments/run_evaluation.py:

  1. Changed the minimum note length from 58.0 to 127.70 ms, following the inconsistent minimum note length discussed in issue #93 (see the frame-count sketch after this list).
  2. Modified the model_inference function as follows:

```python
# np, librosa, note_creation, run_inference, AUDIO_SAMPLE_RATE and
# FFT_HOP are already imported/defined in run_evaluation.py.
def model_inference(audio_path, model, save_path, minimum_note_length=127.70):
    output = run_inference(audio_path, model)

    frames = output["note"]   # e.g. shape (13678, 88)
    onsets = output["onset"]  # e.g. shape (13678, 88)

    # Convert the minimum note length from milliseconds to model frames;
    # output_to_notes_polyphonic requires it.
    min_note_len = int(np.round(minimum_note_length / 1000 * (AUDIO_SAMPLE_RATE / FFT_HOP)))

    estimated_notes = note_creation.output_to_notes_polyphonic(
        frames,
        onsets,
        onset_thresh=0.5,
        frame_thresh=0.3,
        infer_onsets=True,
        min_note_len=min_note_len,  # required; the function throws an error if not provided
        max_freq=None,              # required; the function throws an error if not provided
        min_freq=None,              # required; the function throws an error if not provided
    )
    # estimated_notes: [(start_frame, end_frame, pitch_midi, amplitude)] --
    # the times are frame indices here, converted to seconds via times_s below.

    pitch = np.array([n[2] for n in estimated_notes])
    pitch_hz = librosa.midi_to_hz(pitch)

    estimated_notes_with_pitch_bend = note_creation.get_pitch_bends(output["contour"], estimated_notes)
    times_s = note_creation.model_frames_to_time(output["contour"].shape[0])

    estimated_notes_time_seconds = [
        (times_s[note[0]], times_s[note[1]], note[2], note[3], note[4])
        for note in estimated_notes_with_pitch_bend
    ]

    midi = note_creation.note_events_to_midi(estimated_notes_time_seconds, save_path)

    intervals = np.array([[times_s[note[0]], times_s[note[1]]] for note in estimated_notes_with_pitch_bend])

    return intervals, pitch_hz, midi  # also return the MIDI object, to be used in the evaluation
```
  3. In the function ``main``, instead of using the intervals and pitch_hz returned by ``model_inference`` directly, I used:

```python
_, _, midi = model_inference(audio_path, model, save_path)

est_notes = io.load_notes_from_midi(midi=midi)
if est_notes is None:
    est_intervals = []
    est_pitches = []
else:
    est_intervals, est_pitches, _ = est_notes.to_mir_eval()
```
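
As a sanity check on step 1: assuming basic-pitch's constants are AUDIO_SAMPLE_RATE = 22050 and FFT_HOP = 256 (about 86.1 frames per second), the two minimum note lengths convert to frames as follows:

```python
import numpy as np

AUDIO_SAMPLE_RATE = 22050  # Hz (assumed value of the repo constant)
FFT_HOP = 256              # samples per frame (assumed value of the repo constant)

def ms_to_frames(ms):
    """Convert a duration in milliseconds to a whole number of model frames."""
    return int(np.round(ms / 1000 * (AUDIO_SAMPLE_RATE / FFT_HOP)))

print(ms_to_frames(58.0))    # 5 frames  (old value)
print(ms_to_frames(127.70))  # 11 frames (value from issue #93)
```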



I finally got results close to those reported in the paper:
{'Precision': 0.11997030494604051, 'Recall': 0.11606390831628464, **'F-measure': 0.11663329326696836**, 'Average_Overlap_Ratio': 0.8401297548289717, 'Precision_no_offset': 0.7436669014704781, 'Recall_no_offset': 0.6548245337432261, **'F-measure_no_offset': 0.6874150165838026**, 'Average_Overlap_Ratio_no_offset': 0.4262920646319229, 'Onset_Precision': 0.8259000078273144, 'Onset_Recall': 0.721544837754125, 'Onset_F-measure': 0.7601824436965499, 'Offset_Precision': 0.5818535280932536, 'Offset_Recall': 0.504137416529927, 'Offset_F-measure': 0.5329684074137423}
drubinstein commented 10 months ago

Hi @xinzuan. The training branch is still a work in progress, so don't rely on it too heavily. Regarding your issue, it's possible that there was a difference in units between the estimated and reference timestamps and frequency values, and that your solution took care of that difference.
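
One common source of such a unit mismatch is pitch: mir_eval.transcription expects pitches in Hz, so if one side is still in MIDI note numbers almost nothing will match. A sketch of a pre-evaluation check (`looks_like_midi` is a hypothetical helper, not from the repo):

```python
import numpy as np
import librosa

def looks_like_midi(pitches):
    """Heuristic: piano MIDI numbers live in [21, 108]; Hz values run far higher."""
    return float(np.asarray(pitches, dtype=float).max()) <= 128.0

est_pitches = np.array([60.0, 64.0, 67.0])  # hypothetical estimates
if looks_like_midi(est_pitches):
    est_pitches = librosa.midi_to_hz(est_pitches)  # convert before calling mir_eval
```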