spotify / basic-pitch

A lightweight yet powerful audio-to-MIDI converter with pitch bend detection
https://basicpitch.io
Apache License 2.0

Question regarding evaluation #96

Open xinzuan opened 1 year ago

xinzuan commented 1 year ago

Hi, I ran basic-pitch/basic_pitch/experiments/run_evaluation.py from the wip-training branch with the MAESTRO dataset and the model checkpoint from basic-pitch/saved_models/icassp_2022.

I expected the results to be similar to those reported in the paper. However, I got the following:

```
{"Precision": 0.0, "Recall": 0.0, "F-measure": 0.0, "Average_Overlap_Ratio": 0.0, "Precision_no_offset": 0.04398411727609082, "Recall_no_offset": 0.029748905165349712, "F-measure_no_offset": 0.03468172982454684, "Average_Overlap_Ratio_no_offset": 0.5793096961557063, "Onset_Precision": 0.631602431674569, "Onset_Recall": 0.4181107759888922, "Onset_F-measure": 0.4925505866527016, "Offset_Precision": 0.7521021756258168, "Offset_Recall": 0.5273589516900296, "Offset_F-measure": 0.6072445448462509}
```

Based on my understanding of mir_eval's definitions, the paper's F should correspond to F-measure and Fno to F-measure_no_offset (I cannot find a mir_eval counterpart for Acc). However, as you can see from the result above, the values are really far from what is reported in the paper.

Could anyone please tell me which mir_eval metric corresponds to each metric in the paper?
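
For context, all of the keys in the result dict above come from a single call to `mir_eval.transcription.evaluate`, which computes the with-offset and no-offset variants together. A minimal sketch (the note data below is made up, not from MAESTRO):

```python
import numpy as np
import mir_eval

# Hypothetical reference and estimated notes: intervals are
# (onset_sec, offset_sec) rows, pitches are in Hz.
ref_intervals = np.array([[0.50, 1.00], [1.00, 1.50]])
ref_pitches = np.array([440.00, 523.25])
est_intervals = np.array([[0.52, 1.01], [1.02, 1.40]])
est_pitches = np.array([440.00, 523.25])

scores = mir_eval.transcription.evaluate(
    ref_intervals, ref_pitches, est_intervals, est_pitches
)
# "F-measure" requires onset, pitch AND offset to match, so it should be
# the paper's F; "F-measure_no_offset" ignores offsets, i.e. Fno.
print(scores["F-measure"], scores["F-measure_no_offset"])
```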

xinzuan commented 1 year ago

When I checked the values of ref_intervals and est_intervals, they looked very different:

```
ref_interval: [[  0.98046875   1.08723958]
 [  0.99739583   1.25260417]
 [  1.09375      1.16536458]
 ...
 [384.79557292 388.55338542]
 [384.79817708 388.61067708]
 [384.80989583 388.52864583]]
est_interval: [[387.07030113 387.36055057]
 [387.07030113 387.5475941 ]
 [387.07030113 387.5475941 ]
 ...
 [146.40265896 146.64646848]
 [362.68555193 362.89453152]
 [307.87372971 308.01433333]]
```
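
One quick way to spot this kind of mismatch before evaluating (a sketch; `describe_intervals` is a hypothetical helper, not part of the repo):

```python
import numpy as np

def describe_intervals(name, intervals):
    """Print the time range and onset ordering of an (N, 2) interval array."""
    intervals = np.asarray(intervals)
    onsets = intervals[:, 0]
    print(f"{name}: min={intervals.min():.2f}s max={intervals.max():.2f}s "
          f"sorted_by_onset={bool(np.all(np.diff(onsets) >= 0))}")

describe_intervals("ref", ref_intervals)
describe_intervals("est", est_intervals)
```

In the dump above, the reference onsets are sorted while the estimates are not, which already hints that the two arrays were produced by different code paths.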

I think this is one of the reasons the previous result is so far from what is reported in the paper. I made the following modifications to basic-pitch/basic_pitch/experiments/run_evaluation.py:

  1. Changed the minimum note length from 58.0 to 127.70 ms, following the inconsistent minimum note length discussed in issue #93 (see the frame-count sketch after this list).
  2. Modified the model_inference function as follows:

```python
# np, librosa, note_creation, run_inference, AUDIO_SAMPLE_RATE and
# FFT_HOP are already imported/defined in run_evaluation.py.
def model_inference(audio_path, model, save_path, minimum_note_length=127.70):
    output = run_inference(audio_path, model)

    frames = output["note"]   # e.g. shape (13678, 88)
    onsets = output["onset"]  # e.g. shape (13678, 88)

    # Convert the minimum note length from milliseconds to model frames;
    # output_to_notes_polyphonic requires it.
    min_note_len = int(np.round(minimum_note_length / 1000 * (AUDIO_SAMPLE_RATE / FFT_HOP)))

    estimated_notes = note_creation.output_to_notes_polyphonic(
        frames,
        onsets,
        onset_thresh=0.5,
        frame_thresh=0.3,
        infer_onsets=True,
        min_note_len=min_note_len,  # required; the function throws an error if not provided
        max_freq=None,              # required; the function throws an error if not provided
        min_freq=None,              # required; the function throws an error if not provided
    )
    # estimated_notes: [(start_frame, end_frame, pitch_midi, amplitude)] --
    # the times are frame indices here, converted to seconds via times_s below.

    pitch = np.array([n[2] for n in estimated_notes])
    pitch_hz = librosa.midi_to_hz(pitch)

    estimated_notes_with_pitch_bend = note_creation.get_pitch_bends(output["contour"], estimated_notes)
    times_s = note_creation.model_frames_to_time(output["contour"].shape[0])

    estimated_notes_time_seconds = [
        (times_s[note[0]], times_s[note[1]], note[2], note[3], note[4])
        for note in estimated_notes_with_pitch_bend
    ]

    midi = note_creation.note_events_to_midi(estimated_notes_time_seconds, save_path)

    intervals = np.array([[times_s[note[0]], times_s[note[1]]] for note in estimated_notes_with_pitch_bend])

    return intervals, pitch_hz, midi  # also return the MIDI object, to be used in the evaluation
```
  3. In the function ``main``, instead of using the intervals and pitch_hz returned by ``model_inference`` directly, I used:

```python
_, _, midi = model_inference(audio_path, model, save_path)

est_notes = io.load_notes_from_midi(midi=midi)
if est_notes is None:
    est_intervals = []
    est_pitches = []
else:
    est_intervals, est_pitches, _ = est_notes.to_mir_eval()
```
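
As a sanity check on step 1: assuming basic-pitch's constants are AUDIO_SAMPLE_RATE = 22050 and FFT_HOP = 256 (about 86.1 frames per second), the two minimum note lengths convert to frames as follows:

```python
import numpy as np

AUDIO_SAMPLE_RATE = 22050  # Hz (assumed value of the repo constant)
FFT_HOP = 256              # samples per frame (assumed value of the repo constant)

def ms_to_frames(ms):
    """Convert a duration in milliseconds to a whole number of model frames."""
    return int(np.round(ms / 1000 * (AUDIO_SAMPLE_RATE / FFT_HOP)))

print(ms_to_frames(58.0))    # 5 frames  (old value)
print(ms_to_frames(127.70))  # 11 frames (value from issue #93)
```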



I finally got results close to those reported in the paper:
{'Precision': 0.11997030494604051, 'Recall': 0.11606390831628464, **'F-measure': 0.11663329326696836**, 'Average_Overlap_Ratio': 0.8401297548289717, 'Precision_no_offset': 0.7436669014704781, 'Recall_no_offset': 0.6548245337432261, **'F-measure_no_offset': 0.6874150165838026**, 'Average_Overlap_Ratio_no_offset': 0.4262920646319229, 'Onset_Precision': 0.8259000078273144, 'Onset_Recall': 0.721544837754125, 'Onset_F-measure': 0.7601824436965499, 'Offset_Precision': 0.5818535280932536, 'Offset_Recall': 0.504137416529927, 'Offset_F-measure': 0.5329684074137423}
drubinstein commented 10 months ago

Hi @xinzuan. The training branch is still a work in progress, so don't rely on it too heavily. Regarding your issue, it's possible that there was a difference in units between the estimated and reference timestamps and frequency values, and that your solution took care of that difference.
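
One common source of such a unit mismatch is pitch: mir_eval.transcription expects pitches in Hz, so if one side is still in MIDI note numbers almost nothing will match. A sketch of a pre-evaluation check (`looks_like_midi` is a hypothetical helper, not from the repo):

```python
import numpy as np
import librosa

def looks_like_midi(pitches):
    """Heuristic: piano MIDI numbers live in [21, 108]; Hz values run far higher."""
    return float(np.asarray(pitches, dtype=float).max()) <= 128.0

est_pitches = np.array([60.0, 64.0, 67.0])  # hypothetical estimates
if looks_like_midi(est_pitches):
    est_pitches = librosa.midi_to_hz(est_pitches)  # convert before calling mir_eval
```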