sigsep / bsseval

audio source separation evaluation metrics

Do not exclude segments where one/several estimates are all-zero #4

Open StefanUhlich-sony opened 6 years ago

StefanUhlich-sony commented 6 years ago

Currently, segments in which one or several estimates are all-zero are excluded from the BSSEval computation:

https://github.com/sigsep/sigsep-mus-eval/blob/05d52e4962660417801b78aa82ac598dd8c7b25a/museval/metrics.py#L300
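
For reference, the linked check amounts to something like the following paraphrased sketch of the framewise selection (not museval's exact code; `select_valid_frames`, `win` and `hop` are illustrative names for the framing parameters):

import numpy as np

def select_valid_frames(references, estimates, win, hop):
    """Return frame start indices where no source is all-zero.

    Sketch of the framewise selection, assuming `references` and
    `estimates` have shape (nsrc, nsamples, nchannels).
    """
    nsamples = references.shape[1]
    keep = []
    for start in range(0, nsamples - win + 1, hop):
        ref = references[:, start:start + win]
        est = estimates[:, start:start + win]
        # a frame is skipped if ANY reference or estimate is silent in it
        if np.all(np.any(ref, axis=(1, 2))) and np.all(np.any(est, axis=(1, 2))):
            keep.append(start)
    return keep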

This leads to the effect that the SDR value, which is defined for the j-th instrument as

SDR_j = 10 \log_{10} \frac{ \sum_{i,n} s_{ij}(n)^2 }{ \sum_{i,n} \left( s_{ij}(n) - \hat{s}_{ij}(n) \right)^2 },

depends on the other estimates \hat{s}_{ik}(n) for k \ne j, even though they do not appear in the definition. Here is a quick example that shows the effect:

import musdb
import museval

import numpy as np

def estimate_and_evaluate1(track):
    """ Simple baseline system using mixture as estimate """
    estimates = {}
    estimates['vocals'] = 0.25 * track.audio
    estimates['accompaniment'] = 0.75 * track.audio

    scores = museval.eval_mus_track(track, estimates, output_dir='.')
    print('Score for `estimate_and_evaluate1`:')
    print(scores)

    return estimates

def estimate_and_evaluate2(track):
    """ Modified baseline system, which sets the second half of `vocals` to zero """
    estimates = {}
    estimates['vocals'] = 0.25 * track.audio
    estimates['accompaniment'] = 0.75 * track.audio

    estimates['vocals'] *= np.vstack((np.ones((track.audio.shape[0] // 2, 2)),
                                      np.zeros((track.audio.shape[0] - track.audio.shape[0] // 2, 2))))

    scores = museval.eval_mus_track(track, estimates, output_dir='.')
    print('Score for `estimate_and_evaluate2`:')
    print(scores)

    return estimates

def estimate_and_evaluate3(track):
    """ Modified baseline system, which sets the first half of `vocals` to zero """
    estimates = {}
    estimates['vocals'] = 0.25 * track.audio
    estimates['accompaniment'] = 0.75 * track.audio

    estimates['vocals'] *= np.vstack((np.zeros((track.audio.shape[0] // 2, 2)),
                                      np.ones((track.audio.shape[0] - track.audio.shape[0] // 2, 2))))

    scores = museval.eval_mus_track(track, estimates, output_dir='.')
    print('Score for `estimate_and_evaluate3`:')
    print(scores)

    return estimates

mus = musdb.DB(root_dir='/speech/db/mul/separ4/sisec/data2018/', is_wav=True)
mus.run(estimate_and_evaluate1, estimates_dir=".", tracks=[mus.load_mus_tracks(subsets='test')[0]])
mus.run(estimate_and_evaluate2, estimates_dir=".", tracks=[mus.load_mus_tracks(subsets='test')[0]])
mus.run(estimate_and_evaluate3, estimates_dir=".", tracks=[mus.load_mus_tracks(subsets='test')[0]])

`estimate_and_evaluate*` are three simple baseline systems that all use the mixture as the estimate. Only the vocals estimate differs between the versions, yet the BSSEval values for accompaniment change as well, because zeroing part of vocals changes which segments are excluded from the evaluation:

$ python separ_and_evaluate.py 
  0%|          | 0/1 [00:00<?, ?it/s]
Score for `estimate_and_evaluate1`:
vocals              => SDR:-10.161dB, SIR:-16.848dB, ISR:2.421dB, SAR:28.828dB, 
accompaniment       => SDR:6.991dB, SIR:12.551dB, ISR:11.751dB, SAR:28.828dB, 

100%|██████████| 1/1 [01:25<00:00, 85.16s/it]
  0%|          | 0/1 [00:00<?, ?it/s]
Score for `estimate_and_evaluate2`:
vocals              => SDR:-12.816dB, SIR:-15.727dB, ISR:0.177dB, SAR:-1.699dB, 
accompaniment       => SDR:7.181dB, SIR:14.078dB, ISR:11.783dB, SAR:27.795dB, 

100%|██████████| 1/1 [01:11<00:00, 71.51s/it]
  0%|          | 0/1 [00:00<?, ?it/s]
Score for `estimate_and_evaluate3`:
vocals              => SDR:-7.410dB, SIR:-11.257dB, ISR:0.695dB, SAR:2.519dB, 
accompaniment       => SDR:6.783dB, SIR:10.938dB, ISR:11.722dB, SAR:29.830dB, 

100%|██████████| 1/1 [01:15<00:00, 75.29s/it]
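
Computing SDR directly from the definition above makes the inconsistency concrete: the formula for source j involves only s_j and \hat{s}_j, so it cannot change when another estimate is modified. A minimal self-contained numpy check (the `sdr` helper and the random signals are illustrative stand-ins, not museval's implementation):

import numpy as np

def sdr(reference, estimate):
    """SDR = 10*log10( sum s^2 / sum (s - s_hat)^2 ), summed over
    samples and channels, per the definition above."""
    return 10 * np.log10(np.sum(reference ** 2) /
                         np.sum((reference - estimate) ** 2))

rng = np.random.RandomState(0)
vocals = rng.randn(44100, 2)            # random stand-ins for real stems
accompaniment = rng.randn(44100, 2)
mixture = vocals + accompaniment

vocals_est = 0.25 * mixture
accompaniment_est = 0.75 * mixture

print(sdr(accompaniment, accompaniment_est))
vocals_est[:22050] = 0                  # modify only the vocals estimate
print(sdr(accompaniment, accompaniment_est))  # identical: the plain
                                              # definition ignores it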

@faroit @aliutkus What do you think? Should this be changed for a future version of BSSEval?

faroit commented 6 years ago

I remember we did this on purpose, but I can't remember why. Maybe @aliutkus can jump in here?