sigsep / sigsep-mus-eval

museval - source separation evaluation tools for python
https://sigsep.github.io/sigsep-mus-eval/
MIT License

When evaluating with accompaniment.wav, voice target is evaluated twice with two different SDRs #65

Open adefossez opened 5 years ago

adefossez commented 5 years ago

When evaluating a folder containing bass.wav, drums.wav, other.wav, voice.wav and accompaniment.wav, the voice target is evaluated twice: once against the set of sources [drums, bass, other, voice] and once against [voice, accompaniment].

The second evaluation overwrites the first one for the voice target in the json files. One would expect the SDR definition not to depend on the other sources. However, this is not the case, as the filters are computed using the cross-correlations between all available sources (4 in the first case, 2 in the second). I observed that the second evaluation consistently obtains a higher SDR score for voice, by around 0.2 dB in my case.
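To make the dependence on the reference set concrete, here is a toy sketch. It is not museval's implementation: it assumes instantaneous gains (filter length 1), random noise in place of audio, and a made-up vocals estimate with uneven leakage. It only shows that a least-squares projection of the same estimate changes when the references are [drums, bass, other, vocals] rather than [vocals, accompaniment].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical toy references: random noise standing in for audio stems.
drums, bass, other, vocals = rng.standard_normal((4, n))
accompaniment = drums + bass + other

# Hypothetical vocals estimate: true vocals plus unevenly leaked stems.
estimate = vocals + 0.3 * drums + 0.1 * bass + 0.05 * other


def residual_energy(estimate, references):
    """Energy left after a least-squares projection onto the references."""
    basis = np.stack(references, axis=1)  # shape (n, n_sources)
    gains, *_ = np.linalg.lstsq(basis, estimate, rcond=None)
    return float(np.sum((estimate - basis @ gains) ** 2))


res_4 = residual_energy(estimate, [drums, bass, other, vocals])
res_2 = residual_energy(estimate, [vocals, accompaniment])

print(f"residual with 4 references: {res_4:.3f}")
print(f"residual with 2 references: {res_2:.3f}")
```

With four references the leakage is fully explained by the individual stems, so almost nothing is left over. With two references only a single common gain on the summed accompaniment is available, so part of the same leakage ends up in the residual. Whether and how this carries over to the SDR reported by museval is exactly the question raised here.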

This can lead to unfair comparisons between models, or to hard-to-reproduce results, depending on whether one exports the accompaniment.wav file or not. For instance, in the SiSEC evaluation campaign, the json for the MMDenseLSTM model (TAK2) contains an accompaniment entry, showing that its vocal metrics were overwritten. On the other hand, the OpenUnmix model did not export this file and thus obtains a worse SDR for vocals.

While the difference is not huge, I opened this issue to verify that it is normal that the SDR depends on the other sources and not just on the current source estimate, and also to see if this behavior should be documented.

As an example, one can use the wav files available in this Dropbox folder. I also included the json files. They were generated with

museval --musdb PATH_TO_MUSDB -o evals/without_accompaniment without_accompaniment
museval --musdb PATH_TO_MUSDB -o evals/with_accompaniment with_accompaniment

where the folder without_accompaniment does not contain an accompaniment.wav file and with_accompaniment contains one that is equal to the sum of bass, other and drums.

Then running

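from museval import EvalStore

# Load the json results of the run that included accompaniment.wav
es_with = EvalStore()
es_with.add_eval_dir("./evals/with_accompaniment/")

# Load the json results of the run without accompaniment.wav
es_without = EvalStore()
es_without.add_eval_dir("./evals/without_accompaniment/")

# Print the aggregated scores of each run
print(es_with)
print(es_without)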

one obtains

Aggrated Scores (median over frames, median over tracks)
drums           ==> SDR:   3.257  SIR:   9.204  ISR:   4.070  SAR:   3.586
bass            ==> SDR:  -0.185  SIR:  -0.593  ISR:   8.657  SAR:   8.025
other           ==> SDR:   5.760  SIR:   8.622  ISR:  13.534  SAR:   6.595
vocals          ==> SDR:   9.305  SIR:  19.721  ISR:  13.324  SAR:   9.268
accompaniment   ==> SDR:  13.800  SIR:  18.317  ISR:  23.927  SAR:  14.181

Aggrated Scores (median over frames, median over tracks)
vocals          ==> SDR:   9.132  SIR:  18.688  ISR:  13.443  SAR:   9.097
drums           ==> SDR:   3.257  SIR:   9.204  ISR:   4.070  SAR:   3.586
bass            ==> SDR:  -0.185  SIR:  -0.593  ISR:   8.657  SAR:   8.025
other           ==> SDR:   5.760  SIR:   8.622  ISR:  13.534  SAR:   6.595
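
The gap on vocals can be read off the two tables above. The short sketch below only redoes that arithmetic with the printed values (with accompaniment minus without); the dictionary names are just for illustration.

```python
# Vocals scores copied from the two tables above.
with_acc = {"SDR": 9.305, "SIR": 19.721, "ISR": 13.324, "SAR": 9.268}
without_acc = {"SDR": 9.132, "SIR": 18.688, "ISR": 13.443, "SAR": 9.097}

# Difference per metric: (with accompaniment) - (without accompaniment).
delta = {m: round(with_acc[m] - without_acc[m], 3) for m in with_acc}

for metric, diff in delta.items():
    print(f"vocals {metric}: {diff:+.3f} dB")
```

The SDR, SIR and SAR of vocals are higher in the run that includes accompaniment.wav, while the ISR is slightly lower. The drums, bass and other rows are identical in both runs.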

faroit commented 5 years ago

When you call museval from the cli, you are running the eval_dir function. What you want instead is to use the SiSEC MUS task-like scenario functions eval_mus_track or eval_mus_dir, which contain specific treatment of the accompaniment.

adefossez commented 5 years ago

@faroit this specific treatment is exactly the source of the confusion. It silently overwrites the vocals metrics from the 4-source scenario and replaces them with slightly better metrics from the 2-source scenario, leaving the other 3 untouched.

As far as I can see, museval from the cli does not call eval_dir. The entry point in setup.py is the museval function, which calls eval_mus_dir, which in turn calls _load_track_estimates and finally eval_mus_track, which has the described behavior.
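
The overwrite itself can be pictured with a minimal sketch. This is not museval's code: the `record` helper and the `scores` dict are hypothetical stand-ins for a per-track result store keyed by target name, filled with the SDR values from the tables above.

```python
# Hypothetical stand-in for a per-track score store keyed by target name.
scores = {}


def record(scores, results):
    """Store one SDR per target; a later entry replaces an earlier one."""
    for target, sdr in results.items():
        scores[target] = sdr


# Pass 1: four-source scenario (vocals SDR as in the run without accompaniment).
record(scores, {"drums": 3.257, "bass": -0.185, "other": 5.760, "vocals": 9.132})

# Pass 2: two-source scenario; the vocals entry is written a second time.
record(scores, {"vocals": 9.305, "accompaniment": 13.800})

print(scores)
```

After the second pass the store holds five targets, the vocals entry carries the 2-source value, and drums, bass and other keep their 4-source values, which matches what the json files show.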

aliutkus commented 5 years ago

I wonder why the SDR is impacted