sigsep / sigsep-mus-eval

museval - source separation evaluation tools for python
https://sigsep.github.io/sigsep-mus-eval/
MIT License

Evaluation results vary with volume #73

Closed nx5216 closed 3 years ago

nx5216 commented 4 years ago

When I change the volume of the estimates, the results from museval change, but when I use mir_eval the results do not. Is there a difference between museval and mir_eval?

faroit commented 4 years ago

There are differences between museval and mir_eval. If you want the same results from museval as from mir_eval, you have to use mode='v3' instead of the default.

Please tell me if that helps.
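
Roughly, something like this (a minimal sketch from memory, so please check the call signatures of museval.evaluate and mir_eval.separation.bss_eval_sources against the docs; the shapes and names are just illustrative):

```python
import numpy as np
import museval
import mir_eval

rate = 44100
# toy data: 2 sources, 3 seconds of stereo audio, shape (nsrc, nsamples, nchannels)
references = np.random.randn(2, 3 * rate, 2)
estimates = references + 0.1 * np.random.randn(2, 3 * rate, 2)

# museval with the bss_eval v3 behaviour instead of the default 'v4'
sdr, isr, sir, sar = museval.evaluate(
    references, estimates, win=rate, hop=rate, mode='v3'
)

# mir_eval's bss_eval_sources works on mono signals of shape (nsrc, nsamples),
# so compare on a downmix of the same material
sdr_me, sir_me, sar_me, _ = mir_eval.separation.bss_eval_sources(
    references.mean(axis=-1), estimates.mean(axis=-1)
)
```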

nx5216 commented 4 years ago

I have tried museval with mode='v3', but the results still change with the volume. Here are the results: museval: (screenshot) mir_eval: (screenshot). If I want to use museval, is it necessary to adjust the volume to a proper value? I think the evaluation method should be independent of volume.

faroit commented 4 years ago

@aliutkus any idea?

fxmarty commented 4 years ago

Isn't it because in this call bsseval_sources_version is always set to False, even when we set mode='v3'? According to the explanations in metrics.py, this means that bss_eval_image will be used for the computation, which is scale-dependent but does not introduce fancy filters.

Correct me if I am wrong, but the only thing mode changes is that, for bss_eval_source, the allowed distortion filter is the same over the whole track rather than varying over time. For bss_eval_image, I am not sure what the version changes.

If one wants to have a constant SDR no matter the scaling, there are two choices:

  1. Use bss_eval_source, e.g. from mir_eval. The hidden cost is that it allows some distortions that may be unwanted: for MUSDB, for instance, it makes no sense to allow distortions that are supposed to model the difference between the sources and the microphones, when the mixture is actually just the sum of the individual sources (as in open-unmix).
  2. Use SI-SDR, as defined in the original 2006 paper on metrics and as proposed in this paper. This does not seem to be implemented in mir_eval, museval, or bss_eval (see the sketch below).
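
For completeness, SI-SDR is simple enough to compute by hand. A minimal numpy sketch of my understanding of the definition (not an official implementation), just to show that it ignores the scale of the estimate:

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR: project the estimate onto the reference and
    measure the energy ratio between that projection and the residual."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(residual ** 2))

rng = np.random.default_rng(0)
ref = rng.standard_normal(44100)
est = ref + 0.1 * rng.standard_normal(44100)

print(si_sdr(est, ref))        # roughly 20 dB for this noise level
print(si_sdr(3.0 * est, ref))  # identical: rescaling the estimate changes nothing
```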

Something I don't really understand in the end is what "SDR" means in SiSeC 2018 and in subsequent papers. From what I understand, it means that the "raw" SDR from bss_eval_image has been used. But what about the scaling? How do we know that some results weren't made artificially better just by scaling?

From my understanding of the aforementioned paper introducing SI-SDR, the "raw" SDR is upper-bounded for a given target and estimate. But does that mean that papers using bss_eval_image rescaled their outputs to maximize the SDR?
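
To make the scaling point concrete, here is a quick numpy check (plain energy-ratio SDR with no allowed distortion, which is not exactly what bss_eval_image computes, but the simplest scale-dependent case): the value moves with the gain applied to the estimate, and for a fixed pair it is maximised, hence upper-bounded, at the least-squares gain.

```python
import numpy as np

def raw_sdr(estimate, reference):
    # plain energy ratio with no allowed distortion: depends on the
    # scale of the estimate, unlike SI-SDR
    return 10 * np.log10(
        np.sum(reference ** 2) / np.sum((reference - estimate) ** 2)
    )

rng = np.random.default_rng(0)
ref = rng.standard_normal(44100)
est = ref + 0.2 * rng.standard_normal(44100)

for gain in (0.5, 1.0, 2.0):
    print(gain, raw_sdr(gain * est, ref))  # a different value for each gain

# the gain that maximises this SDR is the least-squares coefficient,
# so for a fixed (target, estimate) pair the value is indeed upper-bounded
best_gain = np.dot(ref, est) / np.dot(est, est)
print(best_gain, raw_sdr(best_gain * est, ref))
```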

fxmarty commented 4 years ago

Sorry to tag, but in case you did not see it @faroit @aliutkus

faroit commented 3 years ago

@fxmarty sorry for the late reply. As I remember, we made the decision to make bss_eval_images the default for music tasks because we wanted to prevent people from continuing to use bss_eval_sources.

Something I don't really understand in the end is what "SDR" means in SiSeC 2018 and in subsequent papers. From what I understand, it means that the "raw" SDR bss_eval_image has been used. But what about the scaling? How do we know that some results weren't made artificially better just by scaling?

I guess you don't. The normal SDR is not scale-invariant, and it's likely that results are worse for some methods due to bad scaling (e.g. an STFT that doesn't reconstruct well).

I guess @aliutkus has some opinions, but you'd better contact him via mail ;-)