SDR performance regression since version 1.0.0

sevagh commented 2 years ago

Hi all,

So, when I evaluated UMX and X-UMX (to compare with my model, xumx-sliCQ), in my MDX published paper, I reported the following median SDR values:

Our model, xumx-sliCQ, was trained on MUSDB18-HQ. On the test set, xumx-sliCQ achieved a median SDR of 3.6 dB versus the 4.64 dB of UMX and 5.54 dB of X-UMX, https://mdx-workshop.github.io/proceedings/hanssian.pdf

I noticed this wasn't aligning with published performance of UMX/X-UMX (in the X-UMX paper):

Table 1: Comparison of X-UMX with other public methods in terms of SDR (“median of frames, median of tracks”). Avg. UMX [16] 5.41 X-UMX (proposed) 5.79 https://arxiv.org/pdf/2010.04228.pdf

UMX is supposed to be closer to X-UMX, not almost 1 dB SDR apart.

The way I generate my old (seemingly incorrect) values of 4.64 dB for UMX and 5.54 dB for X-UMX was using the standard sigsep tools to loop over MUSDB18-HQ (50 test tracks), store the evaluation json file, and then aggregate using the aggregation + boxplot scripts:

https://gitlab.com/sevagh/xumx_slicq_extra/-/blob/main/mss_evaluation/mss-oracle-experiments/oracle_eval/trained_models.py#L152-178

Then I added this script (from the sisec 2018 repo):

https://gitlab.com/sevagh/xumx_slicq_extra/-/blob/main/mss_evaluation/mss-oracle-experiments/oracle_eval/compare_json_results.py

The X-UMX evaluation, using the same set of json files, gives me an SDR of 5.91 dB (higher than 5.54) because the compare_json_results.py gives me a different median (across targets + frames) than 5.54.

The UMX evaluation of the old release corresponding to the published paper (https://github.com/sigsep/open-unmix-pytorch/releases/tag/v1.0.0) got much better results, leading to a median SDR of 5.41 dB (exactly as the X-UMX publication).

The UMX evaluation of the latest tip of master gives me the median SDR significantly lower than 5.41 dB. I think that the new code of UMX has a performance regression.

faroit commented 2 years ago

@sevagh interesting, we have a regression test that did compare the SDR against vs. 1.0, so that at least on that one track it doesn't seem to have changed.

did you make sure that the separator settings are the same in your code?

faroit commented 2 years ago

@sevagh did you manage to figure out what the issue is?

sevagh commented 2 years ago

Forgot to follow up - I will try to replicate later today.

sevagh commented 2 years ago

The regression test uses the 7s MUSDB18 excerpt (which may not even be in the wav format). I wonder if that's related.

faroit commented 2 years ago

I think it's because you evaluated on HQ as opposed to the normal musdb18.

faroit commented 5 months ago

@sevagh i think we can close this. I had to lower the regression test tolerance in #144 :-/

sigsep / open-unmix-pytorch

SDR performance regression since version 1.0.0 #114