sigsep / open-unmix-pytorch

Open-Unmix - Music Source Separation for PyTorch
https://sigsep.github.io/open-unmix/
MIT License

[Question] Ideal/oracle performance of source estimate + mix phase #83

Open sevagh opened 3 years ago

sevagh commented 3 years ago

Hello, I've been interested in running various oracle benchmark methods to check if different types of spectrogram (CQT, etc.) can be useful for source separation. Initially, I was working with the IRM1/2 and IBM1/2 from https://github.com/sigsep/sigsep-mus-oracle

However I noticed that Open-Unmix uses the strategy of "estimate of source magnitude + phase of original mix" (but it has an option to use soft masking instead). Is it valuable to create an "oracle phase-inversion" method?

The soft-mask (IRM1) performance "ceiling", i.e. the known IRM1 oracle mask calculation, looks like this (using the vocals stem as an example):

mix = <load mix>                          # mixed track
vocals_gt = <load vocals stem>   # ground truth

vocals_irm1 = abs(stft(vocals_gt)) / abs(stft(mix))

vocals_est = istft(vocals_irm1 * stft(mix)) # estimate after "round trip" through soft mask
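A minimal runnable sketch of this soft-mask round trip, assuming SciPy's `stft`/`istft` and synthetic signals standing in for real stems (the signals and the helper names `spec` and `irm1_estimate` are illustrative, not taken from sigsep-mus-oracle). The sketch normalizes by the sum of the source magnitudes plus a small epsilon, which is how the linked sigsep-mus-oracle IRM code appears to build its mask; that is an assumption here, and it differs from dividing by abs(stft(mix)):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 44100
rng = np.random.default_rng(0)
t = np.arange(fs) / fs

# synthetic stand-ins for real stems (assumption: not MUSDB audio)
vocals_gt = 0.5 * np.sin(2 * np.pi * 440.0 * t)
other_gt = 0.1 * rng.standard_normal(fs)
mix = vocals_gt + other_gt

nperseg, hop = 4096, 1024
noverlap = nperseg - hop  # SciPy expects the overlap length, not the hop


def spec(x):
    """Complex STFT with the settings above."""
    return stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)[-1]


def irm1_estimate(target, sources, mixture):
    """Soft-mask (alpha = 1) estimate of `target` from `mixture`."""
    eps = np.finfo(float).eps
    total = eps + sum(np.abs(spec(s)) for s in sources)
    mask = np.abs(spec(target)) / total
    est = istft(mask * spec(mixture), fs=fs, nperseg=nperseg, noverlap=noverlap)[1]
    return est[: len(mixture)]  # trim the STFT padding


vocals_est = irm1_estimate(vocals_gt, [vocals_gt, other_gt], mix)
```

The trimming step mirrors the need to cut the inverse STFT back to the original number of samples before scoring.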

Now, for the phase inversion method, we could do the following:

mix = <load mix>                          # mixed track
vocals_gt = <load vocals stem>   # ground truth

mix_phase = phase(stft(mix))
vocals_gt_magnitude = abs(stft(vocals_gt))

vocals_stft = pol2cart(vocals_gt_magnitude, mix_phase)

vocals_est = istft(vocals_stft)  # estimate after "round trip" through phase inversion
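A minimal runnable sketch of the mix-phase round trip, again assuming SciPy's `stft`/`istft` and synthetic stand-in signals (all variable names are illustrative). It also checks one detail: a mask written literally as abs(stft(vocals_gt)) / abs(stft(mix)) and applied to stft(mix) produces the same spectrogram as "source magnitude + mix phase". The two methods only differ when the mask is normalized by the sum of the source magnitudes, which is what the linked IRM code appears to do:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 44100
rng = np.random.default_rng(0)
t = np.arange(fs) / fs

# synthetic stand-ins for real stems (assumption: not MUSDB audio)
vocals_gt = 0.5 * np.sin(2 * np.pi * 440.0 * t)
other_gt = 0.1 * rng.standard_normal(fs)
mix = vocals_gt + other_gt

nperseg, hop = 4096, 1024
kw = dict(fs=fs, nperseg=nperseg, noverlap=nperseg - hop)

X = stft(mix, **kw)[-1]        # complex STFT of the mix
S = stft(vocals_gt, **kw)[-1]  # complex STFT of the ground-truth source

# polar to cartesian: ground-truth magnitude, mix phase
Y_mixphase = np.abs(S) * np.exp(1j * np.angle(X))
vocals_est = istft(Y_mixphase, **kw)[1][: len(mix)]

# mask |S| / |X| applied to X: same spectrogram up to the epsilon guard
eps = np.finfo(float).eps
Y_ratio = (np.abs(S) / (np.abs(X) + eps)) * X
print(np.max(np.abs(Y_mixphase - Y_ratio)))
```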

Does this make sense to do? Has anybody done this before? What could this method be called?

sevagh commented 3 years ago

OK, it seems to be working. Here's a piece of code, hacked together from https://github.com/sigsep/sigsep-mus-oracle/blob/master/IRM.py and unmix:

import museval
import numpy as np
from scipy.signal import stft, istft

def atan2(y, x):
    r"""Element-wise, quadrant-aware arctangent of y/x.
    adapted from umx; np.arctan2 handles the x == 0 and quadrant cases
    without modifying x in place or dividing by zero
    """
    return np.arctan2(y, x)

def ideal_mixphase(track, eval_dir=None):
    """
    ideal performance of magnitude from estimated source + phase of mix
    which is the default umx strategy for separation
    """

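    # mix STFT; note that SciPy's noverlap is the overlap length, so
    # nperseg=4096 with noverlap=1024 gives a hop of 4096 - 1024 = 3072
    X = stft(track.audio.T, nperseg=4096, noverlap=1024)[-1].astype(np.complex64)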

    (I, F, T) = X.shape

    # Compute sources spectrograms
    P = {}
    # compute model as the sum of spectrograms,
    # seeded with a tiny epsilon so it is never exactly zero
    model = np.finfo(float).eps

    # parallelize this
    for name, source in track.sources.items():
        # compute spectrogram of target source:
        # magnitude of STFT
        src_coef = stft(source.audio.T, nperseg=4096, noverlap=1024)[-1].astype(np.complex64)

        P[name] = np.abs(src_coef)

        # store the original, not magnitude, in the mix
        model += src_coef

    # now performs separation
    estimates = {}
    accompaniment_source = 0
    for name, source in track.sources.items():
        source_mag = P[name]

        # get mix phase/angle
        mix_phase = atan2(model.imag, model.real)

        # use source magnitude estimate + mix phase
        Yj = np.multiply(source_mag, np.cos(mix_phase)) + 1j*np.multiply(source_mag, np.sin(mix_phase))

        # invert to time domain and trim the STFT padding back to the mix length
        target_estimate = istft(Yj, nperseg=4096, noverlap=1024)[1].T[:track.audio.shape[0], :].astype(np.float32)

        # set this as the source estimate
        estimates[name] = target_estimate

        # accumulate to the accompaniment if this is not vocals
        if name != 'vocals':
            accompaniment_source += target_estimate

    estimates['accompaniment'] = accompaniment_source

    bss_scores = museval.eval_mus_track(
        track,
        estimates,
        output_dir=eval_dir,
    ).scores

    return estimates, bss_scores

The maximum SDR of the "oracle mix phase" is lower than that of soft masking. Is that expected?
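One way to probe whether this is expected is a toy model of a single time-frequency bin with two sources of magnitudes `a` and `b` and relative phase `phi` (the function and variable names are hypothetical). In this sketch the IRM1-style estimate is closer to the true source when the two magnitudes are equal, while the mix-phase estimate is closer when source 1 dominates, so neither oracle wins in every bin. It is only a sketch under these assumptions and says nothing definitive about full MUSDB18 tracks:

```python
import numpy as np


def bin_errors(a, b, phi):
    """Error magnitudes for source 1 in one TF bin with two sources.

    Returns (mix_phase_error, irm1_error).
    """
    s1 = a + 0j                  # source 1, phase 0
    s2 = b * np.exp(1j * phi)    # source 2, relative phase phi
    x = s1 + s2                  # mixture coefficient

    # source magnitude + mix phase
    mix_phase_est = abs(s1) * np.exp(1j * np.angle(x))
    # soft mask normalized by the sum of source magnitudes
    irm1_est = abs(s1) / (abs(s1) + abs(s2)) * x

    return abs(s1 - mix_phase_est), abs(s1 - irm1_est)


equal = bin_errors(1.0, 1.0, np.pi / 2)      # equally loud sources
dominant = bin_errors(1.0, 0.1, np.pi / 2)   # source 1 dominates
print("equal magnitudes:", equal)
print("source 1 dominant:", dominant)
```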

aliutkus commented 3 years ago

It's a very interesting idea, I like it.

Could you provide numbers? How does it behave compared to the other oracles?

sevagh commented 3 years ago

It's pretty underwhelming. Here is an evaluation of 4 tracks from the MUSDB18-HQ test set, with IRM1, IRM2, IBM1, IBM2, and the new one, "MPI" (Mixed Phase Inversion), with the Open-Unmix STFT settings (window = 4096, hop = 1024):

[image: evaluation results for the oracles listed above]

sevagh commented 3 years ago

Open-Unmix is not the first place I've seen the combination of source magnitude estimate + mix phase inversion. It's also used in the CDAE source separation algorithm (https://arxiv.org/abs/1703.08019), but I'm still curious why it is preferred to soft masking.

I will upload the code that generates the above results (it mostly just wraps sigsep tools) to a separate, cleanly reproducible repo so I can link it here. I might be doing something wrong in my code somewhere.

sevagh commented 3 years ago

Here: https://github.com/sevagh/mss-oracle-experiments#oracle-performance-of-mpi-mix-phase-inversion

Apologies if there is a lot of irrelevant code (related to the NSGT), but I hope the specific part of the new "Mixed Phase Inversion" oracle makes sense and is reproducible.

sevagh commented 3 years ago

Also, I suppose SDR is not necessarily the king of metrics - we can see dramatically better ISR on the mix-phase (but that could be a consequence of its reduced separation/SDR/SIR/SAR).

Also maybe mix-phase is more "robust" to worse estimates?