pfnet-research / meta-tasnet

A PyTorch implementation of Meta-TasNet from "Meta-learning Extractors for Music Source Separation"

Support for stereo signals during separation (throughout the process) #2

Open FSharpCSharp opened 4 years ago

FSharpCSharp commented 4 years ago

Since I came across this model rather by accident and have now skimmed through it, I noticed that it apparently only separates mono signals so far. From a technical point of view, mono is of course a much bigger challenge than a stereo signal, because all the spatial information is lost.

So I asked myself whether it is possible to use this additional information. Would it be very difficult to extend the existing model to do so?

I find the approaches very exciting, because I have already got very good results from other models like Demucs, which also works in the waveform domain.

But this model seems to be much lighter than Demucs, for example, and the results seem to be very comparable.

The model would be much better if it could handle stereo signals. Then it would hardly be inferior to Demucs.

davda54 commented 4 years ago

Hi, thank you for your interest! The network could be extended to directly handle a stereo signal by changing the number of input/output channels from 1 to 2 in the encoder and decoder, respectively.
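
For illustration, here is a rough sketch of what that change amounts to for TasNet-style encoder/decoder convolutions (the layer names and shapes below are illustrative assumptions, not the actual modules in this repository):

    import torch
    import torch.nn as nn

    N, L = 256, 20  # illustrative values: number of basis filters, filter length

    # mono version: 1 input channel in the encoder, 1 output channel in the decoder
    encoder = nn.Conv1d(in_channels=1, out_channels=N, kernel_size=L, stride=L // 2)
    decoder = nn.ConvTranspose1d(in_channels=N, out_channels=1, kernel_size=L, stride=L // 2)

    # stereo version: widen both ends of the network to 2 channels
    encoder = nn.Conv1d(in_channels=2, out_channels=N, kernel_size=L, stride=L // 2)
    decoder = nn.ConvTranspose1d(in_channels=N, out_channels=2, kernel_size=L, stride=L // 2)

    x = torch.randn(1, 2, 32000)      # one second of stereo audio at 32 kHz
    print(decoder(encoder(x)).shape)  # torch.Size([1, 2, 32000])

Note that the pretrained weights expect mono input, so a change like this would require retraining.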

RadioAngurem commented 4 years ago

Hi, like FSharpCSharp I have been testing Demucs and the Conv-TasNet model implemented by the Demucs group for a while, and I'm very impressed by the results of Meta-TasNet and how fast the model computes the separated stems.

How could I change the input/output channels in the code to run stereo separation in the Google Colab notebook? Thank you.

davda54 commented 4 years ago

Hi, thanks! Please take a look at evaluate.py where we separate both channels independently. Let me know if that solves your problem :relaxed:
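
For a rough idea, the per-channel approach amounts to something like the following sketch (not the exact evaluate.py code; `network` and `mix` follow the conventions used elsewhere in this thread, where each element of `mix` has shape [channels, 1, T]):

    import torch

    # run the mono network once per channel, then re-pair the outputs
    network.eval()
    with torch.no_grad():
        left = network.inference([s[0:1] for s in mix], n_chunks=2)[-1]   # [1, 4, 1, T']
        right = network.inference([s[1:2] for s in mix], n_chunks=2)[-1]  # [1, 4, 1, T']

    # stack the two mono results into stereo stems, shape [4, T', 2]
    stems = torch.stack([left[0, :, 0, :], right[0, :, 0, :]], dim=-1)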

RadioAngurem commented 4 years ago

Thank you!! It took me "a little while" to understand the code, but now I can separate stereo audio in the Colab notebook.

danielkorg commented 4 years ago

> Thank you!! It took me "a little while" to understand the code, but now I can separate stereo audio in the Colab notebook.

I'm curious, how did you make it work? I am trying to make it work with 44.1 kHz stereo input and matching output, i.e. 4 stems, all stereo at 44.1 kHz.

RadioAngurem commented 4 years ago

Hi, the code works with 1.5-minute stereo songs, and the output is four stereo stems at a 32,000 Hz sample rate.

I appended these lines after the resampling step in evaluate.py:

mix_left = [s[0:1, :, :] for s in mix]   # left channel of each resampled mix
mix_right = [s[1:2, :, :] for s in mix]  # right channel
del mix

Then I duplicated the code for the left and right channels:

network.eval()
with torch.no_grad():
    # call the network per channel; each result has shape [1, 4, 1, T']
    separationL = network.inference(mix_left, n_chunks=2)[-1]
    separationR = network.inference(mix_right, n_chunks=2)[-1]

# normalize the amplitudes by computing the least squares
# -> we try to scale the separated stems so that their sum equals the input mix
aL = separationL[0, :, 0, :].cpu().numpy().T  # separated stems, shape (T', 4)
aR = separationR[0, :, 0, :].cpu().numpy().T

bL = mix_left[-1][0, 0, :].cpu().numpy()  # input mix, shape (T',)
bR = mix_right[-1][0, 0, :].cpu().numpy()

solL = np.linalg.lstsq(aL, bL, rcond=None)[0]  # scaling coefficients that minimize the MSE
solR = np.linalg.lstsq(aR, bR, rcond=None)[0]

separationL = aL * solL  # scale the separated stems
separationR = aR * solR

Finally, concatenate the left and right stems:

separation = np.concatenate((separationL, separationR), axis=1)  # shape (T', 8)
print(separation.shape)

estimates = {
    'drums':  separation[:, [0, 4]],
    'bass':   separation[:, [1, 5]],
    'other':  separation[:, [2, 6]],
    'vocals': separation[:, [3, 7]],
}

davda54 commented 4 years ago

Hey, I'm happy that it finally works for you :) Here's my gist for separating a stereo signal and resampling it back to the original sampling rate: https://gist.github.com/davda54/aa555c011866392c32c4906f8a709682

RadioAngurem commented 4 years ago

I have two questions after separating a bunch of tracks in both mono and stereo:

  1. The output stems are too loud. How can I deactivate the normalization of the audio stems? This normalization causes very hard clipping in the signal. It occurs in both stereo and mono separation, and on all the tracks that I have tried.

  2. As FSharpCSharp says in another issue, I have also noticed the signal is cut off by about 12 dB above 10 kHz.

Are these problems caused by some parameter in the estimation code?

davda54 commented 4 years ago

  1. This may be caused by the IPython.display player, which normalizes every audio clip for some reason (see the snippet below). Have you tried downloading the separated signal and playing it in a proper audio player?
  2. You're right; we're unsure what causes this phenomenon. It seems to be an internal property of the neural network. Please let me know if you find out more about it.
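
For what it's worth, recent IPython versions let you turn that normalization off explicitly; a minimal example, assuming your installed IPython supports the normalize argument and reusing the estimates dict from earlier in this thread:

    import IPython.display

    # with normalize=False the raw amplitudes are played back unchanged
    # (IPython raises an error if any sample falls outside [-1, 1])
    IPython.display.Audio(estimates['vocals'].T, rate=32000, normalize=False)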
RadioAngurem commented 4 years ago

  1. I actually deleted the IPython and YouTube lines from the code because I don't use them. This is the code that calls the separation and writes the output files:

    audio, rate = soundfile.read(filename)
    print("separating... ", end='')
    estimates = separate_sample(audio, rate)
    print("done")
    print("downloading audio files to the client side...")

    for instrument in ['vocals', 'drums', 'bass', 'other']:
        separation = estimates[instrument]
        print(separation.shape)
        soundfile.write('WM1_5' + instrument + '.wav', separation, 32000)

After that I downloaded the files and opened them in Reaper. I lined up the original wav file ("White Man's World" by Jason Isbell) with the Demucs vocal track and the Meta-TasNet vocal track. [waveform screenshot]

davda54 commented 4 years ago

I see, that doesn't look good – could you share the exact code and .wav file that you use? Feel free to send it to my email address david.samuel@seznam.cz

RadioAngurem commented 4 years ago

I ran the notebook again with your stereo separation code instead of my "frankenstein" version, and it works smoothly. This is your code:

    network.eval()
    with torch.no_grad():
        separation_left = network.inference(mix_left, n_chunks=8)[-1].cpu().squeeze_(2)  # shape: (5, T)
        separation_right = network.inference(mix_right, n_chunks=8)[-1].cpu().squeeze_(2)  # shape: (5, T)

        separation = torch.cat([separation_left, separation_right], 0).numpy()

    estimates = {
        'drums': librosa.core.resample(separation[:, 0, :], 32000, rate, res_type='kaiser_best', fix=True)[:, :audio.shape[1]].T,
        'bass': librosa.core.resample(separation[:, 1, :], 32000, rate, res_type='kaiser_best', fix=True)[:, :audio.shape[1]].T,
        'other': librosa.core.resample(separation[:, 2, :], 32000, rate, res_type='kaiser_best', fix=True)[:, :audio.shape[1]].T,
        'vocals': librosa.core.resample(separation[:, 3, :], 32000, rate, res_type='kaiser_best', fix=True)[:, :audio.shape[1]].T,
    }

    a_l = np.array([estimates['drums'][:, 0], estimates['bass'][:, 0], estimates['other'][:, 0], estimates['vocals'][:, 0]]).T
    a_r = np.array([estimates['drums'][:, 1], estimates['bass'][:, 1], estimates['other'][:, 1], estimates['vocals'][:, 1]]).T

    b_l = audio[0, :]
    b_r = audio[1, :]
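
    # Presumed continuation (a sketch based on the earlier mono least-squares
    # step; the actual gist may differ): rescale each channel's stems so that
    # their sum matches the corresponding channel of the input mix.
    sol_l = np.linalg.lstsq(a_l, b_l, rcond=None)[0]
    sol_r = np.linalg.lstsq(a_r, b_r, rcond=None)[0]
    for i, name in enumerate(['drums', 'bass', 'other', 'vocals']):
        estimates[name] = np.stack([estimates[name][:, 0] * sol_l[i],
                                    estimates[name][:, 1] * sol_r[i]], axis=1)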

and I was using the code:

network.eval()
with torch.no_grad():        
    separationL = network.inference(mix_left, n_chunks=8)[-1]
    # call the network to obtain the separated audio with shape [1, 4, 1, T']
    separationR = network.inference(mix_right, n_chunks=8)[-1]
    # note: the n_chunks parameter in the original code is n_chunks=2
# normalize the amplitudes by computing the least squares
# -> we try to scale the separated stems so that their sum is equal to the input mix 
aL = separationL[0,:,0,:].cpu().numpy().T  # separated stems
aR = separationR[0,:,0,:].cpu().numpy().T  # separated stems
bL = mix_left[-1][0,0,:].cpu().numpy()  # input mix
bR = mix_right[-1][0,0,:].cpu().numpy()  # input mix

Thank you for answering so quickly!

davda54 commented 4 years ago

Haha, I'm glad it helped :)