pfnet-research / meta-tasnet

A PyTorch implementation of Meta-TasNet from "Meta-learning Extractors for Music Source Separation"
MIT License

Incorrect output shape #3

Closed danielkorg closed 4 years ago

danielkorg commented 4 years ago

If the input is sampled at 44.1 kHz, the separated outputs come back at 32 kHz; they should match the original 44.1 kHz input. And the problem is not just the sampling rate itself: the output stems are shorter than the input mixture by exactly the ratio of the two sampling frequencies. In other words, 44100/32000 equals mixtureLengthInSeconds/anyOutputStemLengthInSeconds. The model should not truncate the output like that; it should produce stems of the same length that sum to the original mixture. This is the script I used to test, taken from your Colab demo (a numeric illustration of the length mismatch follows the script):

```python
import torch
from model.tasnet import MultiTasNet
import soundfile
import librosa
import numpy as np

state = torch.load("best_model.pt")  # load checkpoint

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # optionally use the GPU
device = torch.device("cpu")  # use only the CPU, if the GPU gives problems

network = MultiTasNet(state["args"]).to(device)  # initialize the model
network.load_state_dict(state['state_dict'])  # load weights from the checkpoint


def separate_sample(audio, rate: int):

    def resample(audio, target_rate):
        return librosa.core.resample(audio, rate, target_rate, res_type='kaiser_best', fix=False)

    audio = audio.astype('float32')  # match the type of the weights in the network
    mix = [resample(audio, s) for s in [8000, 16000, 32000]]  # resample to the sampling rates of the three stages
    mix = [librosa.util.fix_length(m, (mix[0].shape[-1] + 1) * (2**i)) for i, m in enumerate(mix)]  # align all three samples so that their lengths are divisible
    mix = [torch.from_numpy(s).float().to(device).unsqueeze_(1) for s in mix]  # cast to tensors with shape [1, 1, T']
    mix = [s / s.std(dim=-1, keepdim=True) for s in mix]  # normalize by the standard deviation

    network.eval()
    with torch.no_grad():
        separation = network.inference(mix, n_chunks=2)[-1]  # call the network to obtain the separated audio with shape [1, 4, 1, T']

    # normalize the amplitudes by computing the least squares
    # -> we try to scale the separated stems so that their sum is equal to the input mix
    print(separation.shape)
    a = separation[0, :, 0, :].cpu().numpy().T  # separated stems
    print(a.shape)
    b = mix[-1][0, 0, :].cpu().numpy()  # input mix
    print(b.shape)
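    # why least squares: resampling and normalization can change the overall gain, so we
    # solve min over x of ||a @ x - b||^2 for one scalar gain per stem (a has shape
    # [T', 4], b has shape [T']); scaling by these gains makes the stems sum as closely
    # as possible to the input mix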
    sol = np.linalg.lstsq(a, b, rcond=None)[0]  # scaling coefficients that minimize the MSE
    print(sol.shape)
    separation = a * sol  # scale the separated stems
    print(separation.shape)

    estimates = {
        'drums': separation[:, 0:1],
        'bass': separation[:, 1:2],
        'other': separation[:, 2:3],
        'vocals': separation[:, 3:4],
    }

    return estimates


audio, rate = soundfile.read("test.wav")
audio = librosa.core.to_mono(audio.transpose())
print(audio.shape, rate)

audio = np.expand_dims(audio, 0)

print()
print("separating... ", end='')
estimates = separate_sample(audio, rate)
print("done")

print("saving audio files to folder...")

drums = estimates["drums"]
print(drums.shape)
bass = estimates["bass"]
print(bass.shape)
other = estimates["other"]
print(other.shape)
vocals = estimates["vocals"]
print(vocals.shape)

soundfile.write("test_drums.wav", drums, rate)
soundfile.write("test_bass.wav", bass, rate)
soundfile.write("test_other.wav", other, rate)
soundfile.write("test_vocals.wav", vocals, rate)
```

davda54 commented 4 years ago

Hi! You are correct, the model works at a 32 kHz sampling rate. If you need the output to be at 44.1 kHz, you have to resample it -- please see the evaluate.py script, where we do exactly that (lines 59-64). Let me know if that solves your problem :relaxed:
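For reference, a minimal sketch of that resampling step, assuming the `estimates` dict and `rate` from the demo script above (the authoritative version is in evaluate.py):

```python
import librosa
import soundfile

# upsample each 32 kHz stem back to the original rate before saving
for name, stem in estimates.items():  # stem has shape [T', 1] at 32 kHz
    stem_44k = librosa.core.resample(stem[:, 0], 32000, rate, res_type='kaiser_best')
    soundfile.write(f"test_{name}.wav", stem_44k, rate)
```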

iamnoob1 commented 4 years ago

Fix it for all of us, please!

danielkorg commented 4 years ago

The evaluate.py script seems to work with stereo files, which is fine, but because of that it uses different indexing and transposes than the script from the Colab demo. Can you show us how you would use that stereo evaluate.py approach to separate a 44.1 kHz input file and output 4 stems with the same length and sampling rate as the input, based on your own demo script? It's easier to match your method than for us to go back and forth on what is going on. Thank you!

davda54 commented 4 years ago

Hi, I'm sorry it took me so long... Here's sample code that separates a stereo file and resamples the stems back to the original sampling rate :) https://gist.github.com/davda54/aa555c011866392c32c4906f8a709682
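In outline, one plausible shape of such code, assuming the `separate_sample` function from the first comment is in scope (a sketch only; the gist above is the authoritative version):

```python
import numpy as np
import librosa

def separate_stereo(audio, rate):
    """Sketch: run the mono pipeline on each channel, then resample the stems back to `rate`."""
    stems_per_channel = []
    for channel in audio:  # audio has shape [2, T]
        stems_per_channel.append(separate_sample(np.expand_dims(channel, 0), rate))

    estimates = {}
    for name in ('drums', 'bass', 'other', 'vocals'):
        resampled = [librosa.core.resample(est[name][:, 0], 32000, rate, res_type='kaiser_best')
                     for est in stems_per_channel]
        estimates[name] = np.stack(resampled, axis=-1)  # [T, 2], ready for soundfile.write
    return estimates
```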

danielkorg commented 4 years ago

Everything works perfectly now! Thank you very much! :)

iamnoob1 commented 4 years ago

I can't separate my files, could you please post a tutorial?