pragyak412 / Improving-Voice-Separation-by-Incorporating-End-To-End-Speech-Recognition

Implementing the paper -

problem with loading pre-trained models #1

Open Morank88 opened 4 years ago

Morank88 commented 4 years ago

Hi,

I have tried to load the published pre-trained models, but I got mismatches between the model definitions (with and without ASR) and the checkpoint files. Can you assist?

Thanks!

pragyak412 commented 4 years ago

Yes, sure! Kindly let me know the exact issue you hit while loading the model. It would be great if you could copy-paste the error here.

Morank88 commented 4 years ago

Great, thx.

So here is where my script failed:

```python
convtasnet_audio_with_asr_model = DataParallel(ConvTasNet(C=2, test_with_asr=True)).cuda()
convtasnet_audio_without_asr_model = DataParallel(ConvTasNet(C=2, asr_addition=False)).cuda()

convtasnet_audio_without_asr_model.load_state_dict(torch.load(convtasnet_model)['model_state_dict'])
convtasnet_audio_with_asr_model.load_state_dict(torch.load(convtasnet_asr_model)['model_state_dict'])
```

The convtasnet_model is AudioOnlyConvTasNet.pth and the convtasnet_asr_model is ASR.pth, both downloaded from the links given in your GitHub repository.

The errors that I got are the following:

```
Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3325, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in <module>
    convtasnet_audio_with_asr_model.load_state_dict(torch.load(convtasnet_asr_model)['model_state_dict'])
KeyError: 'model_state_dict'
```

When loading the asr_model with 'model' instead of 'model_state_dict':

```python
convtasnet_audio_with_asr_model.load_state_dict(torch.load(convtasnet_asr_model)['model'])
```

then I get the following mismatches:

```
Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3325, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in <module>
    convtasnet_audio_with_asr_model.load_state_dict(torch.load(convtasnet_asr_model)['model'])
  File "/opt/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DataParallel:
	Missing key(s) in state_dict: "module.encoder.conv1d_U.weight", "module.asr.module.encoder.embed.conv.0.weight", "module.asr.module.encoder.embed.conv.0.bias", "module.asr.module.encoder.embed.conv.2.weight", "module.asr.module.encoder.embed.conv.2.bias", "module.asr.module.encoder.embed.out.0.weight", "module.asr.module.encoder.embed.out.0.bias", "module.asr.module.encoder.embed.out.1.pe", "module.asr.module.encoder.encoders.0.self_attn.linear_q.weight", "module.asr.module.encoder.encoders.0.self_attn.linear_q.bias", "module.asr.module.encoder.encoders.0.self_attn.linear_k.weight" ... and so on...
```

I appreciate your help.

Thanks!
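A quick way to see which top-level keys each checkpoint actually stores before calling load_state_dict (a minimal sketch, reusing the `convtasnet_model` and `convtasnet_asr_model` paths from the snippet above):

```python
# Minimal sketch: print the top-level keys of each checkpoint to see
# whether the weights live under 'model_state_dict', 'model', or elsewhere.
import torch

for path in [convtasnet_model, convtasnet_asr_model]:
    checkpoint = torch.load(path, map_location='cpu')
    print(path, '->', list(checkpoint.keys()))
```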

pragyak412 commented 4 years ago

Thanks for elaborating on the error! I guess you are trying to run test_real.py (/Oracle/test_real.py). There you need two models: one is the ConvTasNet model 'AudioOnlyConvTasNet.pth', and the other is the ConvTasNet-with-ASR model 'Oracle.pth', where the features of Automatic Speech Recognition are given to ConvTasNet for better results and performance. You are trying to load ASR.pth, which is the Automatic Speech Recognition model from https://github.com/mayank-git-hub/ETE-Speech-Recognition.

So try to load 'Oracle.pth' in place of 'ASR.pth', and let the key be 'model_state_dict'.
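In other words, a minimal sketch of the suggested swap (the path below is a placeholder for wherever Oracle.pth was downloaded):

```python
# Load Oracle.pth (ConvTasNet + ASR features) for the ASR-assisted model;
# ASR.pth is the standalone speech recognizer and has different keys.
convtasnet_asr_model = '/path/to/pretrained_models/Oracle.pth'  # placeholder path

convtasnet_audio_with_asr_model = DataParallel(ConvTasNet(C=2, test_with_asr=True)).cuda()
convtasnet_audio_with_asr_model.load_state_dict(
    torch.load(convtasnet_asr_model)['model_state_dict'])  # key stays 'model_state_dict'
```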

Let me know if this solves the issue.

pragyak412 commented 4 years ago

Hello @Morank88, is the error resolved? Let me know if you are still facing any issues here. Happy to help!

Morank88 commented 4 years ago

Hi,

It solves the issue of loading the checkpoint, but after running the test on my own mixture wav file I got bad results, basically noise... Here is my script:

```python
# Imports implied by the snippet; ConvTasNet and normalise come from this repository.
import numpy as np
import torch
from torch.nn import DataParallel
from scipy.io.wavfile import read, write

fs = 8000  # sample rate; the mixture was resampled to 8 kHz (see below)

convtasnet_model = r'/home/Projects/Speech_Enhancement/Improving_voice_separating/pretrained_models/AudioOnlyConvTasNet.pth'
convtasnet_asr_model = r'/home/Projects/Speech_Enhancement/Improving_voice_separating/pretrained_models/Oracle.pth'

mixture_file = r'/home/Projects/Speech_Enhancement/Improving_voice_separating/DB/my_mixture_8k.wav'

convtasnet_audio_with_asr_model = DataParallel(ConvTasNet(C=2, test_with_asr=True)).cuda()
convtasnet_audio_without_asr_model = DataParallel(ConvTasNet(C=2, asr_addition=False)).cuda()

convtasnet_audio_without_asr_model.load_state_dict(torch.load(convtasnet_model)['model_state_dict'])
convtasnet_audio_with_asr_model.load_state_dict(torch.load(convtasnet_asr_model)['model_state_dict'])

convtasnet_audio_without_asr_model.eval()
convtasnet_audio_with_asr_model.eval()

# Scale int16 samples to [-1, 1], normalise, move to GPU, add batch dim.
mixture = read(mixture_file)[1] / np.iinfo(np.int16).max
mixture = normalise(mixture).astype(np.float32)
mixture = torch.from_numpy(mixture).cuda()
mixture = mixture.unsqueeze(0)

# Take a 3-second window (24000 samples at 8 kHz).
mixture = mixture[:, 48000:72000]

separated_initial = convtasnet_audio_without_asr_model(mixture)
separated = convtasnet_audio_with_asr_model(mixture, separated_initial)

write('./Results/estimate_init_0.wav', fs, (separated_initial[0, 0, :].data.cpu().numpy() * np.iinfo(np.int16).max).astype(np.int16))
write('./Results/estimate_init_1.wav', fs, (separated_initial[0, 1, :].data.cpu().numpy() * np.iinfo(np.int16).max).astype(np.int16))
write('./Results/estimate_0.wav', fs, (separated[0, 0, :].data.cpu().numpy() * np.iinfo(np.int16).max).astype(np.int16))
write('./Results/estimate_1.wav', fs, (separated[0, 1, :].data.cpu().numpy() * np.iinfo(np.int16).max).astype(np.int16))
```

I have resampled the wav file to 8 kHz. Is the separation limited to wavs of only 3 seconds (24000 samples)? What I saw is that the est_mask after the initial separation gets very large values...

Morank88 commented 4 years ago

Maybe my pre-processing is different?

pragyak412 commented 4 years ago

Hello @Morank88, yes, the code takes a 3-second sample from the audio and processes that; the estimated mask is generated for 2 speakers and hence has a dimension of [batch size, number of speakers (2), 24000]. In data preprocessing I join two audios of individual speakers to make a mixture, and the target is the concatenation of the individual files. The code can run on audio longer than 3 seconds, but some hard-coded values need to be changed: the ASR features are extracted and given along with the mixture to the separator block, which is why DomainTranslation (domainTranslation.py) returns features of dimension [M, N, 2399]. You can change that value according to the mixture, i.e. the audio length, and the code should work fine for different audio lengths too. If you are facing any error, can you upload your audio files along with the separated audio files?
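For reference, a hedged back-of-the-envelope for choosing the replacement value. The kernel and stride below are assumptions, picked only because they reproduce the 2399 frames quoted above for a 24000-sample input; the real constants live in the repo's encoder/domainTranslation code:

```python
# Assumed relation between audio length and ASR feature frames:
#   frames = (num_samples - kernel) // stride + 1
# kernel=20, stride=10 are guesses that yield 2399 for 24000 samples.
def feature_frames(num_samples, kernel=20, stride=10):
    return (num_samples - kernel) // stride + 1

print(feature_frames(24000))     # 2399, matching the hard-coded value above
print(feature_frames(5 * 8000))  # candidate value for a 5-second clip at 8 kHz
```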

Morank88 commented 4 years ago

Hi @pragyak412,

Ok, I understand. How should the ASR feature dimension be converted for an arbitrary audio length? (A sketch of a chunk-wise workaround follows at the end of this thread.) BTW, it seems that my pre-processing follows what is required.

Regarding my audio file, it is real simultaneous speech of two speakers, so ground-truth separated audio files are not available.
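One possible workaround, purely a sketch rather than the repo's method: keep the hard-coded 3-second setting and run the models window by window, concatenating the outputs.

```python
import torch

def separate_long(mixture, model_plain, model_asr, win=24000):
    """Chunk-wise separation sketch: process consecutive 3 s (24000-sample)
    windows with the fixed-length models and stitch the results together."""
    outs = []
    for start in range(0, mixture.shape[1] - win + 1, win):
        chunk = mixture[:, start:start + win]
        initial = model_plain(chunk)            # audio-only pass
        outs.append(model_asr(chunk, initial))  # ASR-assisted refinement
    return torch.cat(outs, dim=2)               # [batch, 2, total samples]
```

Hard window boundaries can leave audible seams; an overlap-add scheme would be smoother, but this keeps the hard-coded ASR feature size untouched.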