Open Morank88 opened 4 years ago
Yes, sure! Kindly let me know the exact issue while loading the model. It would be great if you could copy-paste the error here.
Great, thx.
So here is where my script failed:
convtasnet_audio_with_asr_model = DataParallel(ConvTasNet(C=2, test_with_asr=True)).cuda()
convtasnet_audio_without_asr_model = DataParallel(ConvTasNet(C=2, asr_addition=False)).cuda()
convtasnet_audio_without_asr_model.load_state_dict(torch.load(convtasnet_model)['model_state_dict'])
convtasnet_audio_with_asr_model.load_state_dict(torch.load(convtasnet_asr_model)['model_state_dict'])
The convtasnet_model is AudioOnlyConvTasNet.pth and the convtasnet_asr_model is ASR.pth, both downloaded from the links given in your GitHub repository.
The errors that I got are the following:
Traceback (most recent call last):
File "/opt/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3325, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "
When loading the asr_model with 'model' instead of 'model_state_dict':
convtasnet_audio_with_asr_model.load_state_dict(torch.load(convtasnet_asr_model)['model'])
then I get the following mismatches:
Traceback (most recent call last):
File "/opt/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3325, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "
I appreciate your help.
Thanks!
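(A quick way to diagnose load failures like the ones above is to inspect the checkpoint's top-level keys before calling load_state_dict. A minimal sketch, using a toy nn.Linear as a stand-in for ConvTasNet; with a real file you would just load it and print the keys the same way:)

```python
import torch
import torch.nn as nn

# Toy model standing in for ConvTasNet (hypothetical, for illustration only).
toy = nn.Linear(4, 2)
torch.save({"model_state_dict": toy.state_dict()}, "toy_ckpt.pth")

# Load on CPU and list the top-level keys: this tells you whether the
# weights live under 'model_state_dict', 'model', or something else.
ckpt = torch.load("toy_ckpt.pth", map_location="cpu")
keys = list(ckpt.keys())
print(keys)

# The parameter names inside the state dict are what load_state_dict
# matches against the model's own parameter names.
param_names = list(ckpt["model_state_dict"].keys())
print(param_names)
```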
Thanks for elaborating on the error! I guess you are trying to run test_real.py (/Oracle/test_real.py). There you need two models: one is the ConvTasNet model 'AudioOnlyConvTasNet.pth', and the other is the ConvTasNet-with-ASR model 'Oracle.pth', where features from Automatic Speech Recognition are given to ConvTasNet for better results and performance. You are trying to load ASR.pth, which is the Automatic Speech Recognition model from 'https://github.com/mayank-git-hub/ETE-Speech-Recognition'.
So try loading 'Oracle.pth' in place of 'ASR.pth', and keep the key as 'model_state_dict'.
Let me know if this solves the issue?
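(One other common source of state_dict key mismatches in scripts like this, which wrap everything in DataParallel: DataParallel prefixes every parameter name with 'module.'. If a checkpoint was saved from a bare model and loaded into a wrapped one, or vice versa, every key mismatches by exactly that prefix. A small helper, sketched here for illustration, can normalize the names:)

```python
# DataParallel prefixes every parameter name with 'module.'. If a
# checkpoint was saved from a bare model (or the reverse), stripping
# the prefix before load_state_dict fixes mismatches of that shape.
def strip_module_prefix(state_dict):
    return {k[len("module."):] if k.startswith("module.") else k: v
            for k, v in state_dict.items()}

# Tiny demo with dummy values in place of tensors:
sd = {"module.encoder.weight": 1, "decoder.bias": 2}
print(strip_module_prefix(sd))
```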
Hello @Morank88 , Is the error resolved? Let me know if you are still facing any issue here. Happy to help!
Hi,
It solves the issue of loading the checkpoint, but after running the test on my own mixture wav file I got bad results, basically noise... Here is my script:
convtasnet_model = r'/home/Projects/Speech_Enhancement/Improving_voice_separating/pretrained_models/AudioOnlyConvTasNet.pth'
convtasnet_asr_model = r'/home/Projects/Speech_Enhancement/Improving_voice_separating/pretrained_models/Oracle.pth'
mixture_file = r'/home/Projects/Speech_Enhancement/Improving_voice_separating/DB/my_mixture_8k.wav'

convtasnet_audio_with_asr_model = DataParallel(ConvTasNet(C=2, test_with_asr=True)).cuda()
convtasnet_audio_without_asr_model = DataParallel(ConvTasNet(C=2, asr_addition=False)).cuda()

convtasnet_audio_without_asr_model.load_state_dict(torch.load(convtasnet_model)['model_state_dict'])
convtasnet_audio_with_asr_model.load_state_dict(torch.load(convtasnet_asr_model)['model_state_dict'])

convtasnet_audio_without_asr_model.eval()
convtasnet_audio_with_asr_model.eval()

mixture = read(mixture_file)[1] / np.iinfo(np.int16).max
mixture = normalise(mixture).astype(np.float32)
mixture = torch.from_numpy(mixture).cuda()
mixture = mixture.unsqueeze(0)
mixture = mixture[:, 48000:72000]

separated_initial = convtasnet_audio_without_asr_model(mixture)
separated = convtasnet_audio_with_asr_model(mixture, separated_initial)

write('./Results/estimate_init_0.wav', fs, (separated_initial[0, 0, :].data.cpu().numpy() * np.iinfo(np.int16).max).astype(np.int16))
write('./Results/estimate_init_1.wav', fs, (separated_initial[0, 1, :].data.cpu().numpy() * np.iinfo(np.int16).max).astype(np.int16))
write('./Results/estimate_0.wav', fs, (separated[0, 0, :].data.cpu().numpy() * np.iinfo(np.int16).max).astype(np.int16))
write('./Results/estimate_1.wav', fs, (separated[0, 1, :].data.cpu().numpy() * np.iinfo(np.int16).max).astype(np.int16))
I have resampled the wav file to 8 kHz. Is the separation limited to wavs of only 3 seconds (24000 samples)? What I saw is that the est_mask after the initial separation gets very large values...
Maybe my pre-processing is different?
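(For reference, resampling to the 8 kHz rate can be done with polyphase filtering; a minimal sketch using a synthetic tone in place of the mixture file, since the actual wav is not attached here. In practice you would read the file with scipy.io.wavfile first:)

```python
import numpy as np
from scipy.signal import resample_poly

# Synthetic 1-second, 440 Hz tone at 16 kHz standing in for the mixture
# file (hypothetical input; read a real wav with scipy.io.wavfile.read).
fs_in = 16000
t = np.arange(fs_in) / fs_in
data = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# Polyphase resampling down to the 8 kHz rate the pretrained models expect.
resampled = resample_poly(data, 8000, fs_in)
print(resampled.shape)
```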
Hello @Morank88, yes, the code takes a 3-second sample from the audio and processes that. The estimated mask is generated for 2 speakers and hence has a dimension of [batch size, number of speakers (2), 24000]. In data preprocessing I am joining two audios of individual speakers to make a mixture, and the target is the concatenation of the individual files. The code can run for audio longer than 3 seconds, but some hard-coded values need to be changed: ASR features are extracted and given along with the mixture to the separator block, which is why DomainTranslation (domainTranslation.py) returns features of dimension [M, N, 2399]. You can change that value according to the mixture (i.e. audio) length, and the code should work fine for different audio lengths too. If you are facing any error, can you upload your audio files along with the separated audio files?
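(The 2399 above looks like a sliding-window frame count over the 24000-sample segment. The actual window and hop used inside domainTranslation.py are not stated in this thread, so the values below are assumptions chosen only to reproduce the hard-coded number; the generic formula is what matters for other audio lengths:)

```python
# Generic sliding-window frame count. The window/hop values used by
# domainTranslation.py are not shown in the thread; win=20, hop=10 are
# hypothetical values that happen to reproduce the hard-coded 2399.
def num_frames(num_samples, win, hop):
    return (num_samples - win) // hop + 1

# 24000 samples (3 s at 8 kHz) with a window of 20 and hop of 10:
print(num_frames(24000, 20, 10))
```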
Hi @pragyak412,
Ok, I understand. What should the dimension of the ASR features be for an arbitrary audio length? BTW, it seems that my pre-processing follows what is required.
Regarding my audio file, it is real simultaneous speech of two speakers, so ground-truth separated audio files are not available.
Hi,
I have tried to load the published pre-trained models, but I got mismatches between the model definitions (with/without ASR) and the checkpoint files. Can you assist?
Thanks!