Gaps in the resulting audio file in the umxse model

sigsep / open-unmix-pytorch

Open-Unmix - Music Source Separation for PyTorch

https://sigsep.github.io/open-unmix/

MIT License


Gaps in the resulting audio file in the umxse model #127

Closed frixos25 closed 5 months ago

frixos25 commented 1 year ago

🐛 Bug

When I process an audio file with the umxse model, or with my own model trained for voice cleaning, the resulting file contains gaps of about 1/10 second in which every sample is set to -32768 (16-bit PCM WAV).
The gaps appear at seemingly random positions, roughly every 30-90 s. I couldn't find any connection between the positions of the gaps and the input speech recording. The output file written by the umxse model is named Speech.WAV. This error does not occur with the default music model umxl.
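One way to locate such gaps programmatically is to scan the output for runs of samples stuck at the int16 minimum. The following is only a sketch: the helper name `find_stuck_runs`, the 50 ms minimum run length, and the use of NumPy are assumptions made for illustration and are not part of Open-Unmix.

```python
import numpy as np

INT16_MIN = -32768  # value the gaps are filled with in a 16-bit PCM WAV


def find_stuck_runs(samples, rate, min_seconds=0.05, value=INT16_MIN):
    """Return (start_s, end_s) spans where every channel stays at `value`.

    `samples` is an integer array shaped (n_samples,) or (n_samples, n_channels).
    Hypothetical helper for illustration, not part of Open-Unmix.
    """
    samples = np.asarray(samples)
    if samples.ndim == 1:
        samples = samples[:, None]
    stuck = np.all(samples == value, axis=1)
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], stuck, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[0::2], edges[1::2]
    min_len = int(round(min_seconds * rate))
    return [
        (int(s) / rate, int(e) / rate)
        for s, e in zip(starts, ends)
        if e - s >= min_len
    ]


# Synthetic check: 1 s of silence at 16 kHz with a 0.1 s gap starting at 0.5 s.
rate = 16000
signal = np.zeros(rate, dtype=np.int16)
signal[8000:9600] = INT16_MIN
gaps = find_stuck_runs(signal, rate)
print(gaps)  # [(0.5, 0.6)]
```

For a real file, the samples could be loaded as integers first, for example with `sf.read(path, dtype="int16")` from the soundfile package, and passed to the helper together with the returned sample rate.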

Environment

PyTorch Version: 1.13.1
OS: Windows 11
torchaudio loader (y/n):y
Python version: 3.9.13
CUDA/cuDNN version: only CPU

faroit commented 1 year ago

Are you feeding in the right sample rate? Maybe this is a resampling problem... I can investigate

frixos25 commented 1 year ago

Yes, I do: 44100 Hz. My own model also uses this sample rate.

faroit commented 1 year ago

umxse requires 16 kHz. Depending on how you run the model, you may need to resample the audio yourself.
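If the resampling has to be done by hand, a polyphase resampler is one common option. The sketch below assumes SciPy is installed and uses `scipy.signal.resample_poly`; the helper names `resample_ratio` and `resample_to_model_rate` are made up for this example and are not part of Open-Unmix.

```python
from math import gcd


def resample_ratio(orig_rate, target_rate):
    """Return the integer (up, down) factors for polyphase resampling."""
    common = gcd(orig_rate, target_rate)
    return target_rate // common, orig_rate // common


def resample_to_model_rate(audio, orig_rate, target_rate=16000):
    """Resample `audio` shaped (n_samples,) or (n_samples, n_channels).

    Assumes SciPy is available; it is imported lazily so that the ratio
    helper above can be used without it.
    """
    if orig_rate == target_rate:
        return audio
    from scipy.signal import resample_poly

    up, down = resample_ratio(orig_rate, target_rate)
    # axis=0 resamples along time for (n_samples, n_channels) arrays.
    return resample_poly(audio, up, down, axis=0)


print(resample_ratio(44100, 16000))  # (160, 441)
```

Going from 44100 Hz to 16000 Hz therefore means upsampling by 160 and downsampling by 441, which `resample_poly` handles in a single filtered step.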

frixos25 commented 1 year ago

I also tried umxse with a 16 kHz sampling rate. There are still gaps, at intervals of 30-150 s.

faroit commented 1 year ago

I tried to reproduce that without success. Here is an example of the separation of a 3min sine wave input

I suggest you try the umx command-line separator and load the audio the same way we do here, if you are not doing so already.

frixos25 commented 1 year ago

I tested a 45-minute podcast at a 16 kHz sampling rate: the stereo version produced a lot of gaps, the mono version only one.

The output speech file is always in stereo. Is there an error in this code?

mixture, _ = sf.read(r"C:\temp\LanzPrecht16k.wav", dtype="float32", always_2d=False)
estimates = predict.separate(torch.as_tensor(mixture).float(), rate = 16000, model_str_or_path='umxse', device='cpu')

estimates_numpy = {}
for target, estimate in estimates.items():
    estimates_numpy[target] = torch.squeeze(estimate).detach().cpu().numpy().T

target_path = str(r"C:\openunmix\umx_demo\openunmix\output\target.mp3")
stempeg.write_stems(
    target_path,
    estimates_numpy,
    sample_rate=16000,
    writer=stempeg.FilesWriter(multiprocess=False, output_sample_rate=16000),
)

faroit commented 1 year ago

Can you post a segment around the error here, please?

frixos25 commented 1 year ago

Here is an example speech.wav produced from a stereo 16 kHz input file: https://drive.google.com/file/d/1JxJGaQLXHJd_kykMudblKWWU4avmtEvg/view?usp=sharing

faroit commented 1 year ago

That's very weird indeed. Is this deterministic with the same input file?

faroit commented 1 year ago

Maybe @TE-StefanUhlich has an idea? Is this some instability in the LSTM?

frixos25 commented 1 year ago

I tested on a Windows 10 and a Windows 11 computer. The result is deterministic on the same computer, but it differs between the two machines.

StefanUhlich-sony commented 1 year ago

I could imagine that the problem is related to the Wiener filtering, which breaks down when it receives a stereo file that is actually monaural. Do you use a Wiener filter at the output? Could you try feeding in just a mono file?
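To test this hypothesis on a given recording, one can check whether the two channels of a stereo file carry exactly the same signal. The sketch below is an illustration only: `is_dual_mono` and its `tol` parameter are hypothetical names, and the check assumes the audio is a NumPy array shaped (n_samples, 2).

```python
import numpy as np


def is_dual_mono(audio, tol=0.0):
    """Return True if a (n_samples, 2) array has the same signal in both channels.

    Hypothetical helper for illustration, not part of Open-Unmix.
    """
    # Cast to float64 so integer inputs cannot overflow in the subtraction.
    audio = np.asarray(audio, dtype=np.float64)
    if audio.ndim != 2 or audio.shape[1] != 2 or audio.shape[0] == 0:
        return False
    return bool(np.max(np.abs(audio[:, 0] - audio[:, 1])) <= tol)


# Synthetic check with a 440 Hz tone sampled at 16 kHz.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
mono = np.sin(2 * np.pi * 440.0 * t)
fake_stereo = np.stack([mono, mono], axis=1)        # same signal twice
true_stereo = np.stack([mono, 0.5 * mono], axis=1)  # channels differ
print(is_dual_mono(fake_stereo), is_dual_mono(true_stereo))  # True False
```

If the check returns True, the file could be downmixed with `audio.mean(axis=1)` before separation to see whether the number of gaps changes.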

frixos25 commented 1 year ago

I use the default values from predict.py, so I think the Wiener filter is on (see the code I posted). With a mono input file I had only one gap in a 45-minute recording.

StefanUhlich-sony commented 1 year ago

Could you try to run predict.py without Wiener filtering (just using the raw outputs)?

frixos25 commented 1 year ago

Is this the code I should try to deactivate? (model.py 294)

for sample in range(nb_samples):
    pos = 0
    if self.wiener_win_len:
        wiener_win_len = self.wiener_win_len
    else:
        wiener_win_len = nb_frames
    while pos < nb_frames:
        cur_frame = torch.arange(pos, min(nb_frames, pos + wiener_win_len))
        pos = int(cur_frame[-1]) + 1
        targets_stft[sample, cur_frame] = wiener(
            spectrograms[sample, cur_frame],
            mix_stft[sample, cur_frame],
            self.niter,
            softmask=self.softmask,
            residual=self.residual,
        )

faroit commented 1 year ago

@frixos25 No, just set niter=0 when you call the function defined here:

https://github.com/sigsep/open-unmix-pytorch/blob/05fd4d8a0e3e50e308579052d762a342647c3408/openunmix/predict.py#L4-L16
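Applied to the call posted earlier in this thread, the suggestion might look like the sketch below. The keyword names `rate`, `model_str_or_path`, `device`, and `niter` all appear in this thread. The helper names `separation_kwargs` and `separate_without_wiener`, and the choice of `niter=1` for the "filter on" case, are assumptions made for this example.

```python
def separation_kwargs(use_wiener=True, rate=16000, model="umxse", device="cpu"):
    """Build keyword arguments for openunmix.predict.separate.

    niter=0 skips the Wiener filter post-processing, as suggested above.
    niter=1 for the enabled case is an assumption about a sensible value.
    """
    return {
        "rate": rate,
        "model_str_or_path": model,
        "niter": 1 if use_wiener else 0,
        "device": device,
    }


def separate_without_wiener(mixture):
    """Run the separation on a torch tensor and return the raw estimates."""
    # Imported lazily: requires the openunmix package and torch to be installed.
    from openunmix import predict

    return predict.separate(mixture, **separation_kwargs(use_wiener=False))


print(separation_kwargs(use_wiener=False))
```

Calling `separate_without_wiener(torch.as_tensor(mixture).float())` would then return the estimates without Wiener filtering, so the two outputs can be compared for gaps.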

frixos25 commented 1 year ago

Yes, without the Wiener filter there are no gaps with a stereo input file.

faroit commented 1 year ago

@frixos25 Great. The Wiener filter is probably unstable with such long inputs.

frixos25 commented 1 year ago

Thank you for your support. I tested umxse and my own model at a 44100 Hz sample rate, with and without the Wiener filter. With the umxse model I cannot hear a difference. With my model I hear more high-frequency noise.

Will the stability of the Wiener filter be improved in the near future?

faroit commented 5 months ago

@frixos25 You could also try https://github.com/sigsep/norbert, which has higher accuracy than torch when running on CPU.