neillu23 / CDiffuSE

Conditional Diffusion Probabilistic Model for Speech Enhancement
Apache License 2.0
200 stars, 34 forks

Try to reproduce but some issues occur #3

Open yizhidamiaomiao opened 2 years ago

yizhidamiaomiao commented 2 years ago

I ran the command "./ 0 se model_se".

The issue is:

"""""""""""""""""""""""""""""""""
Preprocessing: 0%| | 0/11572 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "src/cdiffuse/", line 140, in <module>
    main(parser.parse_args())
  File "src/cdiffuse/", line 120, in main
    list(tqdm(, filenames, repeat(args.dir), repeat(args.outdir)), desc='Preprocessing', total=len(filenames)))
  File "/home/tiger/.local/lib/python3.7/site-packages/tqdm/", line 1195, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.7/concurrent/futures/", line 476, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib/python3.7/concurrent/futures/", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.7/concurrent/futures/", line 432, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
"""""""""""""""""""""""""""""""""

How can I solve this?
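A `BrokenProcessPool` only says that a worker process died; it does not say why, and nothing in this traceback identifies the cause. One way to narrow it down is to call the same per-file function serially and record which files fail. The sketch below is not code from the repository: `run_serially` and the stand-in `fake_transform` are hypothetical names, and the real per-file preprocessing function would take the place of the stand-in.

```python
from itertools import repeat


def run_serially(fn, filenames, *extra_iterables):
    """Call fn(filename, *extras) one file at a time, without a process pool.

    Returns (results, failures); failures is a list of (filename, exception).
    """
    results, failures = [], []
    for args in zip(filenames, *extra_iterables):
        try:
            results.append(fn(*args))
        except Exception as exc:  # keep going so every bad file is reported
            failures.append((args[0], exc))
    return results, failures


def fake_transform(filename, indir, outdir):
    # Hypothetical stand-in for the real per-file preprocessing function.
    if not filename.endswith(".wav"):
        raise ValueError(f"unexpected file type: {filename}")
    return f"{outdir}/{filename}.npy"


if __name__ == "__main__":
    files = ["a.wav", "b.txt", "c.wav"]
    done, failed = run_serially(fake_transform, files, repeat("in"), repeat("out"))
    print(len(done), "ok;", [(name, str(exc)) for name, exc in failed])
```

If the serial run finishes cleanly, the failure is more likely tied to the pool itself (for example memory pressure from many workers), and lowering the worker count is a reasonable next experiment.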

Although "se_pre" mode runs, with the dataset provided by your link I MUST change the sample_rate to 48000 in, otherwise the code throws an error. Is this change correct for reproducing your results?

Also, I have trained "se_pre" mode for 12 hours on 4 GPUs and reached step 156600. How long (how many epochs) do we need to train your model?

neillu23 commented 2 years ago

Hi @yizhidamiaomiao, thanks for sharing your experience! I've replaced torchaudio.load_wav() with the torchaudio.load() function in the new commit. This may fix some errors, as torchaudio.load_wav has been removed in newer versions of torchaudio. For the second question, can you share a link to the data, and is the sample rate of the data you are using 48000? Also, the "se_pre" step is no longer needed, as the randomly initialized CDiffuSE performs as well as the one initialized from pre-trained parameters. The model with step 507600 (no pre-training) in my experiments slightly exceeded our paper's results. Please try the new code and let me know if you have any further questions!

yizhidamiaomiao commented 2 years ago

Hi, thank you for your response!

Following your instructions, I no longer train "se_pre". I trained your model directly with the command "./ 0 se model_se" and evaluated at step 600075 with the command "./ 0 600075 se model_se".

However, the results seem different from those in your 'Sample Files' folder. Here is the link to the speech generated by the trained model: ''. It may not be competitive with SOTA models. Could you please help us find out what we should do to reproduce your results in 'Sample Files'?

neillu23 commented 2 years ago

Hi @yizhidamiaomiao, thanks for sharing the audio file! The command you are using seems to be from a previous commit. I've updated the command style and torchaudio functions in this commit: The new commands would be ". / 0 model_se" and ". / 0 model_se 600075". Here are the results I got from my trained model: ''. The environment I used was torchaudio 0.9.0 / pytorch 1.9.0. If this doesn't work for you, please let me know; thanks again!

yizhidamiaomiao commented 2 years ago


Thanks for your response!

We downloaded your newest code, trained with the command ". / 0 model_se", and ran inference with the command "./ 0 model_se 108000 ". The trained model is ''. The newest results we got are in the link, in files named "*_enhanced_ver 7e13e6e.wav". There still seems to be some noise in the enhanced speech.

Shall we wait for step 507600?

The environment I used is torchaudio '0.10.0+cu113'/ pytorch 1.10.0.

Looking forward to any further guidance, and thanks for your patience!

neillu23 commented 2 years ago

Thank you for reporting these results!

I think a possible reason is a difference between our training data. You mentioned that the data you used has a 48000 sampling rate, but the data I used has a 16000 sampling rate. Could you share your training data and model with me so I can check whether your data/model works in my environment?
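A quick way to confirm which sampling rate a local copy of the dataset actually has is to read the WAV headers. The following is a minimal sketch using only the Python standard library; `sample_rate_counts` is a hypothetical helper name, and it assumes the files are uncompressed PCM WAV (the `wave` module raises `wave.Error` on other encodings).

```python
import wave
from collections import Counter
from pathlib import Path


def sample_rate_counts(root):
    """Count the sample rates of all .wav files found under root."""
    counts = Counter()
    for path in sorted(Path(root).rglob("*.wav")):
        # Only the header is needed, so this stays fast on large datasets.
        with wave.open(str(path), "rb") as wav:
            counts[wav.getframerate()] += 1
    return counts
```

Calling `sample_rate_counts` on the dataset directory returns a mapping from sampling rate to file count, so a mixed or unexpected rate shows up immediately.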

Thank you again, and sorry for the inconvenience!

yizhidamiaomiao commented 2 years ago

I use the data directly from your link "" given in the sentence "The default dataset is VOICEBANK-DEMAND dataset. You can download them from VOICEBANK-DEMAND" in the file. The audio downloaded from that website is actually 48k, so I had to add a torchaudio.resample(48k, 16k) to the function "transform" in your preprocess file in order to train.

neillu23 commented 2 years ago

The data I'm using is already at a 16k sample rate, which may be different from the one in the link. Could you try adding a torchaudio.resample(48k, 16k) for both "signal" and "noisysignal" in the __getitem__ function here in NumpyDataset? If this works, I will change the description in the README. Sorry again about this issue.