neillu23 / CDiffuSE

Conditional Diffusion Probabilistic Model for Speech Enhancement
Apache License 2.0

Resampling in training #4

Open yahshibu opened 2 years ago

yahshibu commented 2 years ago

Hello, thank you for making your code publicly available! Great work.

This is not a question. I was confused by how the code handles sample rates, and although I have since worked it out, let me share what happened.

I know that the README says, "this implementation assumes a sample rate of 16 kHz". On the other hand, as you know, the original sampling rate of VoiceBank-DEMAND is 48 kHz. So we need to resample the audio signals to apply CDiffuSE to this dataset.

Indeed, your scripts resample audio signals in preprocessing and inference.

For preprocessing: https://github.com/neillu23/CDiffuSE/blob/e4b069f1cb40f5406cff0b9295426026c3ea6ecc/src/cdiffuse/preprocess.py#L37

For inference: https://github.com/neillu23/CDiffuSE/blob/e4b069f1cb40f5406cff0b9295426026c3ea6ecc/src/cdiffuse/inference.py#L185

However, audio signals are not resampled in training: https://github.com/neillu23/CDiffuSE/blob/e4b069f1cb40f5406cff0b9295426026c3ea6ecc/src/cdiffuse/dataset.py#L56-L57

To reproduce your experimental results, I followed the README and used this script as is. As a result, 48 kHz audio signals are loaded in dataset.py and fed to the diffusion model unchanged during training. When I listened to an audio signal saved here https://github.com/neillu23/CDiffuSE/blob/e4b069f1cb40f5406cff0b9295426026c3ea6ecc/src/cdiffuse/learner.py#L172 , it sounded as if it were played back slowly. This is because a 48 kHz signal is saved as a 16 kHz signal.
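A minimal sketch of why the saved audio sounds slow, using only the Python standard library and not the repository's own code. It writes the same number of samples twice, once with the true rate in the header and once with the wrong one, and compares the durations a player would compute. The helper name `wav_duration_seconds` and the constants are made up for illustration.

```python
import io
import wave

TRUE_RATE = 48000    # rate the samples were actually recorded at
HEADER_RATE = 16000  # rate written into the file header by mistake


def wav_duration_seconds(num_samples, header_rate):
    """Write silent 16-bit mono audio to an in-memory WAV file and return
    the duration a player would compute from the header."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit PCM
        w.setframerate(header_rate)
        w.writeframes(b"\x00\x00" * num_samples)
    buf.seek(0)
    with wave.open(buf, "rb") as r:
        return r.getnframes() / r.getframerate()


one_second = TRUE_RATE  # number of samples in one second of 48 kHz audio
correct = wav_duration_seconds(one_second, TRUE_RATE)
mislabeled = wav_duration_seconds(one_second, HEADER_RATE)
print(correct, mislabeled)
```

The samples are identical in both files. Only the rate in the header differs, so the mislabeled file plays three times longer, and correspondingly lower in pitch.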

It might be better either to resample audio signals in dataset.py as well, or to state explicitly that users need to resample the audio themselves in advance. Either would be clearer, at least to me.
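As a rough illustration of the second option, here is a sketch of a guard that fails loudly when a training file is not at the expected rate, rather than silently training on 48 kHz data. The function name `check_sample_rate` and the constant are hypothetical and not part of the repository. It assumes the loader already returns the file's sample rate alongside the waveform.

```python
TARGET_SAMPLE_RATE = 16000  # the rate the README says the implementation assumes


def check_sample_rate(path, sample_rate, target=TARGET_SAMPLE_RATE):
    """Raise a clear error if a training file is not at the expected rate.

    `sample_rate` is whatever the audio loader reported for `path`.
    """
    if sample_rate != target:
        raise ValueError(
            f"{path}: expected {target} Hz but got {sample_rate} Hz; "
            "resample the file before training"
        )
    return sample_rate


# Hypothetical usage: a 16 kHz file passes, a 48 kHz file is rejected.
check_sample_rate("clean_16k.wav", 16000)
try:
    check_sample_rate("clean_48k.wav", 48000)
except ValueError as err:
    print(err)
```

A check like this does not resample anything. It only turns the silent mismatch into an immediate error, which would still apply if resampling were later added elsewhere in the pipeline.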

Sorry for the long post. Best regards.

neillu23 commented 2 years ago

Hi @yahshibu, thank you so much for clearing this up! I had been working with the 16 kHz version for a while and did not notice that the original data behind the VoiceBank-DEMAND link is 48 kHz. This has caused a lot of problems for users trying to reproduce the results. I will add the resampling step to dataset.py as you suggested. Thanks again for your help!