Resampling in training

Hello, thank you for making your code publicly available! Great work.

This is not a question. I was confused by how resampling is handled in the code, and although I have since resolved that confusion, I would like to share what happened.

The README says, "this implementation assumes a sample rate of 16 kHz". However, the original sampling rate of VoiceBank-DEMAND is 48 kHz, so the audio signals must be resampled before CDiffuSE can be applied to this dataset.

Indeed, your scripts resample audio signals in preprocessing and inference.

For preprocessing: https://github.com/neillu23/CDiffuSE/blob/e4b069f1cb40f5406cff0b9295426026c3ea6ecc/src/cdiffuse/preprocess.py#L37

For inference: https://github.com/neillu23/CDiffuSE/blob/e4b069f1cb40f5406cff0b9295426026c3ea6ecc/src/cdiffuse/inference.py#L185

However, audio signals are not resampled during training: https://github.com/neillu23/CDiffuSE/blob/e4b069f1cb40f5406cff0b9295426026c3ea6ecc/src/cdiffuse/dataset.py#L56-L57

To reproduce your experimental results, I followed the README and used the scripts as they are. As a result, 48 kHz audio signals were loaded in dataset.py and passed to the diffusion model unchanged during training. When I listened to an audio signal saved here https://github.com/neillu23/CDiffuSE/blob/e4b069f1cb40f5406cff0b9295426026c3ea6ecc/src/cdiffuse/learner.py#L172 , it sounded slowed down. This is because a 48 kHz signal is saved as if it were a 16 kHz signal, so it plays back at one third of the original speed.
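To illustrate the slowdown, here is a minimal sketch using only Python's standard `wave` module. It is not code from the repository: the helper names `write_silence` and `playback_seconds` are hypothetical, and silence stands in for real audio. It writes the same number of samples twice, once with a 48 kHz header and once with a 16 kHz header, and compares the resulting playback durations.

```python
import io
import wave


def write_silence(num_frames, header_rate):
    """Write `num_frames` mono 16-bit samples to an in-memory WAV file.

    `header_rate` is the sample rate recorded in the WAV header. It does not
    change the samples themselves, only how a player interprets them.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(header_rate)
        w.writeframes(b"\x00\x00" * num_frames)
    buf.seek(0)
    return buf


def playback_seconds(wav_file):
    """Return the playback duration implied by the WAV header."""
    with wave.open(wav_file, "rb") as w:
        return w.getnframes() / w.getframerate()


# Two seconds of audio recorded at 48 kHz.
frames = 48000 * 2

correct = playback_seconds(write_silence(frames, 48000))
mislabeled = playback_seconds(write_silence(frames, 16000))

print(correct)               # 2.0
print(mislabeled)            # 6.0
print(mislabeled / correct)  # 3.0
```

The same samples last three times as long when the header says 16 kHz, which matches the slowed-down playback I heard.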

It might be better either to resample audio signals in dataset.py as well, or to state explicitly that users need to resample the audio themselves in advance. Either would be clearer, at least to me.
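As a rough sketch of a safeguard users could run before training, the following checks the sample rate recorded in each WAV header and reports files that do not match the expected rate. It is not part of the repository: the function name `find_wrong_rate_files` and the constant `EXPECTED_SAMPLE_RATE` are hypothetical, and it assumes the training files are plain PCM WAV files that Python's standard `wave` module can read.

```python
import wave

# The README states that the implementation assumes 16 kHz audio.
EXPECTED_SAMPLE_RATE = 16000


def find_wrong_rate_files(paths, expected_rate=EXPECTED_SAMPLE_RATE):
    """Return (path, rate) pairs for WAV files whose header rate differs.

    Only the header is inspected; the audio data is not decoded or modified.
    """
    mismatches = []
    for path in paths:
        with wave.open(str(path), "rb") as w:
            rate = w.getframerate()
        if rate != expected_rate:
            mismatches.append((str(path), rate))
    return mismatches
```

For example, passing the list of training WAV paths and getting back a non-empty list would mean those files still need to be resampled to 16 kHz before training.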

Sorry for the long post. Best regards.

neillu23/CDiffuSE, issue #4