mustass / diffusion_models_for_speech

Deep Learning course project repository.
https://kurser.dtu.dk/course/02456

Downsample/Upsample Data #15

Closed: panosapos closed this issue 1 year ago

panosapos commented 1 year ago

Since the two datasets have different sampling rates, we should either upsample or downsample our data during preprocessing/training and inference. This can be handled when loading the audio with torchaudio: torchaudio.load returns the file's native sampling rate together with the waveform, which can then be resampled to the target rate (e.g. with torchaudio.functional.resample). However, if we decide to downsample the data, we have to consider whether the way we extract spectrograms needs any additional changes.
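A minimal sketch of the resampling step, assuming waveforms are NumPy arrays in (channels, samples) or (samples,) layout and a target rate of 16 kHz (the rate discussed later in this thread). It uses scipy.signal.resample_poly as a swapped-in alternative to torchaudio; torchaudio.functional.resample plays the same role for tensors. The function name resample_to is hypothetical. Polyphase resampling applies an anti-aliasing low-pass filter, which matters mainly when downsampling.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # assumed target rate for both datasets


def resample_to(waveform: np.ndarray, orig_sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Resample along the last (time) axis from orig_sr to target_sr."""
    if orig_sr == target_sr:
        # Nothing to do: return the input unchanged
        return waveform
    # Reduce the rate ratio to the smallest integer up/down factors
    g = gcd(orig_sr, target_sr)
    up, down = target_sr // g, orig_sr // g
    # resample_poly low-pass filters before decimating, avoiding aliasing
    return resample_poly(waveform, up, down, axis=-1)
```

For example, a 1 s clip at 44.1 kHz (44,100 samples) reduces to up=160, down=441 and comes out as 16,000 samples.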

panosapos commented 1 year ago

Note: the Danish dataset has two channels (right and left ear). We should keep only the first channel and use it, together with its corresponding spectrogram, as input.
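A minimal sketch of keeping only the first channel, assuming a channels-first (channels, samples) layout, which is what torchaudio.load returns by default. The helper name first_channel is hypothetical. Slicing with [:1] rather than [0] keeps a leading channel axis of size 1, so mono and stereo files end up with the same shape.

```python
import numpy as np


def first_channel(waveform: np.ndarray) -> np.ndarray:
    """Return only channel 0 as a (1, samples) array."""
    if waveform.ndim == 1:
        # Mono signal without a channel axis: add one for a consistent shape
        return waveform[np.newaxis, :]
    # (channels, samples) -> (1, samples): keep channel 0, drop the other ear
    return waveform[:1, :]
```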

panosapos commented 1 year ago

The spectrogram generation can remain the same. In conditional training, the crop length is set by the n_mel_crop_frames parameter and is related to the length of the audio signal by the following formula:

N_x = N_frames * hop_size, where N_x is the audio length in samples and N_frames is the crop length in mel frames (n_mel_crop_frames)

At 16 kHz, the predefined crop length of 62 frames corresponds to approximately 1 s of audio (e.g. with a hop size of 256 samples, 62 * 256 = 15,872 samples, about 0.99 s).
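The relation N_x = N_frames * hop_size can be illustrated with an aligned random crop of an (audio, mel) pair. This is a sketch under assumptions: the function names are hypothetical, the mel spectrogram is taken to be an (n_mels, frames) array whose frame t starts at audio sample t * hop_size, and the hop size of 256 in the example is assumed rather than taken from the project's configuration.

```python
import numpy as np


def crop_length_samples(n_mel_crop_frames: int, hop_size: int) -> int:
    # N_x = N_frames * hop_size
    return n_mel_crop_frames * hop_size


def random_aligned_crop(audio: np.ndarray, mel: np.ndarray, n_mel_crop_frames: int,
                        hop_size: int, rng: np.random.Generator):
    """Crop n_mel_crop_frames mel frames and the matching N_x audio samples."""
    n_frames = mel.shape[-1]
    if n_frames < n_mel_crop_frames:
        raise ValueError("spectrogram is shorter than the requested crop")
    # Random start frame; the upper bound of integers() is exclusive
    start = int(rng.integers(0, n_frames - n_mel_crop_frames + 1))
    end = start + n_mel_crop_frames
    mel_crop = mel[:, start:end]
    # Frame indices map to sample indices by multiplying with the hop size
    audio_crop = audio[start * hop_size:end * hop_size]
    # Zero-pad if the waveform ends before the last frame's hop is complete
    n_x = crop_length_samples(n_mel_crop_frames, hop_size)
    audio_crop = np.pad(audio_crop, (0, n_x - len(audio_crop)))
    return audio_crop, mel_crop


hop_size = 256  # assumed hop size
n_x = crop_length_samples(62, hop_size)
print(n_x, n_x / 16_000)  # 15872 0.992
```

Cropping in frame units first and deriving the sample range from it keeps the waveform and the conditioning spectrogram aligned regardless of the sampling rate; only the duration in seconds changes with the rate.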

panosapos commented 1 year ago

This issue is actually closed. Regarding the two channels of the Danish dataset, the Dataset class always returns the first channel. Regarding the spectrograms, no additional work is needed.