Closed panosapos closed 1 year ago
Note: the Danish Dataset has two channels (right and left ear). We should keep only the first channel and use it, together with the corresponding spectrogram, as input.
The generation of the spectrograms should remain the same. In conditional training, the crop length depends on the n_mel_crop_frames parameter and is related to the length of the audio signal by the following formula:
Nx = N_frames * hop_size, where Nx is the audio length in samples and N_frames is the number of spectrogram frames in the crop
At 16 kHz, the predefined value of hop_size (62) corresponds to approximately 1 s of audio, i.e. Nx is about 16000 samples
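The relation above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function names are hypothetical, the values 62 and 16 kHz are the ones quoted in the note, and the frame count is derived here rather than read from any config in the repo.

```python
def crop_length_samples(n_frames: int, hop_size: int) -> int:
    """Audio crop length in samples: Nx = N_frames * hop_size."""
    return n_frames * hop_size


def frames_for_duration(seconds: float, sample_rate: int, hop_size: int) -> int:
    """Frame count whose crop length is closest to `seconds` of audio."""
    return round(seconds * sample_rate / hop_size)


sample_rate = 16_000  # Hz, as stated in the note
hop_size = 62         # value quoted in the note

# Derived for illustration only, not taken from the repo's configuration.
n_frames = frames_for_duration(1.0, sample_rate, hop_size)
nx = crop_length_samples(n_frames, hop_size)

print(f"N_frames={n_frames}, Nx={nx} samples, duration={nx / sample_rate:.4f} s")
```

Only the product N_frames * hop_size fixes the crop duration, so changing either factor (or the sampling rate) changes how much audio one crop covers.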
This issue is actually closed. Regarding the two channels of the Danish Dataset, the Dataset class always returns the first channel. Regarding the spectrograms, no additional work is needed.
Since the two datasets have different sampling rates, we should either upsample or downsample our data during preprocessing/training and inference. Note that torchaudio.load returns the waveform at the file's native sampling rate, so the resampling has to be applied afterwards, for example with torchaudio.transforms.Resample. However, if we decide to downsample the data, we have to consider whether any additional modifications are needed in the way we extract spectrograms.
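On the spectrogram question: when hop and window sizes are given in samples, the same values cover a different time span after resampling. Below is a minimal sketch of one possible adjustment, assuming we want each frame to cover roughly the same duration as before. The function name and the 48 kHz / 768 / 3072 values are hypothetical and are not taken from either dataset.

```python
def scale_to_sample_rate(n_samples: int, src_sr: int, dst_sr: int) -> int:
    """Rescale a length given in samples so it spans the same duration at dst_sr."""
    return max(1, round(n_samples * dst_sr / src_sr))


# Hypothetical example values, chosen only to make the arithmetic easy to follow.
src_sr, dst_sr = 48_000, 16_000
hop_size_src, win_length_src = 768, 3072

hop_size_dst = scale_to_sample_rate(hop_size_src, src_sr, dst_sr)
win_length_dst = scale_to_sample_rate(win_length_src, src_sr, dst_sr)

# Both pairs of settings describe the same time span per frame.
print(hop_size_src / src_sr, hop_size_dst / dst_sr)
print(win_length_dst)
```

The alternative is to keep the sample-domain parameters fixed and accept that each frame then covers a different duration. Which option is preferable depends on what the model expects at training and inference time.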