Query on Resampling and Audio Format Compliance in Competition Rules

voidful / Codec-SUPERB

Audio Codec Speech processing Universal PERformance Benchmark

https://codecsuperb.com


Query on Resampling and Audio Format Compliance in Competition Rules #33

Open huazhi1024 opened 2 months ago

huazhi1024 commented 2 months ago

Hello, in the released development set, different test sets have varying sampling rates such as 8kHz, 16kHz, 44.1kHz, and 48kHz, as well as different audio formats like WAV and FLAC. My model was trained on 16kHz speech data. During inference, if the input audio is not 16kHz, it will be automatically resampled to 16kHz before encoding and reconstruction. Does this comply with the competition rules?

hbwu-ntu commented 1 month ago

Thank you for bringing up this question.

Yes, resampling to 16kHz for both encoding and reconstruction is allowed.

However, please note that the evaluation pipeline expects the audio to be at the same sampling rate as the original datasets. Therefore, you should resample the audio back to its original sampling rate before evaluation.

We recommend saving the original audio's sampling rate (sr) when you load each file. After codec reconstruction, resample the reconstructed audio back to that original sampling rate (sr). This should not add much effort to your resynthesis Python script.
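For illustration, the recommendation above could look roughly like the sketch below. This is not the official Codec-SUPERB resynthesis script. `CODEC_SR`, `resample`, `resynthesize`, and the `codec` callable are hypothetical names, and the choice of `scipy.signal.resample_poly` as the resampler is an assumption. Trimming or padding the output to the input length is likewise a convenience and not a stated rule.

```python
from fractions import Fraction
from typing import Callable

import numpy as np
from scipy.signal import resample_poly

CODEC_SR = 16_000  # assumption: the rate the codec was trained on


def resample(audio: np.ndarray, src_sr: int, dst_sr: int) -> np.ndarray:
    """Polyphase-resample a 1-D signal from src_sr to dst_sr."""
    if src_sr == dst_sr:
        return audio
    ratio = Fraction(dst_sr, src_sr)  # reduced, e.g. 16000/44100 -> 160/441
    return resample_poly(audio, ratio.numerator, ratio.denominator)


def resynthesize(
    audio: np.ndarray,
    sr: int,
    codec: Callable[[np.ndarray], np.ndarray],
) -> np.ndarray:
    """Run a 16 kHz codec on audio of any rate and return audio at the original sr.

    `codec` is a hypothetical stand-in for encode + decode at CODEC_SR.
    """
    original_len = len(audio)
    codec_input = resample(audio, sr, CODEC_SR)      # to the codec's rate
    reconstructed = codec(codec_input)               # encode + decode
    output = resample(reconstructed, CODEC_SR, sr)   # back to the saved sr
    # Optional (assumption): match the input length by trimming or zero-padding.
    output = output[:original_len]
    if len(output) < original_len:
        output = np.pad(output, (0, original_len - len(output)))
    return output
```

In a real script, `audio` and `sr` would come from whatever loader you use, and `sr` would also be passed to the writer when saving the result.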

Thank you.

redmist328 commented 1 month ago

Hi @hbwu-ntu,

I also have a similar problem. If my codec is trained on 16 kHz data, then any data above 16 kHz is actually very easy for me to handle. I just need to downsample it to 16 kHz, perform the computations, and then upsample the generated speech to the required sampling rate. This way, the generated speech, while empty in the frequency range above 8 kHz, at least sounds normal.
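To make the point about the missing band concrete, here is a small sketch with no codec involved. It only round-trips a synthetic 48 kHz signal through 16 kHz and measures how much energy remains above 8 kHz. The test tones, the helper names, and the use of `scipy.signal.resample_poly` are my own assumptions for illustration.

```python
from fractions import Fraction

import numpy as np
from scipy.signal import resample_poly


def resample(audio: np.ndarray, src_sr: int, dst_sr: int) -> np.ndarray:
    """Polyphase-resample a 1-D signal from src_sr to dst_sr."""
    if src_sr == dst_sr:
        return audio
    ratio = Fraction(dst_sr, src_sr)
    return resample_poly(audio, ratio.numerator, ratio.denominator)


def high_band_fraction(audio: np.ndarray, sr: int, cutoff_hz: float = 8_000.0) -> float:
    """Fraction of spectral energy located above cutoff_hz."""
    power = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    return float(power[freqs > cutoff_hz].sum() / power.sum())


SR = 48_000
t = np.arange(SR) / SR  # one second of samples
# Two equal-amplitude tones: one below and one above the 8 kHz Nyquist of 16 kHz audio.
signal = np.sin(2 * np.pi * 1_000 * t) + np.sin(2 * np.pi * 12_000 * t)

# Downsample to 16 kHz, then upsample back to 48 kHz (no codec in between).
round_trip = resample(resample(signal, SR, 16_000), 16_000, SR)

before = high_band_fraction(signal, SR)
after = high_band_fraction(round_trip, SR)
print(f"energy above 8 kHz: before={before:.3f}, after={after:.6f}")
```

The 12 kHz tone is removed by the anti-aliasing filter during downsampling, and upsampling cannot restore it, so the output has the requested sampling rate but only 16 kHz worth of bandwidth.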

However, suppose I experiment with a codec that operates at a high sampling rate, such as 48 kHz, and I need to encode audio with a sampling rate of 16 kHz. If I first upsample it to 48 kHz, perform the computations, and then downsample the generated audio to 16 kHz, the resulting audio is almost inaudible, even though my model can recover 48 kHz input data quite well.

hbwu-ntu commented 1 month ago

@redmist328 Hi, Redmist, thank you for bringing up this point. We will compare codec models with the same sampling rate.