Thank you for reaching out. The only thing you need to take care of is generating the synthesized speech and audio yourself and putting them under `syn_path`: https://github.com/voidful/Codec-SUPERB/blob/SLT_Challenge/run.sh#L5. The data under `ref_path` and `syn_path` should follow the same structure to run `run.sh`. `run.sh` will automatically evaluate the different tasks on their corresponding datasets, e.g. `ravdess` for emotion recognition in stage 1, `librispeech` for ASR in stage 3, etc.
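Before running `run.sh`, it may help to verify the two folders really mirror each other. A minimal sketch, assuming both roots contain `.wav` files (the paths below are placeholders for your local layout):

```python
from pathlib import Path

ref_path = Path("codec_superb_data/ref")  # placeholder: your reference audio root
syn_path = Path("codec_superb_data/syn")  # placeholder: your re-synthesized audio root

# Collect the relative paths of all audio files under each root.
ref_files = {p.relative_to(ref_path) for p in ref_path.rglob("*.wav")}
syn_files = {p.relative_to(syn_path) for p in syn_path.rglob("*.wav")}

missing = ref_files - syn_files  # in ref_path but not yet re-synthesized
extra = syn_files - ref_files    # re-synthesized files with no reference counterpart

print(f"missing: {len(missing)}, extra: {len(extra)}")
for p in sorted(missing)[:10]:
    print("missing:", p)
```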
Thanks for your reply. So I need to manually split the multiple datasets in the public test set `codec_superb_data` you provided into two large sets, SPEECH and AUDIO, then re-synthesize the speech and audio separately, and run `run.sh` on each to get two sets of evaluation results, one for SPEECH and one for AUDIO, right?
Basically yes. One more thing: the evaluation data is small, so re-synthesizing it doesn't take long. You may use ChatGPT: give it the `ref_path` folder structure and let it write a script that runs your codec for re-synthesis and saves the outputs in the same folder structure as `ref_path`.
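Such a script boils down to a simple loop. A minimal sketch, assuming your codec exposes `encode`/`decode` round-trip methods (`MyCodec` and the paths below are placeholders, not part of the challenge code):

```python
from pathlib import Path

import soundfile as sf

from my_codec import MyCodec  # placeholder: your own codec implementation

ref_path = Path("codec_superb_data/ref")  # placeholder reference root
syn_path = Path("codec_superb_data/syn")  # outputs mirror ref_path's structure

codec = MyCodec()  # placeholder constructor

for ref_file in ref_path.rglob("*.wav"):
    wav, sr = sf.read(ref_file)

    # Placeholder round trip: encode to codec tokens, then decode back to audio.
    resynth = codec.decode(codec.encode(wav, sr), sr)

    out_file = syn_path / ref_file.relative_to(ref_path)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    sf.write(out_file, resynth, sr)
```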
Q1: For the objective test, do we only need to consider the datasets under the `samples/` folder? And should we classify these datasets into audio and speech ourselves? Could you check whether the classification below is right?
The datasets under `samples/`:
| dataset | Type |
|---|---|
| crema_d | speech |
| esc-50 | audio |
| fluent-speech-commands | speech |
| fsd50k | audio |
| gunshot_triangulation | audio |
| libri2Mix_test | speech |
| librispeech | speech |
| quesst | speech |
| snips_test_valid_subset | speech |
| vox_lingua_top10 | speech |
| voxceleb1 | speech |
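(For context, here is a minimal sketch of how I plan to split them based on the table above; it assumes each dataset sits in its own subfolder under `samples/`, and the paths reflect my local layout, not anything prescribed by the challenge:)

```python
import shutil
from pathlib import Path

# Classification taken from the table above.
DATASET_TYPE = {
    "crema_d": "speech",
    "esc-50": "audio",
    "fluent-speech-commands": "speech",
    "fsd50k": "audio",
    "gunshot_triangulation": "audio",
    "libri2Mix_test": "speech",
    "librispeech": "speech",
    "quesst": "speech",
    "snips_test_valid_subset": "speech",
    "vox_lingua_top10": "speech",
    "voxceleb1": "speech",
}

samples = Path("codec_superb_data/samples")  # assumed local layout
out_root = Path("codec_superb_data/split")

for name, kind in DATASET_TYPE.items():
    src = samples / name
    dst = out_root / kind.upper() / name  # SPEECH/<dataset> or AUDIO/<dataset>
    if src.exists():
        shutil.copytree(src, dst, dirs_exist_ok=True)
```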
Q2: I also found that these test datasets, including those for the downstream tasks, use different sampling rates. Are we expected to train one universal codec model at a high sampling rate (to handle both low and high sampling rates)? Or can we use different codec models for these tasks?
Q1: Yes.
Q2: Either is acceptable. If you have multiple codec models, please specify which codec corresponds to each sampling rate during submission.
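If you go the multi-model route, dispatching by sampling rate is straightforward. A minimal sketch, where `load_codec` and the model names and rates are assumptions for illustration only:

```python
import soundfile as sf

from my_codec import load_codec  # placeholder: your own model loader

# Hypothetical mapping from input sampling rate to the codec trained at that rate.
CODEC_BY_RATE = {
    16000: load_codec("my_codec_16k"),
    44100: load_codec("my_codec_44k"),
}

def resynthesize(path):
    wav, sr = sf.read(path)
    codec = CODEC_BY_RATE[sr]  # pick the codec matching this file's rate
    return codec.decode(codec.encode(wav, sr), sr), sr
```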
Hi, I found that `codec_superb_data` contains many datasets but does not come with data preprocessing code. Does that mean I need to re-synthesize each dataset separately myself, following the SPEECH and AUDIO classifications, and run `run.sh` once per dataset to evaluate its re-synthesized audio? Or should I first gather all the re-synthesized files of the same classification (SPEECH or AUDIO) together, and run `run.sh` once to get a single score over all datasets in that classification? I'm a bit confused about the evaluation rules and would appreciate an answer.