[Closed] LinuxBeginner closed this issue 3 years ago
Hi, the audio and text data should live in separate files. From the audio, you should extract features like MFCCs to be processed by these models, while the text can stay in txt files, though you may want to run some preprocessing such as tokenization or removal of annotations. @mgaido91 can you suggest state-of-the-art open source software for this?
Hi, I suggest you do the following.
You need a single file with all the transcripts and another with all the translations, one sentence per line. In addition, you need to create a YAML file that contains, for each line, the corresponding audio segment. E.g., the content of your source1.txt should be the first line of your train.src file, that of target1.txt the first line of your train.tgt, and the first line of train.yaml should be:
- {duration: HOW_LONG_YOUR_FILE_IS_OR_A_VERY_LARGE_NUMBER, offset: 0.0, speaker_id: ID_OF_THE_SPEAKER_OR_NAME_OF_THE_FILE, wav: source1.wav}
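A minimal sketch of how such a train.yaml line could be assembled (a hypothetical helper, not part of FBK-fairseq-ST; the field names follow the example above, and the duration is read with Python's standard wave module):

```python
import wave

def yaml_segment(wav_path, speaker_id, offset=0.0):
    """Build one train.yaml entry for a full-file segment.

    Reads the duration from the WAV header; falls back to a very
    large number if the file cannot be read, as suggested above.
    """
    try:
        with wave.open(wav_path, "rb") as w:
            duration = w.getnframes() / w.getframerate()
    except (OSError, wave.Error):
        duration = 1e9  # "a very large number"
    return ("- {duration: %s, offset: %s, speaker_id: %s, wav: %s}"
            % (duration, offset, speaker_id, wav_path))
```

Calling it once per audio file, in the same order as the lines of train.src, produces the full train.yaml.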
Download Moses and preprocess both the transcripts and the translations with:
${mosesdir}normalize-punctuation.perl -l $lang < $YOUR_INPUT_FILE | ${mosesdir}tokenizer.perl -l $lang | ${mosesdir}deescape-special-chars.perl > $YOUR_INPUT_FILE.tok
Then, learn BPE or another subword segmentation using SentencePiece or any other tool you like. After this step, create the fairseq datasets containing the textual information with
python FBK-fairseq-ST/preprocess.py --trainpref $YOUR_TRAIN.bpe --validpref $YOUR_DEV.bpe --testpref $YOUR_TEST.bpe --destdir $THE_DIR_WHERE_YOU_WANT_YOUR_FAIRSEQ_DATASETS -s $src_lang -t $tgt_lang --workers 1 --dataset-impl cached
Finally, create symbolic links so that the file names do not contain the src_lang-tgt_lang
pattern:
for f in $THE_DIR_WHERE_YOU_WANT_YOUR_FAIRSEQ_DATASETS/*.$src_lang-$tgt_lang.*; do ln -s $f $(echo $f | sed "s/\.$src_lang-$tgt_lang\./\./g"); done
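The same renaming can also be done from Python, which avoids the shell-quoting pitfalls of the loop above (a sketch; the directory and language names are placeholders):

```python
import glob
import os

def link_without_langpair(dest_dir, src_lang, tgt_lang):
    """Symlink each fairseq file under a name without the '.src-tgt.' infix.

    E.g. for train.en-de.idx, create train.idx pointing at it.
    """
    infix = ".%s-%s." % (src_lang, tgt_lang)
    for path in glob.glob(os.path.join(dest_dir, "*%s*" % infix)):
        link = path.replace(infix, ".")
        if not os.path.exists(link):
            # Link relative to the directory, so the dataset stays movable.
            os.symlink(os.path.basename(path), link)
```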
For the audio, first extract the Mel filterbank features with XNMT. You need to create a YAML config file like this:
extract-test-data: !Experiment
  preproc: !PreprocRunner
    overwrite: False
    tasks:
    - !PreprocExtract
      in_files:
      - $THE_YAML_YOU_GENERATED_IN_1
      out_files:
      - $YOUR_H5_OUTPUT_FILE.h5
      specs: !MelFiltExtractor {}
and then you can use it with this command:
python xnmt/xnmt/xnmt_run_experiments.py config.yaml
At the end of this process, you need to preprocess the h5 dataset into a fairseq dataset, which is done with
python FBK-fairseq-ST/examples/speech_recognition/preprocess_audio.py --destdir $THE_DIR_WHERE_YOU_WANT_YOUR_FAIRSEQ_DATASETS --format h5 --trainpref $YOUR_H5_TRAIN_OUTPUT_FILE --validpref $YOUR_H5_DEV_OUTPUT_FILE --testpref $YOUR_H5_TEST_OUTPUT_FILE
After this, in your target folder you should have files like these:
train.src_lang.idx
train.src_lang.bin
train.tgt_lang.idx
train.tgt_lang.bin
train.npz.idx
train.npz.bin
...
And you are done with your preprocessing!
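As a final sanity check, one could verify that every .idx file in the destination folder has its companion .bin (a hypothetical helper, not part of FBK-fairseq-ST):

```python
import glob
import os

def missing_companions(dest_dir):
    """Return the .idx files in dest_dir whose .bin counterpart is missing."""
    missing = []
    for idx in glob.glob(os.path.join(dest_dir, "*.idx")):
        if not os.path.exists(idx[:-4] + ".bin"):
            missing.append(idx)
    return missing
```

An empty return value means the file pairs listed above are all complete.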
Hi, thank you for providing the repository.
Could you please guide me, how should I prepare my dataset, so that I can run the experiment?
Current dataset structure is as follows:
Source language: source1.wav, source1.txt (transcript of source1.wav), source2.wav, source2.txt, ...
Target language: target1.txt (translation of source1.txt), target2.txt, ...
I have also gone through the tutorial Getting Started with End-to-End Speech Translation. But I could not understand how to prepare or arrange my dataset to meet the FBK-Fairseq-ST requirements. Should I create a CSV file with the wav file names (source language) in the first column and the text (target language) in the next column, or some other JSON/CSV file that keeps track of, or maps, the audio and the text files?
I am new to this field; I would be thankful for any guidance.
Thank you.