mgaido91 / FBK-fairseq-ST

A repository containing the code for speech translation papers.
MIT License

Training with custom dataset #2

Closed LinuxBeginner closed 3 years ago

LinuxBeginner commented 3 years ago

Hi, thank you for providing the repository.

Could you please guide me on how I should prepare my dataset so that I can run the experiment?

My current dataset structure is as follows:

Source language: source1.wav, source1.txt (transcript of source1.wav), source2.wav, source2.txt, ...

Target language: target1.txt (translation of source1.txt), target2.txt, ...

I have also gone through the tutorial Getting Started with End-to-End Speech Translation, but I could not understand how I should prepare or arrange my dataset as per the FBK-fairseq-ST requirements. Should I create a CSV file with the wav file names (source language) in the first column and the text (target language) in the next column, or some other JSON/CSV file that maps the audio files to the text files?

I am new to this field and would be thankful for any guidance.

Thank you.

mattiadg commented 3 years ago

Hi, the audio and text data live in separate files. From the audio you should extract features such as MFCCs to be processed by these models; the text can stay in txt files, but you may want to run some preprocessing such as tokenization or removal of annotations. @mgaido91 can you suggest state-of-the-art open-source software for this?

mgaido91 commented 3 years ago

Hi, I suggest you do the following.

1. Generate YAML / other files

You need a single file with all the transcripts and a single file with all the translations, one sentence per line. In addition, you need to create a YAML file that contains, for each line, the corresponding audio segment. E.g., the content of your source1.txt should be the first line of your train.src file, target1.txt the first line of your train.tgt, and the first line of train.yaml should be:

- {duration: HOW_LONG_YOUR_FILE_IS_OR_A_VERY_LARGE_NUMBER, offset: 0.0, speaker_id: ID_OF_THE_SPEAKER_OR_NAME_OF_THE_FILE, wav: source1.wav}
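If it helps, here is a rough Python sketch (not part of the repository) of how you could assemble train.src, train.tgt and train.yaml from the per-utterance layout described in the issue. The data/ directory and the sourceN.wav / sourceN.txt / targetN.txt naming are assumptions taken from the example above, and each whole wav file is used as a single segment (offset 0.0):

# Rough sketch: build train.src, train.tgt and train.yaml from a directory
# containing sourceN.wav, sourceN.txt and targetN.txt (hypothetical layout).
import glob
import os
import wave

data_dir = "data"  # assumption: where your sourceN.wav / sourceN.txt / targetN.txt live

def index_of(path):
    # "source12.wav" -> 12, so files are processed in numeric order
    name = os.path.basename(path)
    return int(name[len("source"):-len(".wav")])

wavs = sorted(glob.glob(os.path.join(data_dir, "source*.wav")), key=index_of)

src_lines, tgt_lines, yaml_lines = [], [], []
for wav_path in wavs:
    i = index_of(wav_path)
    with open(os.path.join(data_dir, f"source{i}.txt")) as f:
        src_lines.append(f.read().strip())
    with open(os.path.join(data_dir, f"target{i}.txt")) as f:
        tgt_lines.append(f.read().strip())
    with wave.open(wav_path) as w:  # duration of the whole file, used as one segment
        duration = w.getnframes() / float(w.getframerate())
    yaml_lines.append(
        "- {duration: %.3f, offset: 0.0, speaker_id: %s, wav: %s}"
        % (duration, os.path.basename(wav_path), wav_path)
    )

# one line per segment in every file
assert len(src_lines) == len(tgt_lines) == len(yaml_lines)

with open("train.src", "w") as f:
    f.write("\n".join(src_lines) + "\n")
with open("train.tgt", "w") as f:
    f.write("\n".join(tgt_lines) + "\n")
with open("train.yaml", "w") as f:
    f.write("\n".join(yaml_lines) + "\n")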

2. Preprocessing text

Download Moses and preprocess both the transcripts and the translations with:

${mosesdir}normalize-punctuation.perl -l $lang < $YOUR_INPUT_FILE | ${mosesdir}tokenizer.perl -l $lang | ${mosesdir}deescape-special-chars.perl > $YOUR_INPUT_FILE.tok

Then, learn BPE or another subword segmentation using SentencePiece or any other tool you like (an illustrative SentencePiece sketch is shown at the end of this step). After this step, create the fairseq datasets containing the textual information with:

python FBK-fairseq-ST/preprocess.py --trainpref $YOUR_TRAIN.bpe --validpref $YOUR_DEV.bpe --testpref $YOUR_TEST.bpe --destdir $THE_DIR_WHERE_YOU_WANT_YOUR_FAIRSEQ_DATASETS -s $src_lang -t $tgt_lang --workers 1 --dataset-impl cached

Finally, create symbolic links so that the files have names without the src_lang-tgt_lang pattern:

for f in $THE_DIR_WHERE_YOU_WANT_YOUR_FAIRSEQ_DATASETS/*.$src_lang-$tgt_lang.*; do ln -s $f $(echo $f | sed "s/\.$src_lang-$tgt_lang\./\./g"); done
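As a purely illustrative example of the subword step mentioned above, here is a small SentencePiece sketch. The language codes, the train/dev/test.tok.LANG file names, the vocabulary size of 8000 and the "bpe" model prefix are assumptions, not values prescribed by the repository; the only point is to end up with files matching the train.bpe / dev.bpe / test.bpe prefixes passed to preprocess.py:

# Illustrative SentencePiece BPE example (hypothetical file names and vocab size).
import sentencepiece as spm

src_lang, tgt_lang = "en", "de"  # assumption: replace with your language codes

# Learn a joint BPE model on the Moses-tokenized transcripts and translations.
spm.SentencePieceTrainer.train(
    input=f"train.tok.{src_lang},train.tok.{tgt_lang}",
    model_prefix="bpe",
    vocab_size=8000,   # assumption: pick a size that fits your data
    model_type="bpe",
)

# Apply the model, producing e.g. train.bpe.en / train.bpe.de for preprocess.py above.
sp = spm.SentencePieceProcessor(model_file="bpe.model")
for split in ["train", "dev", "test"]:
    for lang in [src_lang, tgt_lang]:
        with open(f"{split}.tok.{lang}") as fin, open(f"{split}.bpe.{lang}", "w") as fout:
            for line in fin:
                fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")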

3. Preprocess audio

First, extract the Mel filterbank features with XNMT. You need to create a YAML config file like this:

extract-test-data: !Experiment
  preproc: !PreprocRunner
    overwrite: False
    tasks:
    - !PreprocExtract
      in_files:
      - $THE_YAML_YOU_GENERATED_IN_1
      out_files:
      - $YOUR_H5_OUTPUT_FILE.h5
      specs: !MelFiltExtractor {}

and then you can use it with this command:

python xnmt/xnmt/xnmt_run_experiments.py config.yaml
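If you want to sanity-check the extraction (optional), something along these lines works with h5py; the file name is an assumption and the exact key layout inside the file depends on XNMT, so treat it as a rough sketch:

# Optional sanity check of the extracted features (assumed output name train.h5).
import h5py

with h5py.File("train.h5", "r") as f:
    keys = list(f.keys())
    print(len(keys), "feature matrices")        # should match the number of segments in train.yaml
    first = f[keys[0]]
    print("first segment shape:", first.shape)  # roughly (num_frames, num_mel_bins)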

At the end of this process, you need to convert the h5 dataset into a fairseq dataset, which is done with:

python FBK-fairseq-ST/examples/speech_recognition/preprocess_audio.py --destdir $THE_DIR_WHERE_YOU_WANT_YOUR_FAIRSEQ_DATASETS --format h5 --trainpref $YOUR_H5_TRAIN_OUTPUT_FILE --validpref $YOUR_H5_DEV_OUTPUT_FILE --testpref $YOUR_H5_TEST_OUTPUT_FILE

After this, in your target folder you should have files like these:

train.src_lang.idx
train.src_lang.bin
train.tgt_lang.idx
train.tgt_lang.bin
train.npz.idx
train.npz.bin
...

And you are done with your preprocessing!