gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

Replace arg --lang with --lang-model and --audio-corpus #13

Closed mpuels closed 6 years ago

mpuels commented 6 years ago

Introduction

Currently, most scripts offer the argument --lang to choose between de and en. The choice de means that a language model is trained on the German text corpora (Europarl and Parole) and that an acoustic model is trained on the German audio corpora (VoxForge and TU-Darmstadt).

There is no way to choose for example just one text corpus and one audio corpus. Hereby I propose to change the command line arguments of some scripts to make it simpler to pick text and audio corpora to train an ASR system.

Current workflow

An example workflow to train a German speech recognition system might look like

$ ./speech_sentences_de.py
$ ./speech_build_lm.py --lang de
$ ./speech_kaldi_export.py --lang de
$ cd data/dst/speech/de/kaldi
$ ./build-lm.sh
$ ./run-chain.sh

Under the hood, speech_sentences_de.py extracts sentences from two text corpora (Europarl and Parole) and writes them to a text file containing one sentence per line. Each of the corpora has to be parsed in its own way to extract sentences from it. Currently, there is no way to build a language model on exactly one corpus (except by altering the script, of course).

The command speech_build_lm.py --lang de concatenates text files containing one sentence per line and builds a 3-gram language model using SRILM.
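As a rough sketch of what happens in this step: the per-corpus sentence files are merged into one training file, which is then handed to SRILM's model-estimation tool ngram-count. The file names, flags, and helper functions below are my assumptions for illustration, not taken from the actual script.

```python
def concatenate(corpus_files, train_txt):
    """Merge the per-corpus sentence files (one sentence per line)
    into a single training file for the LM toolkit."""
    with open(train_txt, "w", encoding="utf-8") as out:
        for path in corpus_files:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    out.write(line)

def build_lm_command(train_txt, lm_arpa, order=3):
    """Construct an SRILM ngram-count invocation for an n-gram LM.
    The smoothing flags are a common choice, assumed here."""
    return [
        "ngram-count",
        "-order", str(order),           # 3-gram model
        "-text", train_txt,             # one sentence per line
        "-lm", lm_arpa,                 # output in ARPA format
        "-interpolate", "-kndiscount",  # assumed smoothing options
    ]
```

The command list would then be run via subprocess; splitting the command construction from its execution keeps the path logic easy to test.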

The command speech_kaldi_export.py --lang de consumes all transcripts for the VoxForge and TU-Darmstadt corpora in data/src/speech/de/transcripts_*.csv, the pronunciation dictionary data/src/speech/de/dict.ipa, and the language model created in the previous step, and "deploys" them to data/dst/speech/de/kaldi in a way that adheres to the Kaldi interface (regarding directory and file structure). Currently, there is no way to conveniently choose a specific subset of the audio corpora. But it would be nice to be able to pick, say, a single small audio corpus for regression testing.

The script build-lm.sh converts the language model from the ARPA format to the finite state transducer G.fst - as Kaldi expects it.

Finally, run-chain.sh uses Kaldi to train the acoustic model.

Here is an example sequence of commands to train a complete English ASR system with Kaldi:

$ ./speech_sentences_en.py
$ ./speech_build_lm.py --lang en
$ ./speech_kaldi_export.py --lang en
$ cd data/dst/speech/en/kaldi
$ ./build-lm.sh
$ ./run-chain.sh

Proposed workflow

This is an example of the proposed workflow for creating an ASR system:

$ ./speech_sentences.py europarl-de
$ ./speech_sentences.py parole
$ ./speech_sentences.py voxforge-de-prompts
$ ./speech_build_lm.py europarl-de \
                       parole \
                       voxforge-de-prompts \
                       lm-europarl-de-parole-voxforge-de-prompts
$ ./speech_kaldi_export.py --audio-corpus voxforge \
                           --audio-corpus tu-darmstadt \
                           --language-model lm-europarl-de-parole-voxforge-de-prompts \
                           --dictionary dict-de.ipa \
                           --model-name experiment1
$ cd data/dst/speech/kaldi/experiment1
$ ./build-lm.sh
$ ./run-chain.sh

In the example we create an ASR system where the language model is based on the Europarl, Parole, and VoxForge prompt text corpora. To train the acoustic model, the VoxForge and TU-Darmstadt corpora are used.

The command ./speech_sentences.py TEXTCORPUS writes the extracted sentences by convention to data/dst/text-corpora/TEXTCORPUS.txt.

The command speech_build_lm.py TEXTCORPUS [TEXTCORPUS ...] LMNAME expects as arguments a list of text corpora and a name for the resulting language model (LMNAME). The files of the language model are written to the directory data/dst/lm/LMNAME/.

The command speech_kaldi_export.py expects one or more --audio-corpus, exactly one --language-model, and exactly one --model-name MODELNAME. The script creates a directory data/dst/asr-models/kaldi/MODELNAME and places all files required by Kaldi in it.
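A minimal argparse sketch of the proposed speech_kaldi_export.py interface; only the option names come from the proposal above, the help texts are illustrative:

```python
import argparse

def make_parser():
    """Sketch of the proposed command line for speech_kaldi_export.py."""
    parser = argparse.ArgumentParser(prog="speech_kaldi_export.py")
    parser.add_argument("--audio-corpus", action="append", required=True,
                        help="audio corpus to train on; may be given multiple times")
    parser.add_argument("--language-model", required=True,
                        help="name of a language model under data/dst/lm/")
    parser.add_argument("--dictionary", required=True,
                        help="pronunciation dictionary, e.g. dict-de.ipa")
    parser.add_argument("--model-name", required=True,
                        help="export goes to data/dst/asr-models/kaldi/MODELNAME")
    return parser
```

With action="append", repeating --audio-corpus collects all given corpora into a list, which matches the "one or more" requirement.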

@gooofy What do you think about my suggested changes?

UPDATES 2018-04-03

svenha commented 6 years ago

Excellent suggestions. It will also simplify adding local resources.

One detail question: You mentioned a "corrected version of the TU-Darmstadt corpus (gspv2)". What kind of corrections are included? When are these applied?

mpuels commented 6 years ago

By "corrected version" I mean that not all audio segments contained in the original corpus are put into Kaldi's wav.scp for training or testing:

https://github.com/gooofy/speech/blob/8060352ecd6e02a499a8da033a5513ed922c5f86/speech_transcripts.py#L143

According to Transcript.split() only segments in data/src/speech/de/transcripts_*.csv with quality >= 2 are considered. I reckon the repo owner filters out segments where prompt and actual recorded utterance don't match.
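As I understand it, the filter boils down to something like the following sketch; the segment layout is my guess, only the ">= 2" threshold comes from the linked code:

```python
# Rough sketch of the quality filter applied in Transcript.split().
MIN_QUALITY = 2

def usable_segments(segments):
    """Keep only segments whose review quality is at least MIN_QUALITY.

    Lower ratings presumably mark recordings where prompt and actual
    utterance do not match.
    """
    return [s for s in segments if s["quality"] >= MIN_QUALITY]
```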

gooofy commented 6 years ago

@mpuels excellent ideas and a very nice write-up! couple of remarks from my side:

mpuels commented 6 years ago
  1. Pronunciation dictionary as command line arg. Good point, I had overlooked it. I suggest that for the time being we add a mandatory argument --dictionary for speech_kaldi_export.py. Later we can think about how to deal with multiple languages and dictionaries.

  2. Augmented corpora. As you mentioned in point 3, the scope of my proposal is already big enough, so I'd say let's postpone the scripted augmentation of corpora.

  3. Create subtasks. Outline directory structures. Good point on planning the directory structures. Here are the current and proposed structures for src/ and dst/, respectively. Have I missed any important files?

Current structure of src/:

src/
└── speech
    ├── de
    │   ├── dict.ipa
    │   ├── spk2gender
    │   ├── spk_test.txt
    │   ├── tokenizer_errors.txt
    │   ├── transcripts_00.csv
    │   ├── transcripts_01.csv
    │   └── transcripts_02.csv
    ├── en
    │   ├── dict.ipa
    │   ├── spk2gender
    │   ├── transcripts_00.csv
    │   ├── transcripts_01.csv
    │   ├── transcripts_02.csv
    │   └── transcripts_03.csv
    ├── kaldi-cmd.sh
...

Proposed structure of src/:

src/
  dicts/
    dict-de.ipa
    dict-en.ipa
  speech/
    voxforge-de/
      spk2gender
      spk_test.txt
      transcripts_00.csv
      transcripts_01.csv
    gspv2/
      spk2gender
      spk_test.txt
      transcripts_00.csv
  kaldi-cmd.sh
  ...

Current structure of dst/:

dst/
  speech/
    de/
      kaldi/
        ...
      srilm/
        lm.arpa
        lm_full.arpa
        train_all.txt
      punkt.pickle
      sentences.txt
    en/
      ...

Proposed structure of dst/:

dst/
  lm/
    lm-europarl-de/
      lm.arpa
      lm_full.arpa
      train_all.txt
    lm-europarl-de-parole/
      lm.arpa
      lm_full.arpa
      train_all.txt
  asr-models/
    kaldi/
      experiment1/
        ...
      experiment2/
        ...
  text-corpora/
    europarl-de.txt
    parole.txt
  tokenizers/
    punkt.pickle

  4. TODO list. Small iterations. I've grepped through all Python scripts for --lang and came up with the following list of affected scripts. The scripts affected by my proposal (in section "Proposed workflow" above) are marked with an asterisk (*). Note that the scripts speech_sentences_{de,en}.py are also affected.
abook-transcribe.py
apply_review.py
auto_review.py
speech_audio_scan.py
* speech_build_lm.py
speech_deepspeech_export.py
speech_editor.py
speech_gender.py
* speech_kaldi_export.py
speech_lex_export_espeak.py
speech_lex_missing.py
speech_sequitur_export.py
speech_sphinx_export.py
speech_stats.py

You're right, a lot of scripts use the --lang argument. If we look at our 2 main uses of this repo for the near future, I come up with the following list of high priority scripts (please add more for the first use, as I'm not sure which scripts are relevant there):

a) grow audio corpus by transcribing Podcasts with existing transcriptions

b) train models based on German audio corpora (VoxForge, TU-Darmstadt, Forschergeist)

  5. Split of transcription database. See point 3 above.

UPDATES 2018-04-03

gooofy commented 6 years ago

@mpuels: once again thank you for this impressive writeup, I like it a lot! - and sorry for my late reply, just couldn't fit in a decent sized timeslot this week to give your ideas at least a somewhat appropriate amount of thought.

I suggest that for the time being we add a mandatory argument --dictionary for speech_kaldi_export.py. Later we can think about how to deal with multiple languages and dictionaries.

very good.

As you mentioned in point 3, the scope of my proposal is already big enough, so I'd say let's postpone the scripted augmentation of corpora.

agreed.

Here is the current and proposed structure for src/ and dst/ respectively.

[...] very good, I like it a lot. some remarks on minor details:

dst/
  lm/
    lm-europarl-de/

I think we could include the name of the tool that was used to compute the lm here, so we could support multiple lm tools in the future, e.g. srilm-europarl-de kenlm-europarl-de

etc

speech/

not sure if we could find a better name for this one - "asr" maybe? or "models" or "audio-models" maybe ?

kaldi/
  experiment1/
    ...
  experiment2/
    ...

very cool - so we can have multiple experiments side-by-side

text-corpora/
  europarl-de.txt
  parole.txt
tokenizers/
  punkt.pickle

ah, I like how we can have multiple tokenizer models here if we want and still keep the directory structure clean.

TODO list. Small iterations.

You're right, a lot of scripts use the --lang argument. If we look at our 2 main uses of this repo for the near future, I come up with the following list of high priority scripts (please add more for the first use, as I'm not sure which scripts are relevant there):

I think it is a very good idea to select scripts to change with use cases in mind!

a) grow audio corpus by transcribing Podcasts with existing transcriptions

abook-transcribe.py speech_editor.py speech_lex_edit.py ...?

actually, with the latest approach I am using, neither speech_editor nor speech_lex_edit are needed, as the relevant parts of them are integrated (read: copied and pasted %-) ) into abook-transcribe

besides abook-transcribe, speech_audio_scan.py and auto_review.py are needed (and optionally noisy_gen.py and/or phone_gen.py).

mpuels commented 6 years ago
  1. Name of tool in name of language model. My example of the proposed workflow contains the language model lm-europarl-de, but this name is arbitrary. A user can name her language models however she likes on the command line.

  2. Replace name speech. I prefer to replace it with asr-models, because a) asr is too generic, b) models is too generic as well; it could mean "language model" or "acoustic model", and c) audio-model is not an established term in the ASR community. I've updated https://github.com/gooofy/speech/issues/13#issuecomment-376138989 accordingly.

  3. Required scripts to grow audio corpus. Ok, I've removed speech_editor.py and speech_lex_edit.py and have added speech_audio_scan.py and auto_review.py to the list of scripts in https://github.com/gooofy/speech/issues/13#issuecomment-376138989.

gooofy commented 6 years ago

Name of tool in name of language model. My example of the proposed workflow contains the language model lm-europarl-de, but this name is arbitrary. A user can name her language models however she likes on the command line.

ah, I see! didn't realize the name was user-chosen %)

Replace name speech. I prefer to replace it with asr-models, because a) asr is too generic, b) models is too generic as well; it could mean "language model" or "acoustic model", and c) audio-model is not an established term in the ASR community. I've updated #13 (comment) accordingly.

excellent, asr-models it is, then :)

Required scripts to grow audio corpus. Ok, I've removed speech_editor.py and speech_lex_edit.py and have added speech_audio_scan.py and auto_review.py to the list of scripts in #13 (comment).

cool, thanks :)

mpuels commented 6 years ago

@gooofy I just realized that I forgot to include speech_audio_scan.py into my proposed workflow. The script speech_kaldi_export.py relies on audio files in .speechrc.wav16_dir_de or .speechrc.wav16_dir_en, respectively.

Current behaviour of speech_audio_scan.py. As far as I understand, these are speech_audio_scan.py's current responsibilities:

Suggested behaviour of speech_audio_scan.py. Here's my suggestion for speech_audio_scan.py's new behaviour. Let's look at an example invocation:

$ ./speech_audio_scan.py voxforge_de gspv2

The arguments are names of variables in .speechrc corresponding to locations of speech corpora on disk, i.e. currently vf_audiodir_de, gspv2_dir, vf_audiodir_en, and librivoxdir. To be able to use variable names from .speechrc directly as arguments for scripts (without looking awkward), I propose to rename the following variables:

vf_audiodir_de -> voxforge_de
vf_contribdir_de -> voxforge_contrib_de
extrasdir_de -> audio_extras_de
gspv2_dir -> gspv2

vf_audiodir_en -> voxforge_en
extrasdir_en -> audio_extras_en
librivoxdir -> librivox
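For reference, the proposed renaming can be written down as a simple mapping. The mapping itself comes from the list above; the helper that rewrites a ~/.speechrc line is only an illustrative one-off migration sketch:

```python
# Proposed .speechrc variable renames, old name -> new name.
SPEECHRC_RENAMES = {
    "vf_audiodir_de":   "voxforge_de",
    "vf_contribdir_de": "voxforge_contrib_de",
    "extrasdir_de":     "audio_extras_de",
    "gspv2_dir":        "gspv2",
    "vf_audiodir_en":   "voxforge_en",
    "extrasdir_en":     "audio_extras_en",
    "librivoxdir":      "librivox",
}

def rename_speechrc_line(line):
    """Rewrite an 'old = value' config line to use the new variable name;
    lines without '=' (or with unknown keys) pass through unchanged."""
    key, sep, value = line.partition("=")
    if not sep:
        return line
    new_key = SPEECHRC_RENAMES.get(key.strip(), key.strip())
    return "%s =%s" % (new_key, value)
```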

Internally the script knows how to treat each audio corpus. For example, it knows that below the directory gspv2 are the directories train/, dev/, and test/. It will process all three subfolders, because the training/test split is performed by a downstream script (speech_kaldi_export.py).

The script will convert each found audio file to a wav file sampled at 16kHz.

It will write wav files into separate folders, depending on the audio corpus. There will be a new variable .speechrc.wav16, making wav16_dir_de and wav16_dir_en obsolete:

wav16 = /home/bofh/data/wav16

The above example invocation would write wav files into the following folders:

/home/bofh/data/wav16/voxforge_de
/home/bofh/data/wav16/gspv2

The names of the subfolders are equal to the variable names in .speechrc and equal to the script's arguments. In my opinion, that's very convenient from a user's perspective.

Lastly, the script will update the transcript databases corresponding to each speech corpus. In the example invocation, these are

src/speech/voxforge_de/transcripts_00.csv
src/speech/gspv2/transcripts_00.csv

Again, the names of the subfolders under speech/ are equal to the variable names in .speechrc and equal to ... you get the idea :)
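Put together, the per-corpus path convention would be something like the sketch below. The actual conversion step is out of scope here, and the function name is hypothetical; only the path layout follows the proposal:

```python
import os

def scan_targets(corpus, wav16_root, src_root="src/speech"):
    """Return (wav16 output dir, transcripts csv path) for one corpus.

    The corpus name given on the command line doubles as the .speechrc
    variable name, the wav16 subfolder, and the transcripts subfolder.
    """
    wav_dir = os.path.join(wav16_root, corpus)
    transcripts = os.path.join(src_root, corpus, "transcripts_00.csv")
    return wav_dir, transcripts
```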

What do you think?

gooofy commented 6 years ago

Here's my suggestion for speech_audio_scan.py's new behaviour. Let's look at an example invocation:

[...]

The names of the subfolders are equal to the variable names in .speechrc and equal to the script's arguments. In my opinion, that's very convenient from a user's perspective.

this is definitely a huge step in the right direction, I like it a lot!

However, I am wondering whether we could go for a fully symmetrical approach here: right now, we have this concept of .speechrc variables on the input side but a single wav16 directory with subdirectories on the output side. Why not have a single source directory (i.e. audio_src or audio_corpus) variable in .speechrc and look for subdirectories in there just as we do on the output side?

so with these settings in ~/.speechrc:

audio_corpus = /home/bofh/data/audio_corpus
wav16 = /home/bofh/data/wav16

the example invocation

$ ./speech_audio_scan.py voxforge_de gspv2

would look for source audio files in

/home/bofh/data/audio_corpus/voxforge_de
/home/bofh/data/audio_corpus/gspv2

and put the generated wav files in

/home/bofh/data/wav16/voxforge_de
/home/bofh/data/wav16/gspv2

If users want to distribute the source audio files on their disk, they can simply use symlinks to point to them.
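Under the symmetrical scheme, both sides could be derived from just the two .speechrc settings; a small sketch (the function name is illustrative, the path convention follows the example above):

```python
import os

def corpus_dirs(corpora, audio_corpus_root, wav16_root):
    """Map each corpus name to its (source dir, wav16 output dir) pair.

    The corpus name selects the subdirectory on both the input and the
    output side, so no per-corpus .speechrc variables are needed.
    """
    return {
        name: (os.path.join(audio_corpus_root, name),
               os.path.join(wav16_root, name))
        for name in corpora
    }
```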

svenha commented 6 years ago

Is the new file layout ready for experimentation? If this would be too early, please let me know. (My Kaldi installation was broken after an upgrade from Ubuntu 17.10 to Ubuntu 18.04, so I had to rebuild anyway and thought about testing the new version of zamia-speech. Fortunately, the model files turned out to be compatible with the rebuilt version.)

mpuels commented 6 years ago

Yes, the new file layout is ready for experimentation. Guenter has merged my pull request containing the corresponding changes. @svenha Have fun with the new scripts :D