gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

Replace arg --lang with --lang-model and --audio-corpus #13

Closed mpuels closed 6 years ago

mpuels commented 6 years ago

Introduction

Currently, most scripts offer the argument --lang to choose between de and en. The choice de means that a language model is trained on the German text corpora (Europarl and Parole) and that an acoustic model is trained on the German audio corpora (VoxForge and TU-Darmstadt).

There is no way to choose for example just one text corpus and one audio corpus. Hereby I propose to change the command line arguments of some scripts to make it simpler to pick text and audio corpora to train an ASR system.

Current workflow

An example workflow to train a German speech recognition system might look like

$ ./speech_sentences_de.py
$ ./speech_build_lm.py --lang de
$ ./speech_kaldi_export.py --lang de
$ cd data/dst/speech/de/kaldi
$ ./build-lm.sh
$ ./run-chain.sh

Under the hood, speech_sentences_de.py extracts sentences from two text corpora (Europarl and Parole) and writes them to a text file containing one sentence per line. Each of the corpora has to be parsed in its own way to extract sentences from it. Currently, there is no way to build a language model on exactly one corpus (except by altering the script, of course).

The command speech_build_lm.py --lang de concatenates text files containing one sentence per line and builds a 3-gram language model using SRILM.
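As a rough sketch of what happens in this step: the per-corpus sentence files are merged into one training file, which is then handed to SRILM's model-estimation tool ngram-count. The file names, flags, and helper functions below are my assumptions for illustration, not taken from the actual script.

```python
def concatenate(corpus_files, train_txt):
    """Merge the per-corpus sentence files (one sentence per line)
    into a single training file for the LM toolkit."""
    with open(train_txt, "w", encoding="utf-8") as out:
        for path in corpus_files:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    out.write(line)

def build_lm_command(train_txt, lm_arpa, order=3):
    """Construct an SRILM ngram-count invocation for an n-gram LM.
    The smoothing flags are a common choice, assumed here."""
    return [
        "ngram-count",
        "-order", str(order),           # 3-gram model
        "-text", train_txt,             # one sentence per line
        "-lm", lm_arpa,                 # output in ARPA format
        "-interpolate", "-kndiscount",  # assumed smoothing options
    ]
```

The command list would then be run via subprocess; splitting the command construction from its execution keeps the path logic easy to test.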

The command speech_kaldi_export.py --lang de consumes all transcripts for the VoxForge and TU-Darmstadt corpora in data/src/speech/de/transcripts_*.csv, the pronunciation dictionary data/src/speech/de/dict.ipa, and the language model created in the previous step, and "deploys" them to data/dst/speech/de/kaldi in a way that adheres to the Kaldi interface (regarding directory and file structure). Currently, there is no way to conveniently choose a specific subset of the audio corpora. But it would be nice to be able to pick, say, a single small audio corpus for regression testing.

The script build-lm.sh converts the language model from the ARPA format to the finite state transducer G.fst - as Kaldi expects it.

Finally, run-chain.sh uses Kaldi to train the acoustic model.

Here is an example sequence of commands to train a complete English ASR system with Kaldi:

$ ./speech_sentences_en.py
$ ./speech_build_lm.py --lang en
$ ./speech_kaldi_export.py --lang en
$ cd data/dst/speech/en/kaldi
$ ./build-lm.sh
$ ./run-chain.sh

Proposed workflow

This is an example of the proposed workflow for creating an ASR system:

$ ./speech_sentences.py europarl-de
$ ./speech_sentences.py parole
$ ./speech_sentences.py voxforge-de-prompts
$ ./speech_build_lm.py europarl-de \
                       parole \
                       voxforge-de-prompts \
                       lm-europarl-de-parole-voxforge-de-prompts
$ ./speech_kaldi_export.py --audio-corpus voxforge \
                           --audio-corpus tu-darmstadt \
                           --language-model lm-europarl-de-parole-voxforge-de-prompts \
                           --dictionary dict-de.ipa \
                           --model-name experiment1
$ cd data/dst/speech/kaldi/experiment1
$ ./build-lm.sh
$ ./run-chain.sh

In the example we create an ASR system where the language model is based on the Europarl, Parole, and VoxForge prompt text corpora. To train the acoustic model, the VoxForge and TU-Darmstadt corpora are used.

The command ./speech_sentences.py TEXTCORPUS writes the extracted sentences by convention to data/dst/text-corpora/TEXTCORPUS.txt.

The command speech_build_lm.py TEXTCORPUS [TEXTCORPUS ...] LMNAME expects as arguments a list of text corpora and a name for the resulting language model (LMNAME). The files of the language model are written to the directory data/dst/lm/LMNAME/.

The command speech_kaldi_export.py expects one or more --audio-corpus, exactly one --language-model, and exactly one --model-name MODELNAME. The script creates a directory data/dst/asr-models/kaldi/MODELNAME and places all files required by Kaldi in it.
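A minimal argparse sketch of the proposed speech_kaldi_export.py interface; only the option names come from the proposal above, the help texts are illustrative:

```python
import argparse

def make_parser():
    """Sketch of the proposed command line for speech_kaldi_export.py."""
    parser = argparse.ArgumentParser(prog="speech_kaldi_export.py")
    parser.add_argument("--audio-corpus", action="append", required=True,
                        help="audio corpus to train on; may be given multiple times")
    parser.add_argument("--language-model", required=True,
                        help="name of a language model under data/dst/lm/")
    parser.add_argument("--dictionary", required=True,
                        help="pronunciation dictionary, e.g. dict-de.ipa")
    parser.add_argument("--model-name", required=True,
                        help="export goes to data/dst/asr-models/kaldi/MODELNAME")
    return parser
```

With action="append", repeating --audio-corpus collects all given corpora into a list, which matches the "one or more" requirement.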

@gooofy What do you think about my suggested changes?

UPDATES 2018-04-03

svenha commented 6 years ago

Excellent suggestions. It will also simplify adding local resources.

One detail question: You mentioned a "corrected version of the TU-Darmstadt corpus (gspv2)". What kind of corrections are included? When are these applied?

mpuels commented 6 years ago

By "corrected version" I mean that not all audio segments contained in the original corpus are put into Kaldi's wav.scp for training or testing:

https://github.com/gooofy/speech/blob/8060352ecd6e02a499a8da033a5513ed922c5f86/speech_transcripts.py#L143

According to Transcript.split() only segments in data/src/speech/de/transcripts_*.csv with quality >= 2 are considered. I reckon the repo owner filters out segments where prompt and actual recorded utterance don't match.
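As I understand it, the filter boils down to something like the following sketch; the segment layout is my guess, only the ">= 2" threshold comes from the linked code:

```python
# Rough sketch of the quality filter applied in Transcript.split().
MIN_QUALITY = 2

def usable_segments(segments):
    """Keep only segments whose review quality is at least MIN_QUALITY.

    Lower ratings presumably mark recordings where prompt and actual
    utterance do not match.
    """
    return [s for s in segments if s["quality"] >= MIN_QUALITY]
```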

gooofy commented 6 years ago

@mpuels excellent ideas and a very nice write-up! couple of remarks from my side:

mpuels commented 6 years ago
  1. Pronunciation dictionary as command line arg. Good point, I had overlooked it. I suggest that for the time being we add a mandatory argument --dictionary for speech_kaldi_export.py. Later we can think about how to deal with multiple languages and dictionaries.

  2. Augmented corpora. As you mentioned in point 3, the scope of my proposal is already big enough, so I'd say let's postpone the scripted augmentation of corpora.

  3. Create subtasks. Outline directory structures. Good point on planning the directory structures. Here are the current and proposed structures for src/ and dst/, respectively. Have I missed any important files?

Current structure of src/:

src/
└── speech
    ├── de
    │   ├── dict.ipa
    │   ├── spk2gender
    │   ├── spk_test.txt
    │   ├── tokenizer_errors.txt
    │   ├── transcripts_00.csv
    │   ├── transcripts_01.csv
    │   └── transcripts_02.csv
    ├── en
    │   ├── dict.ipa
    │   ├── spk2gender
    │   ├── transcripts_00.csv
    │   ├── transcripts_01.csv
    │   ├── transcripts_02.csv
    │   └── transcripts_03.csv
    ├── kaldi-cmd.sh
...

Proposed structure of src/:

src/
  dicts/
    dict-de.ipa
    dict-en.ipa
  speech/
    voxforge-de/
      spk2gender
      spk_test.txt
      transcripts_00.csv
      transcripts_01.csv
    gspv2/
      spk2gender
      spk_test.txt
      transcripts_00.csv
  kaldi-cmd.sh
  ...

Current structure of dst/:

dst/
  speech/
    de/
      kaldi/
        ...
      srilm/
        lm.arpa
        lm_full.arpa
        train_all.txt
      punkt.pickle
      sentences.txt
    en/
      ...

Proposed structure of dst/:

dst/
  lm/
    lm-europarl-de/
      lm.arpa
      lm_full.arpa
      train_all.txt
    lm-europarl-de-parole/
      lm.arpa
      lm_full.arpa
      train_all.txt
  asr-models/
    kaldi/
      experiment1/
        ...
      experiment2/
        ...
  text-corpora/
    europarl-de.txt
    parole.txt
  tokenizers/
    punkt.pickle

  4. TODO list. Small iterations. I've grepped through all Python scripts for --lang and came up with the following list of affected scripts. The scripts affected by my proposal (in section "Proposed workflow" above) are marked with an asterisk (*). Note that the scripts speech_sentences_{de,en}.py are also affected.
abook-transcribe.py
apply_review.py
auto_review.py
speech_audio_scan.py
* speech_build_lm.py
speech_deepspeech_export.py
speech_editor.py
speech_gender.py
* speech_kaldi_export.py
speech_lex_export_espeak.py
speech_lex_missing.py
speech_sequitur_export.py
speech_sphinx_export.py
speech_stats.py

You're right, a lot of scripts use the --lang argument. If we look at our 2 main uses of this repo for the near future, I come up with the following list of high priority scripts (please add more for the first use, as I'm not sure which scripts are relevant there):

a) grow audio corpus by transcribing Podcasts with existing transcriptions

b) train models based on German audio corpora (VoxForge, TU-Darmstadt, Forschergeist)

  5. Split of transcription database. See point 3 above.

UPDATES 2018-04-03

gooofy commented 6 years ago

@mpuels: once again thank you for this impressive writeup, I like it a lot! - and sorry for my late reply, just couldn't fit in a decent sized timeslot this week to give your ideas at least a somewhat appropriate amount of thought.

I suggest that for the time being we add a mandatory argument --dictionary for speech_kaldi_export.py. Later we can think about how to deal with multiple languages and dictionaries.

very good.

As you mentioned in point 3, the scope of my proposal is already big enough, so I'd say let's postpone the scripted augmentation of corpora.

agreed.

Here is the current and proposed structure for src/ and dst/ respectively.

[...] very good, I like it a lot. some remarks on minor details:

dst/
  lm/
    lm-europarl-de/

I think we could include the name of the tool that was used to compute the lm here, so we could support multiple lm tools in the future, e.g. srilm-europarl-de kenlm-europarl-de

etc

speech/

not sure if we could find a better name for this one - "asr" maybe? or "models" or "audio-models" maybe ?

kaldi/
  experiment1/
    ...
  experiment2/
    ...

very cool - so we can have multiple experiments side-by-side

text-corpora/
  europarl-de.txt
  parole.txt
tokenizers/
  punkt.pickle

ah, I like how we can have multiple tokenizer models here if we want and still keep the directory structure clean.

TODO list. Small iterations.

You're right, a lot of scripts use the --lang argument. If we look at our 2 main uses of this repo for the near future, I come up with the following list of high priority scripts (please add more for the first use, as I'm not sure which scripts are relevant there):

I think it is a very good idea to select scripts to change with use cases in mind!

a) grow audio corpus by transcribing Podcasts with existing transcriptions

abook-transcribe.py speech_editor.py speech_lex_edit.py ...?

actually, with the latest approach I am using, neither speech_editor nor speech_lex_edit are needed, as the relevant parts of them are integrated (read: copied and pasted %-) ) into abook-transcribe

besides abook-transcribe, speech_audio_scan.py and auto_review.py are needed (and optionally noisy_gen.py and/or phone_gen.py).

mpuels commented 6 years ago
  1. Name of tool in name of language model. My example of the proposed workflow contains the language model lm-europarl-de, but this name is arbitrary. A user can name her language models however she likes on the command line.

  2. Replace name speech. I prefer to replace it with asr-models, because a) asr is too generic, b) models is too generic as well; it could mean "language model" or "acoustic model", and c) audio-model is not an established term in the ASR community. I've updated https://github.com/gooofy/speech/issues/13#issuecomment-376138989 accordingly.

  3. Required scripts to grow audio corpus. Ok, I've removed speech_editor.py and speech_lex_edit.py and have added speech_audio_scan.py and auto_review.py to the list of scripts in https://github.com/gooofy/speech/issues/13#issuecomment-376138989.

gooofy commented 6 years ago

Name of tool in name of language model. My example of the proposed workflow contains the language model lm-europarl-de, but this name is arbitrary. A user can name her language models however she likes on the command line.

ah, I see! didn't realize the name was user-chosen %)

Replace name speech. I prefer to replace it with asr-models, because a) asr is too generic, b) models is too generic as well; it could mean "language model" or "acoustic model", and c) audio-model is not an established term in the ASR community. I've updated #13 (comment) accordingly.

excellent, asr-models it is, then :)

Required scripts to grow audio corpus. Ok, I've removed speech_editor.py and speech_lex_edit.py and have added speech_audio_scan.py and auto_review.py to the list of scripts in #13 (comment).

cool, thanks :)

mpuels commented 6 years ago

@gooofy I just realized that I forgot to include speech_audio_scan.py into my proposed workflow. The script speech_kaldi_export.py relies on audio files in .speechrc.wav16_dir_de or .speechrc.wav16_dir_en, respectively.

Current behaviour of speech_audio_scan.py. As far as I understand, these are speech_audio_scan.py's current responsibilities:

Suggested behaviour of speech_audio_scan.py. Here's my suggestion for speech_audio_scan.py's new behaviour. Let's look at an example invocation:

$ ./speech_audio_scan.py voxforge_de gspv2

The arguments are names of variables in .speechrc corresponding to locations of speech corpora on disk, i.e. currently vf_audiodir_de, gspv2_dir, vf_audiodir_en, and librivoxdir. To be able to use variable names from .speechrc directly as arguments for scripts (without looking awkward), I propose to rename the following variables:

vf_audiodir_de -> voxforge_de
vf_contribdir_de -> voxforge_contrib_de
extrasdir_de -> audio_extras_de
gspv2_dir -> gspv2

vf_audiodir_en -> voxforge_en
extrasdir_en -> audio_extras_en
librivoxdir -> librivox
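For reference, the proposed renaming can be written down as a simple mapping. The mapping itself comes from the list above; the helper that rewrites a ~/.speechrc line is only an illustrative one-off migration sketch:

```python
# Proposed .speechrc variable renames, old name -> new name.
SPEECHRC_RENAMES = {
    "vf_audiodir_de":   "voxforge_de",
    "vf_contribdir_de": "voxforge_contrib_de",
    "extrasdir_de":     "audio_extras_de",
    "gspv2_dir":        "gspv2",
    "vf_audiodir_en":   "voxforge_en",
    "extrasdir_en":     "audio_extras_en",
    "librivoxdir":      "librivox",
}

def rename_speechrc_line(line):
    """Rewrite an 'old = value' config line to use the new variable name;
    lines without '=' (or with unknown keys) pass through unchanged."""
    key, sep, value = line.partition("=")
    if not sep:
        return line
    new_key = SPEECHRC_RENAMES.get(key.strip(), key.strip())
    return "%s =%s" % (new_key, value)
```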

Internally the script knows how to treat each audio corpus. For example, it knows that below the directory gspv2 are the directories train/, dev/, and test/. It will process all three subfolders, because the training/test split is performed by a downstream script (speech_kaldi_export.py).

The script will convert each found audio file to a wav file sampled at 16kHz.

It will write wav files into separate folders, depending on the audio corpus. There will be a new variable .speechrc.wav16, making wav16_dir_de and wav16_dir_en obsolete:

wav16 = /home/bofh/data/wav16

The above example invocation would write wav files into the following folders:

/home/bofh/data/wav16/voxforge_de
/home/bofh/data/wav16/gspv2

The names of the subfolders are equal to the variable names in .speechrc and equal to the script's arguments. In my opinion, that's very convenient from a user's perspective.

Lastly, the script will update the transcript databases corresponding to each speech corpus. In the example invocation, these are

src/speech/voxforge_de/transcripts_00.csv
src/speech/gspv2/transcripts_00.csv

Again, the names of the subfolders under speech/ are equal to the variable names in .speechrc and equal to ... you get the idea :)
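Put together, the per-corpus path convention would be something like the sketch below. The actual conversion step is out of scope here, and the function name is hypothetical; only the path layout follows the proposal:

```python
import os

def scan_targets(corpus, wav16_root, src_root="src/speech"):
    """Return (wav16 output dir, transcripts csv path) for one corpus.

    The corpus name given on the command line doubles as the .speechrc
    variable name, the wav16 subfolder, and the transcripts subfolder.
    """
    wav_dir = os.path.join(wav16_root, corpus)
    transcripts = os.path.join(src_root, corpus, "transcripts_00.csv")
    return wav_dir, transcripts
```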

What do you think?

gooofy commented 6 years ago

Here's my suggestion for speech_audio_scan.py's new behaviour. Let's look at an example invocation:

[...]

The names of the subfolders are equal to the variable names in .speechrc and equal to the script's arguments. In my opinion, that's very convenient from a user's perspective.

this is definitely a huge step in the right direction, I like it a lot!

However, I am wondering whether we could go for a fully symmetrical approach here: right now, we have this concept of .speechrc variables on the input side but a single wav16 directory with subdirectories on the output side. Why not have a single source directory (i.e. audio_src or audio_corpus) variable in .speechrc and look for subdirectories in there just as we do on the output side?

so with these settings in ~/.speechrc:

audio_corpus = /home/bofh/data/audio_corpus
wav16 = /home/bofh/data/wav16

the example invocation

$ ./speech_audio_scan.py voxforge_de gspv2

would look for source audio files in

/home/bofh/data/audio_corpus/voxforge_de
/home/bofh/data/audio_corpus/gspv2

and put the generated wav files in

/home/bofh/data/wav16/voxforge_de
/home/bofh/data/wav16/gspv2

If users want to distribute the source audio files on their disk, they can simply use symlinks to point to them.
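Under the symmetrical scheme, both sides could be derived from just the two .speechrc settings; a small sketch (the function name is illustrative, the path convention follows the example above):

```python
import os

def corpus_dirs(corpora, audio_corpus_root, wav16_root):
    """Map each corpus name to its (source dir, wav16 output dir) pair.

    The corpus name selects the subdirectory on both the input and the
    output side, so no per-corpus .speechrc variables are needed.
    """
    return {
        name: (os.path.join(audio_corpus_root, name),
               os.path.join(wav16_root, name))
        for name in corpora
    }
```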

svenha commented 6 years ago

Is the new file layout ready for experimentation? If this would be too early, please let me know. (My Kaldi installation was broken after an upgrade from Ubuntu 17.10 to Ubuntu 18.04, so I had to rebuild anyway and thought about testing the new version of zamia-speech. Fortunately, the model files turned out to be compatible with the rebuilt version.)

mpuels commented 6 years ago

Yes, the new file layout is ready for experimentation. Guenter has merged my pull request containing the corresponding changes. @svenha Have fun with the new scripts :D