Closed: mpuels closed this issue 6 years ago.
Excellent suggestions. It will also simplify adding local resources.
One detail question: You mentioned a "corrected version of the TU-Darmstadt corpus (gspv2)". What kind of corrections are included? When are these applied?
By "corrected version" I mean that not all audio segments contained in the original corpus are put into Kaldi's `wav.scp` for training or testing: according to `Transcript.split()`, only segments in `data/src/speech/de/transcripts_*.csv` with quality >= 2 are considered. I reckon the repo owner filters out segments where the prompt and the actually recorded utterance don't match.
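The quality filter described above can be sketched in a few lines. Note this is an illustration, not the actual `Transcript.split()` code, and the column names (`segment_id`, `quality`) are placeholders rather than the real `transcripts_*.csv` schema:

```python
import csv
import io

# Only segments with quality >= 2 make it into Kaldi's wav.scp.
MIN_QUALITY = 2

def usable_segments(csv_text):
    """Return the segment ids whose quality rating is >= MIN_QUALITY."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["segment_id"] for row in reader
            if int(row["quality"]) >= MIN_QUALITY]

example = """segment_id,quality
utt001,2
utt002,1
utt003,3
"""
print(usable_segments(example))  # ['utt001', 'utt003']
```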
@mpuels excellent ideas and a very nice write-up! couple of remarks from my side:
With arbitrary model names we will have to add a command line option to specify which dictionary to use when exporting the model. We could even think about allowing multiple dictionaries for multi-language models (e.g. a model that understands both English and German)?
We have discussed augmenting some corpora with filtered or distorted derivatives of the contained audio. I am not sure whether we should outline that in the workflow at this point or just postpone the idea for now?
I feel this is a pretty big task (since nearly all scripts will be affected), which we should break down into smaller ones if possible. I was wondering if it would make sense to outline the structure of the data/src, data/dst and external ... directories first (i.e. where each model will live, where we will have additional config files, if any, and where generated files will live), so we have some sort of reference map when we're working on the individual scripts.
Also, I wonder if it would make sense to create a TODO list of all affected scripts (i.e. grep for the --lang option across all scripts :o) ) and sort that list by priority. Is there a minimum set of scripts we can adapt first so we get something working quickly, even if it means that many things are broken at that point? I think that could be important not only for motivation but also to get a feel for the new setup as soon as possible: what sounds good on paper doesn't necessarily translate into a good system for day-to-day use, so we might want to iterate on these ideas early in the process.
The existing .csv databases will have to be split up (voxforge, tu-darmstadt (gspv2), phone, noise, ...) - we will have to decide where the resulting partial .csv dbs will live and what they should be called.
Pronunciation dictionary as command line arg. Good point, I had overlooked it. I suggest that for the time being we add a mandatory argument `--dictionary` for `speech_kaldi_export.py`. Later we can think about how to deal with multiple languages and dictionaries.
Augmented corpora. As you mentioned in point 3, the scope of my proposal is already big enough, so I'd say let's postpone the scripted augmentation of corpora.
Create subtasks. Outline directory structures.
Good point on planning the directory structures. Here are the current and proposed structures for `src/` and `dst/`, respectively. Have I missed any important files?

Current structure of `src/`:
```
src/
└── speech
    ├── de
    │   ├── dict.ipa
    │   ├── spk2gender
    │   ├── spk_test.txt
    │   ├── tokenizer_errors.txt
    │   ├── transcripts_00.csv
    │   ├── transcripts_01.csv
    │   └── transcripts_02.csv
    ├── en
    │   ├── dict.ipa
    │   ├── spk2gender
    │   ├── transcripts_00.csv
    │   ├── transcripts_01.csv
    │   ├── transcripts_02.csv
    │   └── transcripts_03.csv
    ├── kaldi-cmd.sh
    ...
```
Proposed structure of `src/`:
```
src/
    dicts/
        dict-de.ipa
        dict-en.ipa
    speech/
        voxforge-de/
            spk2gender
            spk_test.txt
            transcripts_00.csv
            transcripts_01.csv
        gspv2/
            spk2gender
            spk_test.txt
            transcripts_00.csv
        kaldi-cmd.sh
        ...
```
Current structure of `dst/`:
```
dst/
    speech/
        de/
            kaldi/
                ...
            srilm/
                lm.arpa
                lm_full.arpa
                train_all.txt
            punkt.pickle
            sentences.txt
        en/
            ...
```
Proposed structure of `dst/`:
```
dst/
    lm/
        lm-europarl-de/
            lm.arpa
            lm_full.arpa
            train_all.txt
        lm-europarl-de-parole/
            lm.arpa
            lm_full.arpa
            train_all.txt
    asr-models/
        kaldi/
            experiment1/
                ...
            experiment2/
                ...
    text-corpora/
        europarl-de.txt
        parole.txt
    tokenizers/
        punkt.pickle
```
TODO list. Small iterations. I grepped for `--lang` and came up with the following list of affected scripts. The scripts affected by my proposal (in the section "Proposed workflow" above) are marked with an asterisk (*). Note that the scripts `speech_sentences_{de,en}.py` are also affected.

```
  abook-transcribe.py
  apply_review.py
  auto_review.py
  speech_audio_scan.py
* speech_build_lm.py
  speech_deepspeech_export.py
  speech_editor.py
  speech_gender.py
* speech_kaldi_export.py
  speech_lex_export_espeak.py
  speech_lex_missing.py
  speech_sequitur_export.py
  speech_sphinx_export.py
  speech_stats.py
```
You're right, a lot of scripts use the `--lang` argument. If we look at our 2 main uses of this repo for the near future, I come up with the following list of high priority scripts (please add more for the first use, as I'm not sure which scripts are relevant there):

a) grow audio corpus by transcribing Podcasts with existing transcriptions

- abook-transcribe.py
- speech_audio_scan.py
- auto_review.py
- speech_editor.py
- speech_lex_edit.py

b) train models based on German audio corpora (VoxForge, TU-Darmstadt, Forschergeist)

- speech_sentences_{de,en}.py
- speech_build_lm.py
- speech_kaldi_export.py
UPDATES 2018-04-03: renamed `dst/speech` to `dst/asr-models`.

@mpuels: once again thank you for this impressive writeup, I like it a lot! - and sorry for my late reply, I just couldn't fit in a decent sized timeslot this week to give your ideas at least a somewhat appropriate amount of thought.
> I suggest that for the time being we add a mandatory argument `--dictionary` for `speech_kaldi_export.py`. Later we can think about how to deal with multiple languages and dictionaries.
very good.
> As you mentioned in point 3, the scope of my proposal is already big enough, so I'd say let's postpone the scripted augmentation of corpora.
agreed.
> Here is the current and proposed structure for `src/` and `dst/` respectively. [...]

very good, I like it a lot. some remarks on minor details:
> dst/ lm/ lm-europarl-de/

I think we could include the name of the tool that was used to compute the LM here, so we could support multiple LM tools in the future, e.g. `srilm-europarl-de`, `kenlm-europarl-de` etc.
> speech/

not sure if we could find a better name for this one - "asr" maybe? or "models" or "audio-models" maybe?
> kaldi/ experiment1/ ... experiment2/ ...

very cool - so we can have multiple experiments side-by-side
> text-corpora/ europarl-de.txt parole.txt tokenizers/ punkt.pickle

ah, I like how we can have multiple tokenizer models here if we want and still keep the directory structure clean.
> TODO list. Small iterations. You're right, a lot of scripts use the `--lang` argument. If we look at our 2 main uses of this repo for the near future, I come up with the following list of high priority scripts (please add more for the first use, as I'm not sure which scripts are relevant there):
I think it is a very good idea to select scripts to change with use cases in mind!
> a) grow audio corpus by transcribing Podcasts with existing transcriptions
> abook-transcribe.py speech_editor.py speech_lex_edit.py ...?
actually, with the latest approach I am using, neither speech_editor nor speech_lex_edit are needed, as the relevant parts of them are integrated (read: copied and pasted %-) ) into abook-transcribe. Besides abook-transcribe, `speech_audio_scan.py` and `auto_review.py` (and optionally `noisy_gen.py` and/or `phone_gen.py`) are needed.
Name of tool in name of language model. My example of the proposed workflow contains the language model `lm-europarl-de`, but this name is arbitrary. A user can name her language models however she likes on the command line.
Replace name speech. I prefer to replace it with `asr-models`, because a) `asr` is too generic, b) `models` is too generic as well; it could mean "language model" or "acoustic model", and c) `audio-model` is not an established term in the ASR community. I've updated https://github.com/gooofy/speech/issues/13#issuecomment-376138989 accordingly.
Required scripts to grow audio corpus. Ok, I've removed `speech_editor.py` and `speech_lex_edit.py` and have added `speech_audio_scan.py` and `auto_review.py` to the list of scripts in https://github.com/gooofy/speech/issues/13#issuecomment-376138989
> Name of tool in name of language model. My example of the proposed workflow contains the language model `lm-europarl-de`, but this name is arbitrary. A user can name her language models however she likes on the command line.
ah, I see! didn't realize the name was user-chosen %)
> Replace name speech. I prefer to replace it with `asr-models`, because a) `asr` is too generic, b) `models` is too generic as well; it could mean "language model" or "acoustic model", and c) `audio-model` is not an established term in the ASR community. I've updated #13 (comment) accordingly.
excellent, asr-models it is, then :)
> Required scripts to grow audio corpus. Ok, I've removed `speech_editor.py` and `speech_lex_edit.py` and have added `speech_audio_scan.py` and `auto_review.py` to the list of scripts in #13 (comment)
cool, thanks :)
@gooofy I just realized that I forgot to include `speech_audio_scan.py` in my proposed workflow. The script `speech_kaldi_export.py` relies on audio files in `.speechrc.wav16_dir_de` or `.speechrc.wav16_dir_en`, respectively.
Current behaviour of speech_audio_scan.py

As far as I understand, these are `speech_audio_scan.py`'s current responsibilities:

- scan the audio corpora under `.speechrc.vf_audiodir_de`, `.speechrc.extrasdir_de`, ...
- convert the audio files to wav files in `.speechrc.wav16_dir_de` or `.speechrc.wav16_dir_en` (depending on `--lang`)
- add each segment to `transcripts_*.csv` if it's not included yet

Suggested behaviour of speech_audio_scan.py

Here's my suggestion for `speech_audio_scan.py`'s new behaviour. Let's look at an example invocation:
```
$ ./speech_audio_scan.py voxforge_de gspv2
```
The arguments are names of variables in `.speechrc` corresponding to locations of speech corpora on disk, i.e. currently `vf_audiodir_de`, `gspv2_dir`, `vf_audiodir_en`, and `librivoxdir`. To be able to use variable names from `.speechrc` directly as arguments for scripts (without looking awkward), I propose to rename the following variables:
```
vf_audiodir_de   -> voxforge_de
vf_contribdir_de -> voxforge_contrib_de
extrasdir_de     -> audio_extras_de
gspv2_dir        -> gspv2
vf_audiodir_en   -> voxforge_en
extrasdir_en     -> audio_extras_en
librivoxdir      -> librivox
```
Internally the script knows how to treat each audio corpus. It knows, for example, that below the directory `gspv2` are the directories `train/`, `dev/`, and `test/`. It will process all three subfolders, because the training and test split is performed by a downstream script (`speech_kaldi_export.py`).
The script will convert each found audio file to a wav file sampled at 16 kHz.

It will write wav files into separate folders, depending on the audio corpus. There will be a new variable `.speechrc.wav16`, making `wav16_dir_de` and `wav16_dir_en` obsolete:

```
wav16 = /home/bofh/data/wav16
```
The above example invocation would write wav files into the following folders:

```
/home/bofh/data/wav16/voxforge_de
/home/bofh/data/wav16/gspv2
```

The names of the subfolders are equal to the variable names in `.speechrc` and equal to the script's arguments. In my opinion, that's very convenient from a user's perspective.
Lastly, the script will update the transcript databases corresponding to each speech corpus. In the example invocation, these are:

```
src/speech/voxforge_de/transcripts_00.csv
src/speech/gspv2/transcripts_00.csv
```

Again, the names of the subfolders under `speech/` are equal to the variable names in `.speechrc` and equal to ... you get the idea :)
What do you think?
> Here's my suggestion for speech_audio_scan.py's new behaviour. Let's look at an example invocation: [...] The names of the subfolders are equal to the variable names in .speechrc and equal to the script's arguments. In my opinion, that's very convenient from a user's perspective.
this is definitely a huge step in the right direction, I like it a lot!
However, I am wondering whether we could go for a fully symmetrical approach here: right now, we have this concept of .speechrc variables on the input side but a single wav16 directory with subdirectories on the output side. Why not have a single source directory (i.e. audio_src or audio_corpus) variable in .speechrc and look for subdirectories in there just as we do on the output side?
so with these settings in `~/.speechrc`:

```
audio_corpus = /home/bofh/data/audio_corpus
wav16 = /home/bofh/data/wav16
```
the example invocation

```
$ ./speech_audio_scan.py voxforge_de gspv2
```

would look for source audio files in

```
/home/bofh/data/audio_corpus/voxforge_de
/home/bofh/data/audio_corpus/gspv2
```

and put the generated wav files in

```
/home/bofh/data/wav16/voxforge_de
/home/bofh/data/wav16/gspv2
```
If users want to distribute the source audio files on their disk, they can simply use symlinks to point to them.
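The fully symmetrical layout boils down to one lookup rule. A small sketch (the function name is hypothetical; the variable names follow the `~/.speechrc` example above):

```python
from pathlib import Path

# One source root (audio_corpus) and one output root (wav16), each with a
# per-corpus subdirectory named after the command line argument.
def resolve_corpus_dirs(corpus, speechrc):
    src = Path(speechrc["audio_corpus"]) / corpus
    dst = Path(speechrc["wav16"]) / corpus
    return src, dst

speechrc = {
    "audio_corpus": "/home/bofh/data/audio_corpus",
    "wav16": "/home/bofh/data/wav16",
}
src, dst = resolve_corpus_dirs("gspv2", speechrc)
print(src)  # /home/bofh/data/audio_corpus/gspv2
print(dst)  # /home/bofh/data/wav16/gspv2
```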
Is the new file layout ready for experimentation? If this would be too early, please let me know. (My Kaldi installation was broken after an upgrade from Ubuntu 17.10 to Ubuntu 18.04, so I had to rebuild anyway and thought about testing the new version of zamia-speech. Fortunately, the model files turned out to be compatible with the rebuilt version.)
Yes, the new file layout is ready for experimentation. Guenter has merged my pull request containing the corresponding changes. @svenha Have fun with the new scripts :D
Introduction

Currently, most scripts offer the argument `--lang` to choose between `de` and `en`. The choice `de` means that a language model is trained on the text corpora (Europarl and Parole) and that an acoustic model is trained on the audio corpora (VoxForge and TU-Darmstadt).

There is no way to choose, for example, just one text corpus and one audio corpus. Here I propose to change the command line arguments of some scripts to make it simpler to pick the text and audio corpora used to train an ASR system.
Current workflow
An example workflow to train a German speech recognition system might look like
Under the hood

- `speech_sentences_de.py` extracts sentences from two text corpora (Europarl and Parole) and writes them to a text file containing one sentence per line. Each of the corpora has to be parsed in its own way to extract sentences from it. Currently, there is no way to build a language model on exactly one corpus (except by altering the script, of course).
- The command `speech_build_lm.py --lang de` concatenates text files containing one sentence per line and builds a language model (3-gram) using the program `ngram`.
- The command `speech_kaldi_export.py --lang de` consumes all transcripts for the VoxForge and TU-Darmstadt corpora in `data/src/speech/de/transcripts_*.csv`, the pronunciation dictionary `data/src/speech/de/dict.ipa`, and the language model created in the previous step, and "deploys" them to `speech/data/dst/speech/de/kaldi` in a way that adheres to the Kaldi interface (regarding directory and file structure). Currently there is no way to conveniently choose exactly one, two, or more audio corpora. But it would be nice to be able to choose a small audio corpus to do regression testing.
- The script `build-lm.sh` converts the language model from the ARPA format to the finite state transducer `G.fst`, as Kaldi expects it.
- Finally, `run-chain.sh` uses Kaldi to train the acoustic model.

Here is an example sequence of commands to train a complete English ASR system with Kaldi:
Proposed workflow

This is an example of the proposed workflow for creating an ASR system:

In the example we create an ASR system where the language model is based on the Europarl and Parole corpora. To train the acoustic model, the VoxForge and TU-Darmstadt corpora are used.
- The command `./speech_sentences.py TEXTCORPUS` writes the extracted sentences by convention to `data/dst/text-corpora/TEXTCORPUS.txt`.
- The command `speech_build_lm.py TEXTCORPUS [TEXTCORPUS ...] LMNAME` expects as arguments a list of text corpora and a name for the resulting language model (`LMNAME`). The files of the language model are written to the directory `data/dst/lm/LMNAME/`.
- The command `speech_kaldi_export.py` expects one or more `--audio-corpus` arguments, exactly one `--language-model`, and exactly one `--model-name MODELNAME`. The script creates a directory `data/dst/asr-models/kaldi/MODELNAME` and places all files required by Kaldi in it.

@gooofy What do you think about my suggested changes?
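A sketch of how the proposed `speech_kaldi_export.py` interface could be wired up with argparse. The flag spellings follow the proposal above (plus the `--dictionary` argument agreed on earlier); the help texts and `dest` names are my assumptions:

```python
import argparse

parser = argparse.ArgumentParser(description="Export a Kaldi training setup")
# repeatable: one or more audio corpora go into the acoustic model
parser.add_argument("--audio-corpus", action="append", required=True,
                    dest="audio_corpora",
                    help="audio corpus to train on (repeatable)")
parser.add_argument("--language-model", required=True,
                    help="name of the language model under data/dst/lm/")
parser.add_argument("--model-name", required=True,
                    help="target directory under data/dst/asr-models/kaldi/")
parser.add_argument("--dictionary", required=True,
                    help="pronunciation dictionary, e.g. dict-de.ipa")

args = parser.parse_args([
    "--audio-corpus", "voxforge_de",
    "--audio-corpus", "gspv2",
    "--language-model", "lm-europarl-de",
    "--model-name", "experiment1",
    "--dictionary", "dict-de.ipa",
])
print(args.audio_corpora)  # ['voxforge_de', 'gspv2']
```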
UPDATES 2018-04-03

- `speech_sentences.py` can also extract sentences from VoxForge's prompts.
- Added the mandatory argument `--dictionary` to `speech_kaldi_export.py`.