The Precise Wakeword Model Maker takes a sparse amount of data and creates a production quality wakeword model with Mycroft Precise. It's part of the Secret Sauce AI Wakeword Project.
The Precise Wakeword Model Maker pulls out all of the tricks in AI to turn a very sparse data set into a production quality model.
It all starts with a user data collection for the wakeword and not-wakeword categories. A user can use the Wakeword Data Collector.
When you don't have enough data to train a model, generate it. TTS engines are scraped similar to the data collection recipe using TTS plugins from OpenVoiceOS. The more the better!
How do you know if your test-training distribution yields the best model? When it comes to big data sets, randomly splitting the data once (e.g. 80/20%) is usually good enough. However, when dealing with sparse data sets, the initial test-training split becomes more important. By splitting the data set many times and training experimental models, the best initial data distribution can be found. This step can boost the model's performance on the training set by as much as ~10%.
Only add false positives(*) to the training/test set. Why add a bunch of files that the model can already classify correctly, when you can give the model lessons where it needs to improve?
Speaking of lessons, you don't learn by reading the pages of a textbook in a totally random order, do you? Why should a machine learning model be subjected to this added difficulty in learning? Let the machine learn with an ordered curriculum of data. This usually boosts the model's performance over the shotgun approach by 5%-10%. Not bad!
(*)NOTE: This actually worsens the raw score of the model, because it only trains and tests on hard to learn examples, instead of giving the model an easy A. But honestly, if you are getting 98% on your test and/or training set and it doesn't actually work correctly in the real world, you really need to reconsider your machine learning strategy. ;)
Gaussian noise (static) is mixed into the pre-existing audio recordings. This makes the model more robust and helps it generalize.
A user can use other noisy data sets (e.g. pdsounds) to mix background noise into existing audio files, further ensuring a robust model that can wake up even in noisy environments.
Precise requires Python 3.7 (for tensorflow 1.13 support)
The system dependencies are installed with apt-get install (setup.sh will install them for Ubuntu):
./setup.sh
source .venv/bin/activate
pip install -r requirements_data_prep.txt --force-reinstall
(there currently seems to be an issue with some of the requirements from the original Precise not working with current versions of certain packages)
pip install -r TTS_generator_requirements.txt
docker build -t precise-wakeword-model-maker .
docker pull bartmoss/precise-wakeword-model-maker
docker run -it \
-v "local_directory_for_model_output:/app/out" \
-v "local_collected_audio_directory:/data" \
-v "local_directory_path_for_config/:/app/config" \
bartmoss/precise-wakeword-model-maker
Configure config/data_prep_user_configuration.json with the paths:
audio_source_directory: the main directory for the recordings from wakeword_recorder
wakeword_model_name: the name you want to give the wakeword model
pdsounds_directory: the directory of the pdsounds mp3 (or wav) files
extra_audio_directories_to_process: all of the extra audio data sets you have downloaded besides pdsounds (see Data below)
Configure config/TTS_wakeword_config.json with your wakeword and the individual syllables of your wakeword.
Configure config/TTS_engine_config.json with your TTS settings. By default the larynx_host is null, which uses the server from Neon AI. You can also run Larynx yourself and set larynx_host to the correct host and port (e.g. http://127.0.0.1:5002).
Note: don't forget to activate your venv with source .venv/bin/activate
Run python data_prep to start the Precise Wakeword Model Maker, or run it from the command line with arguments:
-h or --help
-t or --tts-generation
-b or --base-model
-g or --generate-data
-e or --generate-extra
-a or --all (5. Do it all). Just make sure you know: it will take A LONG time to run everything.
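The flags above map naturally onto Python's argparse module. The following is only a sketch of how such a command line could be wired up, not the project's actual parser: build_parser is a hypothetical helper and the help strings are guesses based on the flag names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the project's code)."""
    # argparse adds -h/--help automatically.
    parser = argparse.ArgumentParser(prog="data_prep")
    # The help strings below are assumptions inferred from the flag names.
    parser.add_argument("-t", "--tts-generation", action="store_true",
                        help="generate TTS wakeword data")
    parser.add_argument("-b", "--base-model", action="store_true",
                        help="split the data and train a base model")
    parser.add_argument("-g", "--generate-data", action="store_true",
                        help="generate extra data by mixing in noise")
    parser.add_argument("-e", "--generate-extra", action="store_true",
                        help="process the extra downloaded data sets")
    parser.add_argument("-a", "--all", action="store_true",
                        help="run every step (takes a long time)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--tts-generation", "--base-model"])
print(vars(args))
```

Using store_true flags lets several steps be combined in one invocation, which seems to fit the step-by-step workflow described here.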
The wakeword and wakeword syllables in config/TTS_wakeword_config.json are used to scrape the TTS voices in config/TTS_engine_config.json. The results will be in out/TTS_generated_converted/.
There are three types of resulting files:
out/TTS_generated_converted/wake-word/TTS/
out/TTS_generated_converted/not-wake-word/TTS/
The syllables and sequential permutations are vital to ensure that the model doesn't get lazy and focus on only parts of the wakeword instead of the whole wakeword.
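What exactly counts as a "sequential permutation" is not spelled out here. One plausible reading, assumed in the sketch below, is every contiguous run of syllables that is shorter than the full wakeword. The function name and the example syllables are hypothetical, and the project's own generator may work differently.

```python
from typing import List


def partial_sequences(syllables: List[str]) -> List[str]:
    """Return every contiguous run of syllables shorter than the whole wakeword.

    Assumption: this is one interpretation of "sequential permutations".
    """
    n = len(syllables)
    runs = []
    for length in range(1, n):  # 1 .. n-1 syllables, never all n
        for start in range(0, n - length + 1):
            runs.append(" ".join(syllables[start:start + length]))
    return runs


# Hypothetical three-syllable wakeword.
print(partial_sequences(["hey", "my", "croft"]))
# → ['hey', 'my', 'croft', 'hey my', 'my croft']
```

Each of these partial phrases would be a not-wakeword sample, so the model is penalized for waking up on a fragment.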
IMPORTANT: check each wakeword file in out/TTS_generated_converted/wake-word/TTS/ and discard any samples where the wakeword is mispronounced before moving on to any other steps.
For effective machine learning, we need a good training and test set. This step uses the audio collected from audio_source_directory (set in config/data_prep_user_configuration.json) and the audio generated by TTS (see above) to create 10 different distributions between the test and training set. It then trains an experimental model for each and finally keeps the one with the lowest loss (the model with the highest training set accuracy), renaming the model and its directory of data to your wakeword_model_name from config/data_prep_user_configuration.json: out/wakeword_model_name/.
The experimental directories and models are temporarily stored in out/ as experiment_n, where n is the number of the experiment.
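The selection loop described above can be sketched in a few lines. This is an illustration under assumptions, not the project's code: train_and_score is a hypothetical callback that trains a throwaway model on one split and returns its loss, and the seed-per-experiment scheme is just one simple way to get 10 different distributions.

```python
import random
from typing import Callable, List, Sequence, Tuple

Split = Tuple[List[str], List[str]]  # (training files, test files)


def random_split(files: Sequence[str], train_fraction: float, seed: int) -> Split:
    """Shuffle the files with a fixed seed and cut them into train/test."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def best_split(files: Sequence[str],
               train_and_score: Callable[[List[str], List[str]], float],
               experiments: int = 10,
               train_fraction: float = 0.8) -> Tuple[int, Split, float]:
    """Try several splits and keep the one whose experimental model has the lowest loss.

    train_and_score is a hypothetical callback (lower return value is better).
    Returns (experiment number, split, loss) of the winner.
    """
    if experiments < 1:
        raise ValueError("at least one experiment is required")
    best = None
    for n in range(experiments):
        split = random_split(files, train_fraction, seed=n)
        loss = train_and_score(*split)
        if best is None or loss < best[2]:
            best = (n, split, loss)
    return best
```

The winning experiment number would correspond to the experiment_n directory that gets renamed to the final model name.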
The data is split in different ways, depending on the kind of data. This can be configured in config/data_prep_system_configuration.json. Unless you collect data with something other than the Wakeword Data Collector, these settings should work fine.
random_split_directories: 80/20% totally randomly
even_odd_split_directories: 50/50% even-odd splitting
three_four_split_directories: 3/4th splitting
The TTS generated data is split 80/20%.
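The two deterministic strategies can be illustrated as follows. This is a sketch under assumptions: the text does not say which side of each split goes to training, nor how files are ordered, so the code assumes files are sorted by name, even positions and the first three of every four go to training, and the rest go to test. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Split = Tuple[List[str], List[str]]  # (training files, test files)


def even_odd_split(files: Sequence[str]) -> Split:
    """50/50: even positions go to training, odd positions to test (assumed)."""
    ordered = sorted(files)
    return ordered[0::2], ordered[1::2]


def three_four_split(files: Sequence[str]) -> Split:
    """3/4: every fourth file goes to test, the other three to training (assumed)."""
    ordered = sorted(files)
    train = [f for i, f in enumerate(ordered) if i % 4 != 3]
    test = [f for i, f in enumerate(ordered) if i % 4 == 3]
    return train, test


# Hypothetical file names, numbered in recording order.
recordings = [f"rec_{i:02d}.wav" for i in range(8)]
print(even_odd_split(recordings))
print(three_four_split(recordings))
```

Deterministic splits like these keep consecutively recorded, near-identical samples from all landing on the same side, which may be why they are used for some directories.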
Finally, the model will be incrementally trained to find false positives from the random recordings (e.g. TV and natural conversations) in audio_source_directory/random/user_collected/ (where audio_source_directory is configured in config/data_prep_user_configuration.json) and benchmarked.
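The core of finding false positives is a simple filter over the not-wakeword recordings. The sketch below assumes a hypothetical predict callback that returns the model's wakeword probability for one file, and an assumed activation threshold of 0.5. It does not use Precise itself.

```python
from typing import Callable, Iterable, List


def find_false_positives(not_wakeword_files: Iterable[str],
                         predict: Callable[[str], float],
                         threshold: float = 0.5) -> List[str]:
    """Return the not-wakeword files that the model wrongly accepts.

    predict is a hypothetical callback returning the model's wakeword
    probability (0.0 to 1.0) for one audio file.
    """
    return [path for path in not_wakeword_files if predict(path) >= threshold]


# Made-up scores standing in for a real model.
fake_scores = {"tv_01.wav": 0.91, "talk_02.wav": 0.12, "tv_03.wav": 0.50}
print(find_false_positives(fake_scores, fake_scores.get))
```

Only the files returned here would be added to the training and test sets, so each round of training concentrates on the examples the model currently gets wrong.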
Gaussian and background noise (e.g. pdsounds) are mixed into the audio files to produce further audio files.
The lists of directories for both are in config/data_prep_system_configuration.json.
The background noise is taken from the pdsounds_directory in config/data_prep_user_configuration.json; each file mixed produces 5 files with random portions of audio mixed into the background. The source_directories are where the files are temporarily generated and the destination_directories are where they are added into the model's data directories. This uses Precise's precise-add-noise feature.
Finally, the model is trained on this data and benchmarked.
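The project relies on Precise's precise-add-noise for this step. The sketch below does not call that tool; it only illustrates the two ideas on plain lists of float samples in the range -1.0 to 1.0. The function names, the sigma and gain parameters and their values are assumptions; only the count of 5 variants per file comes from the text.

```python
import random
from typing import List, Sequence


def clip(value: float) -> float:
    """Keep a sample inside the valid [-1.0, 1.0] range."""
    return max(-1.0, min(1.0, value))


def add_gaussian_noise(samples: Sequence[float], sigma: float,
                       rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise (static) to every sample."""
    return [clip(s + rng.gauss(0.0, sigma)) for s in samples]


def mix_background(samples: Sequence[float], background: Sequence[float],
                   gain: float, rng: random.Random) -> List[float]:
    """Mix a randomly chosen portion of a background recording under the samples."""
    if len(background) < len(samples):
        raise ValueError("background must be at least as long as the samples")
    start = rng.randrange(len(background) - len(samples) + 1)
    portion = background[start:start + len(samples)]
    return [clip(s + gain * b) for s, b in zip(samples, portion)]


def noisy_variants(samples: Sequence[float], background: Sequence[float],
                   copies: int = 5, gain: float = 0.3,
                   seed: int = 0) -> List[List[float]]:
    """Produce several variants, each with a different random background portion."""
    rng = random.Random(seed)
    return [mix_background(samples, background, gain, rng) for _ in range(copies)]
```

Because every variant draws a different portion of the background recording, one source file yields several distinct training examples.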
Although a lot of training and testing has gone on by now, the model has not yet reached production quality. It is very important to incrementally test and train it on as much not-wakeword data as possible to find potential false wake ups.
You should download at least one very large data set (at least 50,000 random utterances of many people speaking into different mics), such as Common Voice. This data set can be in mp3 or wav format; all non-user-collected data sets are automatically converted (from mp3 or even wav) to wav with a sample rate of 16000. Please read Data below for more information about these data sets and where to download them.
These data sets can be added to config/data_prep_user_configuration.json, where extra_audio_directories_to_process is the list of the directories where the data set sources are (it is important to point directly at the directories where the mp3 or wav files can be found) and extra_audio_directories_labels are the labels (sub directories) they will be stored in (e.g. non-utterances, utterances, etc.) in out/wakeword_model_name/random/. Each directory must have a label.
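Since each directory must have a label, it can help to check the two settings against each other before a long run. The sketch below assumes both keys hold JSON arrays that are matched by position. The directory paths are made up, and directory_labels is a hypothetical helper rather than part of the project.

```python
import json
from typing import Dict, List


def directory_labels(config: Dict) -> Dict[str, str]:
    """Pair each extra audio directory with its label, failing on a mismatch."""
    directories: List[str] = config["extra_audio_directories_to_process"]
    labels: List[str] = config["extra_audio_directories_labels"]
    if len(directories) != len(labels):
        raise ValueError(
            f"{len(directories)} directories but {len(labels)} labels: "
            "each directory must have a label"
        )
    return dict(zip(directories, labels))


# Made-up fragment of config/data_prep_user_configuration.json.
example = json.loads("""
{
  "extra_audio_directories_to_process": ["/data/common_voice/clips", "/data/noise"],
  "extra_audio_directories_labels": ["utterances", "non-utterances"]
}
""")
print(directory_labels(example))
```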
You can do it all!
Always know your escape route.
It is important to note that downloading a lot of data is vital to producing a bulletproof wakeword model. In addition, note that data prep does not walk through sub directories of sound files; it only processes the top level directory, so it is best to just dump audio files there. The files can be in mp3 or wav format; data prep will convert them to wav with a sample rate of 16000.
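Because only the top level directory is processed, a quick check for audio files hiding in sub directories can save a wasted run. The sketch below uses only pathlib from the standard library; both helper names are hypothetical and not part of data prep.

```python
from pathlib import Path
from typing import List

AUDIO_SUFFIXES = {".mp3", ".wav"}


def top_level_audio_files(directory: str) -> List[Path]:
    """List mp3/wav files directly inside directory, ignoring sub directories."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def nested_audio_files(directory: str) -> List[Path]:
    """List mp3/wav files inside sub directories, which would be skipped."""
    root = Path(directory)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES and p.parent != root
    )
```

If nested_audio_files returns anything for a data set directory, those files would need to be moved up to the top level (or the configured path pointed at the sub directory) to be picked up.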
The resulting model will be a TensorFlow 1.13 Precise wakeword model. It can easily be run with precise-listen wakeword_model_name.net, configured to run in Mycroft, or even converted to TensorFlow Lite and run by the TensorFlow Lite runner.
Although Secret Sauce AI is always about collaboration and community, special thanks should go to Joshua Ferguson for doing so much testing and code refactoring. We also extend a very warm thanks to the folks over at Mycroft, without whom there would be no FOSS tensorflow wakeword engine.
-Bartmoss