kahrendt / microWakeWord

A TensorFlow based wake word detection training framework using synthetic sample generation suitable for certain microcontrollers.
Apache License 2.0
257 stars 18 forks source link

microWakeWord

microWakeWord is an open-source wakeword library for detecting custom wake words on low power devices. It produces models that are suitable for using TensorFlow Lite for Microcontrollers. The models are suitable for real-world usage with low false accept and false reject rates.

microWakeword is currently available as a very early release. microWakeWord can generate features and train models. It does not include sample generation or audio augmentations. The training process produces usable models if you manually fine-tune penalty weights.

Benchmarks

Benchmarking and comparing wake word models is challenging. It is hard to account for all the different operating environments. Picovoice has provided one benchmark for at least one point of comparison. For a more rigorous false acceptance metric, we also test on the Dinner Party Corpus dataset.

Okay Nabu

The following graph depicts the false-accept/false-reject rate for the "Okay Nabu" model. Note that the test clips used in the benchmark are created with Piper sample generator, not real voice samples. FPR/FRR curve for "Okay Nabu" pre-trained model

The default parameters (probablity cutoff of 0.5 and average window size of 10) has a false rejection rate of 2% and 0.122 false accepts per hour with the Picovoice benchmark dataset. There are 0.187 false accepts per hour on the Dinner Party Corpus with these settings.

Hey Jarvis

The following graph depicts the false-accept/false-reject rate for the "Hey Jarvis" model. Note that the test clips used in the benchmark are created with Piper sample generator, not real voice samples. FPR/FRR curve for "Hey Jarvis" pre-trained model

The default parameters (probablity cutoff of 0.5 and average window size of 10) has a false rejection rate of 0.67% and 0.081 false accepts per hour with the Picovoice benchmark dataset. There are 0.375 false accepts per hour on the Dinner Party Corpus with these settings.

Alexa

The following graph depicts the false-accept/false-reject rate for the "Alexa" model. The positive samples are real recordings sources from the Picovoice repository. FPR/FRR curve for "Alexa" pre-trained model

The default parameters (probability cutoff of 0.66 and average window size of 10) has a false rejection rate of 3.49% and 0.486 false accepts per hour with the Picovoice benchmark dataset. There are 0.187 false accepts per hour on the Dinner Party Corpus with these settings.

Detection Process

We detect the wake word in two stages. Raw audio data is processed into 40 features every 20 ms. These features construct a spectrogram. The streaming inference model uses the newest slice of feature data as input and returns a probability that the wake word is said. If the model consistently predicts the wake word over multiple windows, then we predict that the wake word has been said.

The first stage processes the raw monochannel audio data at a sample rate of 16 kHz via the micro_speech preprocessor. The preprocessor generates 40 features over 30 ms (the window duration) of audio data. The preprocessor generates these features every 20 ms (the stride duration), so the first 10 ms of audio data is part of the previous window. This process is similar to calculating a Mel spectrogram for the audio data, but it is lightweight for devices with limited processing power. See the linked TFLite Micro example for full details on how the audio is processed.

The streaming model performs inferences every 20 ms on the newest audio stride. The model is based on an inception neural network converted for streaming. Streaming and training the model uses heavily modified open-sourced code from Google Research found in the paper Streaming Keyword Spotting on Mobile Devices by Rykabov, Kononenko, Subrahmanya, Visontai, and Laurenzo.

Model and Training Design Notes

Inception Based Model

Training Process

Model Training Process

We generate positive and negative samples using openWakeWord, which relies on Piper sample generator. We also use openWakeWord's data tools to augment the positive and negative samples. Additional data sources are used for negative data. Currently, microWakeWord does support these steps directly.

Audio samples are converted to features are stored as Ragged Mmaps. Currently, only converting wav audio files are supported and no direct audio augmentations are applied.

from microwakeword.feature_generation import generate_features_for_folder

generate_features_for_folder(path_to_audio='audio_samples/training', features_output_dir='audio_features/training', set_name='audio_samples')

Training configuration options are stored in a yaml file.

# Save a yaml config that controls the training process
import yaml
import os

config = {}

config['train_dir'] = 'trained_models/alexa'

# Each feature_dir should have at least one of the following folders with this structure:
#  training/
#    ragged_mmap_folders_ending_in_mmap
#  testing/
#    ragged_mmap_folders_ending_in_mmap
#  testing_ambient/
#    ragged_mmap_folders_ending_in_mmap
#  validation/
#    ragged_mmap_folders_ending_in_mmap
#  validation_ambient/
#    ragged_mmap_folders_ending_in_mmap
#
#  sampling_weight: Weight for choosing a spectrogram from this set in the batch
#  penalty_weight: Penalizing weight for incorrect predictions from this set
#  truth: Boolean whether this set has positive samples or negative samples
#  truncation_strategy = If spectrograms in the set are longer than necessary for training, how are they truncated
#       - random: choose a random portion of the entire spectrogram - useful for long negative samples
#       - truncate_start: remove the start of the spectrogram
#       - truncate_end: remove the end of the spectrogram
#       - split: Split the longer spectrogram into separate spectrograms offset by 100 ms. Only for ambient sets

config['features'] = [
        {
            'features_dir': '/Volumes/MachineLearning/training_data/alexa_4990ms_spectrogram/generated_positive',
            'sampling_weight': 0.25,
            'penalty_weight': 1,
            'truth': True,
            'truncation_strategy': 'truncate_start'
        },
        {
            'features_dir': '/Volumes/MachineLearning/training_data/alexa_4990ms_spectrogram/generated_negative',
            'sampling_weight': 0.25,
            'penalty_weight': 1,
            'truth': False,
            'truncation_strategy': 'truncate_start'
        },
        {
            'features_dir': '/Volumes/MachineLearning/training_data/english_speech_background_1970ms',
            'sampling_weight': 0.2,
            'penalty_weight': 3,
            'truth': False,
            'truncation_strategy': 'random'
        },
        {
            'features_dir': '/Volumes/MachineLearning/training_data/cv_corpus_background',
            'sampling_weight': 0.10,
            'penalty_weight': 2,
            'truth': False,
            'truncation_strategy': 'random'
        },
        {
            'features_dir': '/Volumes/MachineLearning/training_data/no_speech_background_1970ms',
            'sampling_weight': 0.2,
            'penalty_weight': 3,
            'truth': False,
            'truncation_strategy': 'random'
        },
        {
            'features_dir': '/Volumes/MachineLearning/training_data/ambient_background',
            'sampling_weight': 0.2,
            'penalty_weight': 2,
            'truth': False,
            'truncation_strategy': 'split'
        },
    ]

# Number of training steps in each iteration - various other settings are configured as lists that corresponds to different steps
config['training_steps'] = [20000, 20000, 20000]        

# Penalizing weight for incorrect class predictions - lists that correspond to training steps
config["positive_class_weight"] = [1]               
config["negative_class_weight"] = [1]
config['learning_rates'] = [0.001, 0.0005, 0.00025]     # Learning rates for Adam optimizer - list that corresponds to training steps
config['batch_size'] = 100

config['mix_up_augmentation_prob'] =  [0]       # Probability of applying MixUp augmentation - list that corresponds to training steps
config['freq_mix_augmentation_prob'] = [0]      # Probability of applying FreqMix augmentation - list that corresponds to training steps
config['time_mask_max_size'] = [5]              # SpecAugment - list that corresponds to training steps
config['time_mask_count'] = [2]                 # SpecAugment - list that corresponds to training steps
config['freq_mask_max_size'] = [5]              # SpecAugment - list that corresponds to training steps
config['freq_mask_count'] = [2]                 # SpecAugment - list that corresponds to training steps
config['eval_step_interval'] = 500              # Test the validation sets after every this many steps

config['clip_duration_ms'] = 1490   # Maximum length of wake word that the streaming model will accept
config['window_stride_ms'] = 20     # Fixed setting for default feature generator
config['window_size_ms'] = 30       # Fixed setting for default feature generator
config['sample_rate'] = 16000       # Fixed setting for default feature generator

# The best model weights are chosen first by minimizing the specified minimization metric below the specified target_minimization
# Once the target has been met, it chooses the maximum of the maximization metric. Set 'minimization_metric' to None to only maximize
# Available metrics:
#   - "loss" - cross entropy error on validation set
#   - "accuracy" - accuracy of validation set
#   - "recall" - recall of validation set
#   - "precision" - precision of validation set
#   - "false_positive_rate" - false positive rate of validation set
#   - "false_negative_rate" - false negative rate of validation set
#   - "ambient_false_positives" - count of false positives from the split validation_ambient set
#   - "ambient_false_positives_per_hour" - estimated number of false positives per hour on the split validation_ambient set
config['minimization_metric'] = 'ambient_false_positives_per_hour'  # Set to N
config['target_minimization'] = 0.5
config['maximization_metric'] = 'accuracy'

with open(os.path.join('training_parameters.yaml'), 'w') as file:
    documents = yaml.dump(config, file)

The model's hyperparameters are specified when calling the training script.

!python -m microwakeword.model_train_eval \
--training_config='training_parameters.yaml' \
--train 1 \
--restore_checkpoint 1 \
--test_tf_nonstreaming 0 \
--test_tflite_nonstreaming 0 \
--test_tflite_streaming 0 \
--test_tflite_streaming_quantized 1 \
inception \
--cnn1_filters '32' \
--cnn1_kernel_sizes '5' \
--cnn1_subspectral_groups '4' \
--cnn2_filters1 '24,24,24' \
--cnn2_filters2 '32,64,96' \
--cnn2_kernel_sizes '3,5,5' \
--cnn2_subspectral_groups '1,1,1' \
--cnn2_dilation '1,1,1' \
--dropout 0.8 

Acknowledgements

I am very thankful for many people's support to help improve this! Thank you, in particular, to the following individuals and organizations for providing feedback, collaboration, and developmental support: