kahrendt / microWakeWord

A TensorFlow-based wake word detection training framework that uses synthetic sample generation and is suitable for certain microcontrollers.
Apache License 2.0

Add all-in-one training notebook #2

Open kahrendt opened 5 months ago

kahrendt commented 5 months ago

Add a notebook that downloads various data sets, generates samples, augments samples, generates features, trains, and downloads the quantized streaming model.

sammcj commented 4 months ago

This would be really nice!

openWakeWord has a great notebook available on colab - https://colab.research.google.com/drive/1q1oe2zOyZp7UsB3jJiQ1IFn8z5YfjwEb?usp=sharing#scrollTo=1cbqBebHXjFD

I was easily able to use it to train a model for "hey ollama" without any issues, but when trying to train microwakeword I keep hitting problems.

For example, it assumes I have files such as:

/work/microWakeWord/training_data/alexa_4990ms_spectrogram/generated_positive

But it's not clear where to get those / how to generate them.

kahrendt commented 4 months ago

My end goal is to have a notebook very similar to openWakeWord's. I am not sure microWakeWord will ever be suitable for running in Colab, as it requires longer training times than what is available for free from Google. However, a decent home computer should be able to handle the load given enough time. I've provided the current notebook as an example of how to set up the training environment, but as you point out, it assumes you already have generated features to work with.

PR #14 made progress towards this goal, but it still has a ways to go. Here is the current to do list to close this issue:

sammcj commented 4 months ago

Thank you so much for taking the time to respond and for updated notebooks, I'll check them out!

sammcj commented 4 months ago

I got a lot further with the feature_generation notebook.

A couple of things I noticed:

  1. Downloading the FMA dataset with streaming seemed to be broken; my fix was:
# Imports assumed by this snippet
import os

import datasets
import numpy as np
import scipy.io.wavfile
from tqdm import tqdm

output_dir = "./fma"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
    # Load the FMA dataset without streaming to work around the issue
    fma_dataset = datasets.load_dataset("rudraml/fma", name="small", split="train", trust_remote_code=True)
    fma_dataset = fma_dataset.cast_column("audio", datasets.Audio(sampling_rate=16000))

    n_hours = 1  # use only 1 hour of clips for this example notebook; recommend increasing for full-scale training
    processed = 0
    for row in tqdm(fma_dataset):
        if processed >= n_hours * 3600 // 30:  # FMA small clips are roughly 30 s each
            break
        name = row['audio']['path'].split('/')[-1].replace(".mp3", ".wav")
        scipy.io.wavfile.write(os.path.join(output_dir, name), 16000, (row['audio']['array'] * 32767).astype(np.int16))
        processed += 1
  2. audio_config['input_path'] = 'generated_samples' <- I'm stuck here; I'm assuming I need to get the training_notebook working first to generate these samples?

--

For the training_notebook, two things:

  1. It's not clear what config['features'] is supposed to be set to, or how these files are generated.
  2. As the notebooks are in a subdirectory, the microWakeWord module is not available; I think the workaround might be to add a !pip install -U git+https://github.com/kahrendt/microWakeWord or a sys.path.append('..') (see the sketch below) 🤔
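
For what it's worth, here is roughly what I have in mind for that workaround: a quick sketch of a first notebook cell, assuming the package itself is importable as microwakeword (only one of the two options should be needed):

# Option 1: install the package straight from GitHub (notebook cell)
# !pip install -U git+https://github.com/kahrendt/microWakeWord

# Option 2: make the repo root importable from the notebooks subdirectory
import sys
sys.path.append('..')  # assumes the notebook is run from a subdirectory of the repo checkout

import microwakeword  # should now resolve without a ModuleNotFoundError
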
kahrendt commented 4 months ago

Thanks for trying it out! The issues you found are all related to things being unfinished (except for the streaming FMA dataset issue, which I will have to look into some more). You do need to create positive and negative samples before using the feature_generation notebook (if you really want to try it, you could use openWakeWord to generate the samples until I add code to do it directly within microWakeWord).

The config['features'] lists all the data sources used in training. Typically, this includes both augmented generated samples and features generated from various datasets. I have thousands of hours of negative audio samples that I generated features for (see data_sources.md for a list) to use as negative data. I am trying to determine a smaller subset of those sources that will still produce usable models. When I have done that, I will upload their features to HuggingFace so people do not have to download all the raw audio data in the first place.
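
As a rough illustration of what goes in that list: each entry points at a directory of precomputed spectrogram features along with weights that control how that source is sampled and penalized during training. The exact field names may differ between versions of the code, so treat this as a sketch rather than a reference:

config["features"] = [
    {   # augmented generated positive samples
        "features_dir": "generated_positive_features",
        "sampling_weight": 1.0,   # how often batches draw from this source
        "penalty_weight": 1.0,    # how heavily mistakes on it are penalized
        "truth": True,            # clips contain the wake word
    },
    {   # features computed from a negative audio dataset
        "features_dir": "speech_background_features",
        "sampling_weight": 6.0,
        "penalty_weight": 1.0,
        "truth": False,           # clips do not contain the wake word
    },
]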

I intend to add a pip install line to the final training script to handle that last issue you identified (I added preliminary package support in PR #11).

sammcj commented 4 months ago

By chance - would it be possible for you to generate a micro wake word model for “hey_ollama”?

Llama is getting really big in the Home Assistant integration community, but I keep seeing videos of people asking Alexa or Nabu and it just doesn’t feel right 🤣

sammcj commented 4 months ago

Otherwise - when I actually get it working I plan on generating quite a few different options and open sourcing them to the community 😄

Here's mine working with OpenWakeWord (on my server):

hey_ollama

kahrendt commented 4 months ago

By chance - would it be possible for you to generate a micro wake word model for “hey_ollama”?

I will add it to the list of words I want to make. I still haven't generalized the process well enough to create these without much effort (which is also why I haven't made an all-in-one script available!), so it takes a lot of manual tuning to get a usable model. So it may take some time before it's ready!

sammcj commented 2 months ago

@kahrendt did you ever get a chance to add a hey_ollama wake word model?

kahrendt commented 2 months ago

Unfortunately, not yet. I'm still struggling to get the training results to consistently yield models that generalize well to real-world voice samples.

berapp commented 1 month ago

In an effort to try to build some more wake words I've been trying to set up a build environment, but I seem to be stuck in dependency hell.

Collecting attrs==19.3.0
  Using cached attrs-19.3.0-py2.py3-none-any.whl (39 kB)
Installing collected packages: attrs
  Attempting uninstall: attrs
    Found existing installation: attrs 23.2.0
    Uninstalling attrs-23.2.0:
      Successfully uninstalled attrs-23.2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jsonschema 4.19.2 requires attrs>=22.2.0, but you have attrs 19.3.0 which is incompatible.
referencing 0.35.1 requires attrs>=22.2.0, but you have attrs 19.3.0 which is incompatible.
Successfully installed attrs-19.3.0

kahrendt commented 1 month ago

In an effort to try to build some more wake words I've been trying to set up a build environment, but I seem to be stuck in dependency hell.

Funny enough, I just encountered this today when setting up the environment from scratch. Fortunately, it doesn't seem to affect the training process. I don't use either of these packages directly, so one of the dependencies must be causing the issue.

A quick update on getting more consistent results when training:

With these modifications, I am consistently getting decent models. The harder part now is determining which model is best during training, as the various metrics on the TTS samples are not perfectly indicative of the model's ability to generalize to real speech samples. The correlation is still decent overall, but two models with essentially the same metrics may have a false rejection rate of 3% vs. 10% at the same false-accepts-per-hour level. I still need to figure out how to better predict which one is best.

I am working on updating the code base to take all of this into account. I will also be adding some additional model architectures that have faster inference times on the ESP32 while using less memory. I don't have a timeline quite yet for these changes, but now that I am getting much more consistent results, progress should go faster.

Omnipius commented 1 month ago

@kahrendt Do you have a dev branch with these modifications and/or a draft of the AIO notebook?

I've been trying to get training up and running on my local system. I've gotten as far as generating and augmenting positive and negative samples for training, test, and validation into memory-mapped files. Training loads my spectrograms, but then fails; I think that's just an issue with the new Keras 3, so dependency tracking is going to be an issue.

What I'm definitely missing are all of the negative datasets that are called out in the training notebook. Is it assumed that we've already downloaded these and converted them to features? Any suggestions on how to recreate the datasets you're using?

kahrendt commented 1 month ago

I've just quickly put this HuggingFace repo together, and I will improve the documentation! This still isn't easy to use, I apologize for that.

Big huge note: you will need to modify the mWW code to use these! To save disk space, I store these datasets in uint16 format, but the training process expects the input data to be float and scaled by a factor of 0.0390625.

I haven't tested the following code changes fully, as my working branch is heavily modified compared to the current state on GitHub. In data.py line 442, add

# Rescale spectrograms that were stored as uint16 (to save disk space) back to float
if np.issubdtype(spectrogram_1.dtype, np.uint16):
    spectrogram_1 = spectrogram_1.astype(np.float32) * 0.0390625

At line 500 and 532, add

# Same rescaling for the other load paths
if np.issubdtype(spectrogram.dtype, np.uint16):
    spectrogram = spectrogram.astype(np.float32) * 0.0390625

I may have missed something here, so I'm sorry if it doesn't work as written! I am working on cleaning up my working branch to update the current code, but that will still take some time.

My latest test runs use the following settings:

  - dinner_party_background: sampling weight 3, penalty weight 1
  - speech_background: sampling weight 6, penalty weight 1
  - no_speech_background: sampling weight 2, penalty weight 1
  - generated positive samples: sampling weight 1, penalty weight 1
  - generated negative samples: sampling weight 1, penalty weight 1

These hyperparameters probably need tweaking!

The rough-testing branch is fairly close to the state of my current setup, but there is a lot of stuff hacked together in it with few comments, and a lot of the notebooks are a mess as well... It does, however, already have the scaling implemented.

Omnipius commented 1 month ago

Thanks. That got my training up and working in a sensible manner. I'll take a look at that branch for some hyperparameter clues. It does seem like this could benefit from some early stopping and dynamic learning rate approaches to speed things up.

What do you consider to be acceptable performance?

BTW: Due to changes in Keras as of TF 2.16, you'll want to either migrate TF ops to Keras 3 or add tf-keras to dependencies and set os.environ['TF_USE_LEGACY_KERAS'] = '1' sometime before anything TF related is imported.
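
Concretely, something like this at the very top of the notebook (before any TF-related import) works, assuming the tf-keras package is installed:

import os

# Must be set before anything TensorFlow-related is imported.
os.environ['TF_USE_LEGACY_KERAS'] = '1'

import tensorflow as tf  # tf.keras now resolves to the legacy Keras 2 implementation (via tf-keras)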

kahrendt commented 1 month ago

Thanks for the insight about the legacy Keras environment variable. I've been staying on TF 2.14, so I haven't looked into it much overall.

The biggest issue is that it is hard to know when to stop. The validation set's accuracy continues to increase and its loss continues to decrease with more training, but the accuracy on real samples decreases (though using small SNRs does reduce, if not eliminate, this). Even then, it is still hard to determine which model is the best. I've been training Alexa models repeatedly as a test case, since the Picovoice benchmark repository has around 300 real samples to work with. In a training run from last week, two checkpoints only 500 iterations apart had essentially the same metrics on the validation set. On the Picovoice benchmark, one had a 20% false rejection rate at 0.5 false accepts per hour, while the other had a 5% false rejection rate at 0.5 false accepts per hour.

My original goal was to match openWakeWord's standard of a 5% false rejection rate at 0.5 false accepts per hour in loud rooms. I would really like it to be a 5% false rejection rate at 0.1 false accepts per hour, but determining the best model makes that challenging. I have achieved this a couple of times with "Alexa", but it required essentially using the real samples as the validation set. I would be more confident in the model if the test set were completely independent of the training process. Unfortunately, most phrases do not have a large number of real examples readily available.

Sorry, I keep rambling here! But I would like to hear if you have any insights on how to better predict which model is best. I have tried the following things, which have helped, but I would love to hear any other suggestions:

  1. Use the Home Assistant Cloud voices to generate samples in different accents (this uses Azure under the hood) and augment them for the validation set. This has helped: the various validation metrics are at least correlated with the real-sample performance (whereas a validation set with Piper-generated samples was essentially random), but it isn't a perfect relationship.
  2. Maximize an AUC measurement. The validation_ambient set gives a rough estimate of how many false accepts per hour there are at any given cutoff. I then compute the recall at each of those cutoffs on the validation set, restrict it to false accepts per hour between 0 and 2.0, and use the area under that curve (rough sketch below). This metric better reflects practical real-world use.
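
The second one looks roughly like this (a simplified sketch; the real implementation works on the streaming validation outputs, and the names here are just for illustration):

import numpy as np

def recall_vs_faph_auc(positive_probs, ambient_probs, ambient_hours, max_faph=2.0):
    # positive_probs: model scores on the augmented validation wake word samples
    # ambient_probs: model scores on windows of the validation_ambient set
    # ambient_hours: total duration of the ambient set, in hours
    cutoffs = np.linspace(0.0, 1.0, 101)
    faph = np.array([(ambient_probs >= c).sum() / ambient_hours for c in cutoffs])
    recall = np.array([(positive_probs >= c).mean() for c in cutoffs])

    # Restrict to the portion of the curve with at most max_faph false accepts per hour.
    mask = faph <= max_faph
    order = np.argsort(faph[mask])
    return np.trapz(recall[mask][order], faph[mask][order])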

Omnipius commented 1 month ago

You're very welcome. That little 'enhancement' caused several days of headache at work about a month back.

Another quirk I came across in augmentation and feature generation is that PySoundfile sometimes trips over itself and fails to load the audio with a 'NoBackend' error. I 'fixed' this by putting a try-except statement in generate_augmented_spectrogram that just calls itself recursively if augmentation fails. That's not a great solution, since there's no protection against an infinite loop if something is genuinely broken, but it got me past that persistent hiccup. It should probably be replaced with something that limits the number of retries to something sensible.
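
For reference, the bounded version would be something like this (sketch only; the real generate_augmented_spectrogram signature may differ):

def generate_augmented_spectrogram_with_retry(clip, max_retries=5):
    # Retry a bounded number of times instead of recursing without limit.
    last_error = None
    for _ in range(max_retries):
        try:
            return generate_augmented_spectrogram(clip)
        except Exception as err:  # e.g. the intermittent 'NoBackend' load failure
            last_error = err
    raise RuntimeError(f"Augmentation failed after {max_retries} attempts") from last_error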

My custom wake word (hey duri) test performance so far is: accuracy = 97.6485%; recall = 96.7269%; precision = 98.5475%; fpr = 1.4283%; fnr = 3.2731%; (N=9951.0) false accepts = 34.0; false accepts per hour = 6.373

Sounds like that is close (fnr < 5%) but no cigar (fp/h >> 0.5). I'll crank up my ambient penalty.

AUC is a good singular metric. You could also use a semi-genetic approach: when you shift down the learning rate, find the checkpoint that minimizes one requirement (say, fp/h) where the other requirement (say, fnr) is already met, and reset the weights to that checkpoint before continuing training at the finer learning rate.
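
In code, that checkpoint selection would be roughly this (made-up names; assumes the metrics logged for each saved checkpoint are collected in a list of dicts):

def pick_restart_checkpoint(history, fnr_target=0.05):
    # Among checkpoints that already meet the false-negative requirement,
    # take the one with the fewest false accepts per hour.
    candidates = [h for h in history if h['fnr'] <= fnr_target]
    if not candidates:
        return min(history, key=lambda h: h['fnr'])  # nothing qualifies yet; fall back to the lowest fnr
    return min(candidates, key=lambda h: h['faph'])

# best = pick_restart_checkpoint(history)
# model.load_weights(best['checkpoint_path'])  # then continue training at the finer learning rate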

I notice in my training that, fairly quickly, none of the metrics show a consistent improvement. So, I think a smaller learning rate is needed much sooner to get consistent behavior.

kahrendt commented 1 month ago

Interesting ideas, I'll have a go at implementing them and see how they work out. I haven't encountered the PySoundFile issue, what OS are you training on?

All the metric outputs are for a probability cutoff of 0.5. If you increase the cutoff, false accepts should go down (and hopefully not reduce the recall too badly). Also, some of the validation/test samples may be extremely challenging (-10 dB SNR with strong reverberation), so in practice, that recall may be sufficient for real-world use. I'll add the recall at various false-accepts-per-hour levels and cutoffs to the final testing output to help determine this. The various metric goals I mentioned are for real-world use; it isn't clear how those numbers correspond to the validation metrics on TTS samples.

You can weight the negative samples more to reduce the false accepts per hour at a cutoff of 0.5, but I am finding that this is unnecessary, and even detrimental to the overall performance of the model, if you increase the weight too much. The ESPHome component configuration easily lets you set the cutoff per model, so that is an easy way to get usable results.

Omnipius commented 1 month ago

I'm training on Ubuntu 22.04 LTS running inside WSL2 on Windows 10. I'm guessing the hiccup might have to do with something trying to go too fast for the VM.

I see what you mean about the cutoff. Maybe it would be best to just do a cutoff sweep to check whether an acceptable setting (fnr < 5% and fp/h < 0.5) exists. If it does, the model is good; report the upper and lower cutoff bounds. Alternatively, plot the sweep and let the user decide what cutoff they like best (see the sketch below).
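
The sweep itself is simple; something like this, given per-cutoff fnr and false-accepts-per-hour arrays (names are placeholders):

import numpy as np

def acceptable_cutoff_range(cutoffs, fnr, faph, fnr_max=0.05, faph_max=0.5):
    # Returns (lower, upper) bounds of the cutoffs meeting both requirements, or None.
    ok = (np.asarray(fnr) <= fnr_max) & (np.asarray(faph) <= faph_max)
    if not ok.any():
        return None  # no cutoff makes this model 'good enough'
    passing = np.asarray(cutoffs)[ok]
    return passing.min(), passing.max()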

Ultimately, nothing beats live user testing. The closest we have to that is the Picovoice benchmark for 'alexa'. So, for a model that meets the requirements on Picovoice, what are the corresponding TTS performance metrics? I think we can use those to come up with TTS performance requirements. I guess that's flipping the script a bit: instead of determining which model is 'best', we determine when a model is 'good enough'.

BTW: How large are your TTS datasets? I'm working with 50,000 for training and 5,000 each for validation and test, for both positive and negative.

kahrendt commented 1 month ago

I haven't yet determined how many TTS samples are needed. I've typically been using 100,000 for training, but I do believe this is excessive. I have recently been using Nabu Casa's cloud TTS samples for validation and testing, but I am only using about 80 different voices for this. To help deal with the smaller sample size, I repeatedly augment them aggressively... this may or may not be enough!
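
By "aggressively", I mean something in the spirit of the following audiomentations chain (the exact transforms and parameter ranges I use differ, and argument names vary a bit between audiomentations versions):

from audiomentations import AddBackgroundNoise, ApplyImpulseResponse, Compose, Gain, PitchShift

augment = Compose([
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Gain(min_gain_db=-10, max_gain_db=6, p=0.8),
    AddBackgroundNoise(sounds_path='background_noise/', min_snr_db=-5, max_snr_db=10, p=1.0),
    ApplyImpulseResponse(ir_path='room_impulse_responses/', p=0.5),
])

# augmented = augment(samples=clip, sample_rate=16000)  # new random draws each time it is applied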

I have made a draft PR that includes many of the changes I have mentioned here, and I hope to merge it in the next week or two as I clean it up and address a few more issues with it. It has a new model architecture that is faster while still being accurate, so I encourage you to try that one in your experiments as well!

Omnipius commented 3 weeks ago

@kahrendt Does that PR include the scaling change you described above?

I have an AIO notebook up and working using that branch. (Let me know if I should create a branch or fork with that notebook.) However, I'm getting both a high FRR (12-20%) and viable cutoff ranges of 0.85-0.98 on both the MixedNet and Inception models. I'm using the complete AudioSet and FMA small datasets, 500 speakers for 100,000 TTS training samples each for positive and negative, 10,000 samples each for validation and test, and separate validation and test TTS sets. Thoughts?

kahrendt commented 2 weeks ago

@kahrendt Does that PR include the scaling change you described above?

I have an AIO notebook up and working using that branch. (Let me know if I should create a branch or fork with that notebook.) However, I'm getting both a high FRR (12-20%) and viable cutoff ranges of 0.85-0.98 on both the MixedNet and Inception models. I'm using the complete AudioSet and FMA small datasets, 500 speakers for 100,000 TTS training samples each for positive and negative, 10,000 samples each for validation and test, and separate validation and test TTS sets. Thoughts?

Yes, the PR implements scaling the uint16_t to the appropriate float equivalent.

It is hard to tell from the TTS samples how the models will perform in real life, especially with challenging background noise augmentations included in the mix. I have been getting viable cutoffs in around that range with my newer models as well. Once you get a model to this stage, the best thing you can do is load it onto an ESP32 and test it out!

I have moved to using a 10 ms step size for the spectrogram feature generation instead of the current 20 ms step size. This has consistently improved accuracy. To avoid running an inference every 10 ms, I now stride the first layer's convolution over 3 spectrogram windows, so inference only runs once every 30 ms. This ESPHome PR implements support for it, but the WIP mWW pull request doesn't contain support for it yet. I will work on updating that and uploading a new set of precomputed features (unfortunately, it requires an entirely new set!).
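
Conceptually, the first layer looks something like this toy example (not the actual MixedNet definition, just the idea of striding over 3 feature windows):

import tensorflow as tf

# Spectrogram frames arrive every 10 ms, but the first convolution strides by 3
# frames along the time axis, so everything after it only runs once per 30 ms.
inputs = tf.keras.Input(shape=(None, 40, 1))  # (time frames, mel features, channels)
x = tf.keras.layers.Conv2D(
    filters=32,
    kernel_size=(3, 3),
    strides=(3, 1),  # stride 3 in time, stride 1 across features
    padding='valid',
)(inputs)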

Omnipius commented 2 weeks ago

Looking forward to that update.

I have been pushing trained models onto my ESP32-S3-BOX-3. However, I've found that in practice I have to yell into the mic array at point-blank range to get it to recognize the wake word. Maybe there's a scaling or gain setting somewhere in the chain that I need to adjust?

sarvasvkulpati commented 2 weeks ago

hey, not sure if this is the right thread to ask this, if I had to train a custom word from scratch as of the latest additions, how would I go about it?

I clicked through the feature_generation notebook in the 2024-06-14-improvements branch, and I'm not sure what to do once it reaches the cell with:

clips = Clips(input_directory='generated_samples/positive/validation', 
              file_pattern='*.mp3', 
              max_clip_duration_s=None,
              remove_silence=True, # HA Cloud TTS samples have extra silence at end, so trim it off first.
              )

It doesn't seem like there was a generated_samples directory created.

gustvao commented 2 weeks ago

hey, not sure if this is the right thread to ask this, if I had to train a custom word from scratch as of the latest additions, how would I go about it?

I clicked through the feature_generation notebook in the 2024-06-14-improvements branch, and I'm not sure what to do once it reaches the cell with:

clips = Clips(input_directory='generated_samples/positive/validation', 
              file_pattern='*.mp3', 
              max_clip_duration_s=None,
              remove_silence=True, # HA Cloud TTS samples have extra silence at end, so trim it off first.
              )

It doesn't seem like there was a generated_samples directory created.

I also used the notebook from that branch and managed to get to the final training phase. I let a paid Colab GPU instance run overnight, but unfortunately my laptop ran out of juice before it finished. My guesstimate was that the training phase alone would take around 8h on the weakest Colab GPU instance; does that make sense @kahrendt?

@sarvasvkulpati, in order to get the samples you should use the openWakeWord wake word training notebook; that piece should give you all the inputs needed for the notebook in that branch.

Once I manage to go through the entire 10h process again and succeed, I can share my notebook with you guys.

Hope it works :) For now I am just using hey jarvis.

kahrendt commented 2 weeks ago

@sarvasvkulpati That bit of code assumes you have a directory full of audio files to convert into the validation set. I have been using TTS samples via Home Assistant's cloud for validation and testing (it is against the terms of service to use them in the training set). The training samples are generated using Piper. You could generate separate validation and test sets using Piper as well.

@gustvao I have to look into it more, but when I have tested this on Colab, it was much slower than I would expect. I haven't run a full training run on Colab, but 8 hours does seem about right at the pace it was going in my initial tests. For comparison, on my M1 Max chip, I can run a full training process in around 2 hours. I believe mWW uses the legacy Adam optimizer, as that is the only one supported on the MacBook chip. It may be faster on Google Colab if you use the modern Adam optimizer.
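
For anyone who wants to try that, the change is roughly this (the optimizer is set inside the training code, so adapt as needed; the legacy optimizer path assumes TF 2.11-2.15 or legacy Keras):

import tensorflow as tf

# What I believe mWW currently uses (the recommended optimizer on Apple Silicon):
optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=0.001)

# On Colab or other non-Mac machines, the newer implementation may be faster:
# optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)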