I've just watched Mike's piece from the recent State of the Open Home livestream and it got me thinking. This is almost certainly the wrong place to discuss this, so please do redirect me.
The problem
Users' problems:
Inaccuracies in wake word detection or STT are frustrating and degrade the user experience
You can train your own wake word, but the training data is limited to the sample generator
Developers' problems:
Voice data is hard and expensive to collect, clean and train on
There is a wide variety in microphone quality and hardware that is not easily simulated
The variation in voices generated by the sample generator does not match the variation in real users' voices (accents, environments, reverberation, microphones, volumes, etc.)
Proposed solution
A user sets up a local voice pipeline. If they wish to use a custom wake word, they are signposted to a webpage I will call "VoiceTrainer".
At VoiceTrainer they will go through an onboarding process:
Confirm they want to train a custom wake word
A privacy page explains that recordings of their voice will be contributed to open-source databases, that other people may be able to hear samples of their voice, and that the recordings will also be used to train other people's wake words.
A user who refuses is redirected to the old wake word training guide
The user is provided with an endpoint URL that can be added to Home Assistant/Wyoming
The user connects their voice pipeline to the VoiceTrainer endpoint, ideally through an option within the HA or add-on UI.
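Under the hood I imagine that connection as a plain HTTPS API. Here is a minimal sketch of what a single contribution upload might look like; the URL, route, and token handling are all placeholder assumptions, not a real API:

```python
import requests

# Hypothetical VoiceTrainer endpoint -- URL, route and auth scheme are
# placeholders for illustration only.
ENDPOINT = "https://voicetrainer.example.org/api/contributions"

def upload_sample(wav_path: str, wake_word: str, token: str) -> dict:
    """Send one recorded sample to the VoiceTrainer database."""
    with open(wav_path, "rb") as f:
        resp = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {token}"},
            data={"wake_word": wake_word},
            files={"audio": f},
        )
    resp.raise_for_status()
    return resp.json()
```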
The user defines their custom wake word, e.g. "Doris"
A check is run to see whether the wake word already exists in the open database
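The duplicate check could be a single query against the open database. A minimal sketch, again assuming the same hypothetical API (the response shape here is made up):

```python
import requests

BASE_URL = "https://voicetrainer.example.org/api"  # placeholder, as above

def wake_word_exists(phrase: str) -> bool:
    """True if someone has already registered this wake word."""
    resp = requests.get(f"{BASE_URL}/wake-words", params={"q": phrase.lower()})
    resp.raise_for_status()
    return any(entry["phrase"] == phrase.lower() for entry in resp.json())

if wake_word_exists("doris"):
    print("'Doris' already exists -- its crowd-sourced samples can be reused.")
```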
The user is then asked to record several samples of their own wake word on one or more voice satellites and listen back to them
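On the client side, the record-and-review step could be as simple as the following local sketch using the sounddevice and scipy packages; in the real flow the voice satellite or the VoiceTrainer page would do the capturing, and the file name and clip length are my assumptions:

```python
import sounddevice as sd
from scipy.io.wavfile import write

RATE = 16000   # common sample rate for wake word models
SECONDS = 2    # assumed clip length

# Record one sample from the default microphone.
audio = sd.rec(int(SECONDS * RATE), samplerate=RATE, channels=1, dtype="int16")
sd.wait()
write("doris_sample_01.wav", RATE, audio)

# Play it back so the user can accept or re-record it.
sd.play(audio, RATE)
sd.wait()
```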
The user is then asked to record a random selection of X other people's wake words (weighted towards newer wake words; see the weighting sketch below)
These recordings will be used to help other people (re)train their own voice assistants
Recordings of other people's wake words will also serve as negative samples for the user's own wake word model.
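The "weighted towards newer wake words" selection could use inverse-age weights, so that fresh wake words collect their first contributions quickly. A sketch with made-up catalogue entries:

```python
import random
from datetime import datetime, timezone

# Made-up catalogue: (wake word, date it was registered).
catalogue = [
    ("doris",    datetime(2025, 5, 1, tzinfo=timezone.utc)),
    ("jarvis",   datetime(2024, 1, 10, tzinfo=timezone.utc)),
    ("computer", datetime(2023, 7, 4, tzinfo=timezone.utc)),
]

def pick_words_to_record(catalogue, k=3):
    """Pick wake words for the user to record, weighted towards newer ones.

    Sampling is with replacement for simplicity; a real implementation
    would de-duplicate and skip the user's own wake word."""
    now = datetime.now(timezone.utc)
    weights = [1.0 / max((now - created).days, 1) for _, created in catalogue]
    return random.choices([w for w, _ in catalogue], weights=weights, k=k)

print(pick_words_to_record(catalogue))
```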
The user is then asked to record a random passage of text that does not contain wake words (for use as negative wake word samples, and possibly for STT training)
The user is then asked to listen to a small number of other people's contributions to verify their quality and accuracy.
The user is then directed to a Google Colab instance or similar and is provided with an up-to-date database of their voice samples (and everyone else's)
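I don't want to prescribe the training stack, but to make the Colab step concrete, here is a deliberately toy stand-in: pull positive and negative clips from the database snapshot, turn each into a fixed-size MFCC embedding, and fit a linear classifier. A real deployment would use a proper wake word trainer (e.g. openWakeWord's pipeline) instead, and the directory layout below is my assumption about how a snapshot might be organised:

```python
import torch
import torchaudio
from pathlib import Path

# Assumed snapshot layout:
#   samples/doris/positive/*.wav   crowd-sourced recordings of "Doris"
#   samples/doris/negative/*.wav   other wake words and wake-word-free speech
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=20)

def embed(path: Path) -> torch.Tensor:
    """Mean-pool MFCC frames into one fixed-size embedding per clip."""
    waveform, sr = torchaudio.load(str(path))
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    return mfcc(waveform.mean(dim=0)).mean(dim=-1)

def load_split(root: Path):
    xs, ys = [], []
    for label, sub in [(1.0, "positive"), (0.0, "negative")]:
        for wav in (root / sub).glob("*.wav"):
            xs.append(embed(wav))
            ys.append(label)
    return torch.stack(xs), torch.tensor(ys)

x, y = load_split(Path("samples/doris"))
model = torch.nn.Linear(x.shape[1], 1)   # toy detector, not production quality
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = torch.nn.BCEWithLogitsLoss()

for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x).squeeze(-1), y)
    loss.backward()
    opt.step()
```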
Once 100 other users have contributed samples of "Doris", the user is notified and it is suggested that they re-train their model using the new crowd-sourced data.
The model is uploaded and open-sourced
The result
A growing database of wake words with positive and negative samples
A database of open-source STT samples for use in training datasets
A wide variety of voices, on a wide variety of hardware
From a sample population of people who use local voice hardware
An incentive structure so that people help train each other's wake words
Issues
This is a bit of a 'pyramid scheme', in that the earliest contributors may benefit the most, and the last will have no one left to help train their models. This would be fine if there are enough contributors and a steady stream of new users
Building this infrastructure is probably quite a substantial undertaking
Who pays for the training compute if it can't be done within a Google Colab?