While deepspeech can do a good job of converting audio to text (especially after normalization), it could do even better. An interface needs to be designed that allows users to teach it new words, or accents and pronunciations specific to the end user.
One method could be a sort of wizard: that is, the client guides the user through saying a series of words and phrases, and these canned recordings are then used to train the engine.
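A minimal sketch of what the wizard's output might look like is below. None of these types exist in Kakaia today; the names, fields, and the assumption that sessions are serialized with serde for upload are hypothetical.

```rust
use serde::{Deserialize, Serialize};

/// One canned phrase the wizard prompts the user to read aloud,
/// paired with the audio the client recorded.
#[derive(Serialize, Deserialize)]
struct TrainingSample {
    /// The exact text the user was asked to say.
    prompt: String,
    /// The recording, e.g. base64-encoded WAV bytes.
    audio: String,
}

/// A completed wizard run, uploaded to the server for training.
#[derive(Serialize, Deserialize)]
struct TrainingSession {
    /// Identifies the end user, so accent/pronunciation data stays per-user.
    user: String,
    samples: Vec<TrainingSample>,
}

/// The canned words and phrases the wizard walks through (illustrative only).
fn wizard_prompts() -> Vec<&'static str> {
    vec![
        "Kakaia, set a timer for ten minutes",
        "Kakaia, what time is it",
        "Kakaia, turn off the lights",
    ]
}
```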
Another method could be a way for a client to flag incorrect audio-to-text conversions, allowing someone server-side to review them and manually improve the engine (for example, by using that audio file for training).
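As a rough illustration, the flag could be a small request payload like the following. The struct, its field names, and the base64-encoded audio are assumptions made for this sketch, not an existing API.

```rust
use serde::{Deserialize, Serialize};

/// Sent by a client when the engine got a conversion wrong, so someone
/// server-side can review it and optionally feed the audio back into training.
#[derive(Serialize, Deserialize)]
struct TranscriptionFlag {
    /// The audio that was misrecognized, e.g. base64-encoded WAV bytes.
    audio: String,
    /// The text the engine produced.
    recognized_text: String,
    /// An optional correction supplied by the user, e.g. "Kakaia".
    corrected_text: Option<String>,
}
```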
A specific use case would be teaching the engine names that are not part of its default dataset, such as "Kakaia".