Goal
A useful testing methodology, and supporting tools, that help developers (and/or users) evaluate, compare, and most importantly improve the accuracy and performance of speech recognition for:
models, and
grammar modules.
The keyword useful is intended to mean specific and detailed enough to guide developers (and/or users) to changes they can make that will improve speech recognition accuracy and performance for the grammar and types of commands they want to use.
Background
To date, the testing of models and grammar modules for speech recognition accuracy has almost exclusively been done:
manually - time-consuming
without objective or quantified metrics - imprecise
The primary function of the type of testing described in this issue is to:
assist developers in future model development, e.g.
is a tailored model required?
where does the current model fail?
what data should a tailored model be trained on?
assist developers and/or users in evaluating the impact of changes to grammar modules, e.g.
selection of words/phrases for recognition within commands
what level of performance should they expect from the model + grammar module?
record and evaluate the level of performance of the model + grammar module with their own speech and setup
Please check with @jwebmeister what the destination branch should be for pull requests. Dev-only (not useful to users) features should not be pulled into the main branch at this time.
Note: Development and training of models (to improve speech accuracy) is closely related, but will be primarily tracked by other issue(s). Additionally for info, the modeldev branch is primarily used for features specifically related to model development that won't be packaged into releases to users.
Note: "Recorded audio" mentioned below is (and should be) local only, i.e. @jwebmeister is recording and using his own audio. For now - a big no to any collection + transfer of user audio data; and for now and in the future - never without explicit permission from users.
Possible approaches
Some approaches that could be considered (not an exhaustive list):
Recorded audio test dataset - record and store audio files of spoken commands / phrases. Use these to test the accuracy of models and/or grammar modules.
Computer-generated audio test dataset - generate audio files of spoken commands / phrases using text-to-speech. Potential for varying auditory qualities (pitch, volume, speed), as well as substituting words or phrases with equivalent ones. See the first sketch after this list.
Calculate phonetic similarity - measure, compare, and rank how phonetically similar commands are within a grammar module. Identify commands with high similarity to guide developers and/or users to potentially choose different phrases (if recognition accuracy is an issue). See the second sketch after this list.
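As a rough illustration of the computer-generated audio approach, here is a minimal sketch using the offline pyttsx3 TTS engine. The command phrases, rates, and volumes are hypothetical placeholders; pyttsx3 exposes rate and volume but not pitch, so pitch variation would need a different engine or audio post-processing.

```python
# Minimal sketch: synthesise varied audio files for test phrases with pyttsx3.
# Phrases, rates, and volumes below are hypothetical placeholders.
import itertools
import pyttsx3

phrases = ["breach and clear", "open the door", "fall in"]  # hypothetical commands
rates = [150, 200, 250]   # speaking rate, words per minute
volumes = [0.6, 1.0]      # volume, 0.0 to 1.0

engine = pyttsx3.init()
for phrase, rate, volume in itertools.product(phrases, rates, volumes):
    engine.setProperty("rate", rate)
    engine.setProperty("volume", volume)
    out = f"tts_{phrase.replace(' ', '_')}_r{rate}_v{int(volume * 100)}.wav"
    engine.save_to_file(phrase, out)
    engine.runAndWait()  # flush the queue so each file uses these settings
```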
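And a minimal sketch of the phonetic similarity idea, assuming the jellyfish library (Metaphone encoding plus Levenshtein distance). The scoring heuristic and example commands are assumptions for illustration, not the project's method.

```python
# Minimal sketch: rank command pairs in a grammar module by phonetic similarity.
# The similarity heuristic and example commands are assumptions for illustration.
from itertools import combinations
import jellyfish

commands = ["breach and clear", "breach and bang", "open the door", "hold on me"]

def phonetic_similarity(a: str, b: str) -> float:
    # Encode each phrase with Metaphone, then compare the encodings:
    # 1.0 means identical phonetic codes, 0.0 means entirely different.
    ka, kb = jellyfish.metaphone(a), jellyfish.metaphone(b)
    return 1.0 - jellyfish.levenshtein_distance(ka, kb) / max(len(ka), len(kb), 1)

# Highest-scoring pairs are the most confusable and may warrant rewording.
for score, a, b in sorted(
    ((phonetic_similarity(a, b), a, b) for a, b in combinations(commands, 2)),
    reverse=True,
):
    print(f"{score:.2f}  {a!r} <-> {b!r}")
```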
It is likely more than one approach should be pursued and developed.
Tracking
This issue is a high-level tracker of all work related to testing methods and tools for speech recognition accuracy and performance.
Tools
[x] Establish process for recording user audio and metadata for training and test datasets (e.g. using Dragonfly, within Tacspeak) - @jwebmeister
[x] Develop process or tool for reviewing and cleaning audio + metadata (e.g. edit retain.tsv + powershell scripts, or modify tacspeak + dragonfly, or modify jwebmeister/speech-training-recorder, or use the more advanced library lhotse-speech/lhotse) - @jwebmeister
[x] Establish process for generating computer-generated speech from text (e.g. using Dragonfly within Tacspeak, or using more advanced TTS models if it's not used in real time and not performance-constrained) - @jwebmeister
[ ] Develop phonetic similarity calculation and comparison methods, heuristic for grammar module accuracy
[ ] Develop tool for visualising and reviewing model generated grammar (.fst) files, heuristic for grammar module accuracy (e.g. this commit)
[x] Develop tool and process for testing "general" (i.e. dictation) model accuracy - @jwebmeister
[x] Develop tool and process for testing "specific" (i.e. command) model + grammar module accuracy - @jwebmeister
Ready or Not - test case for testing methodology + tools
[x] Recorded user audio, training & test datasets - @jwebmeister
[x] Generated computer-generated audio, training & test datasets - @jwebmeister
[x] Analysed "dictation" base model using the recorded audio test dataset - @jwebmeister
[x] Analysed "command" base model + grammar module using the recorded audio test dataset - @jwebmeister
[x] Analysed "dictation" base model using the computer-generated audio test dataset - @jwebmeister
[x] Analysed "command" base model + grammar module using the computer-generated audio test dataset - @jwebmeister
[ ] Analysed grammar module for phonetically similar commands
Status:
the more important testing tools have been merged in #22,
there's a half-implemented visualisation tool in a previous commit (removed due to file size; it could be made optional and only included in the Python / advanced install), and
phonetic similarity calculation is likely of limited usefulness: it depends on too many variables, and users are most likely better off testing grammar module accuracy with real speech.