ogallagher / terry

Terry the virtual secreTERRY

Speech-to-text library #1

Closed ogallagher closed 4 years ago

ogallagher commented 4 years ago

This is a good beginner’s tutorial for kaldi. It goes into a bit more depth than I expect to need, since it covers training a network to map speech to text using custom sound files, which is not what I intend to do. I want to use a pre-trained network built from a sizeable sample of speakers with plenty to say.

This is a list of trained models that can be used with kaldi to transform speech to text, rather than create a new model.

ogallagher commented 4 years ago

Here’s a hopefully straightforward tutorial for using a trained model with Kaldi, specifically aspire, which was trained on the Fisher English dataset with some added noise.

This is a post explaining why one might choose to use the aspire model with kaldi for speech-to-text conversion (basically it has a low error rate), and how to extend that model to support domain-specific words. I hope not to have to extend the vocabulary.

ogallagher commented 4 years ago

DeepSpeech uses sox (Sound eXchange) to resample .wav files to 16 kHz, along with a number of Python libraries. Here are instructions for using the different bindings, which should cover somewhere how to compile them into Windows- and macOS-compatible command-line programs in C++.
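
Since terry runs external tools from Java, here is a minimal sketch of how that resampling step could be wrapped by shelling out to sox via ProcessBuilder. The file names and the 16 kHz / mono / 16-bit target are assumptions based on what deepspeech expects, not confirmed project code.

import java.io.IOException;

public class Resample {
    // Resample a wav file to 16 kHz, mono, 16-bit by shelling out to sox.
    // Assumes sox is installed and available on the PATH.
    public static void resampleTo16k(String inPath, String outPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "sox", inPath,    // input wav
                "-r", "16000",    // target sample rate expected by deepspeech
                "-c", "1",        // mono
                "-b", "16",       // 16-bit samples
                outPath);         // resampled output wav
        pb.inheritIO();           // show sox warnings/errors in the console

        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("sox exited with code " + exitCode);
        }
    }

    public static void main(String[] args) throws Exception {
        // hypothetical file names, for illustration only
        resampleTo16k("audio/owen_raw.wav", "audio/owen_16k.wav");
    }
}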

ogallagher commented 4 years ago

I’ve been able to get through the deepspeech installation instructions for the macOS CLI program, and can now run wav audio files through deepspeech and print a most-likely transcription to the console with a command like:

./deepspeech \
    --model models/output_graph.pbmm \
    --lm models/lm.binary \
    --trie models/trie \
    --stream 320 \
    --audio audio/owen_not_accurate.wav

Getting kaldi to work with the trained aspire model is proving to be more difficult so far.

ogallagher commented 4 years ago

I was finally able to produce an example transcription with the pretrained kaldi aspire model and some test audio. However, the accuracy seems much worse than the deepspeech model’s, the process was more difficult, and it involves more steps that read and write to the filesystem, which makes it slower and harder to automate.

Therefore, between these two options, for my use case, I pick deepspeech.

ogallagher commented 4 years ago

Running the macOS deepspeech CLI command from Java works.
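
For reference, a minimal sketch of that approach using ProcessBuilder and the same flags as the command above (minus --stream, for simplicity); the class and method names are illustrative, not actual terry code.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class DeepSpeechRunner {
    // Run the deepspeech CLI on a 16 kHz wav file and return whatever it prints
    // to stdout, which should be the most-likely transcription.
    // Assumes the deepspeech binary and model files are laid out as in the
    // command shown earlier in this thread.
    public static String transcribe(String wavPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "./deepspeech",
                "--model", "models/output_graph.pbmm",
                "--lm", "models/lm.binary",
                "--trie", "models/trie",
                "--audio", wavPath);
        // let TensorFlow logging on stderr pass through to the console
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);

        Process process = pb.start();
        StringBuilder transcription = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                transcription.append(line).append('\n');
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("deepspeech exited with code " + exitCode);
        }
        return transcription.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transcribe("audio/owen_not_accurate.wav"));
    }
}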