ogallagher / terry

Terry the virtual secreTERRY

Speech-to-text library #1

Closed ogallagher closed 4 years ago

ogallagher commented 4 years ago

This is a good beginner’s tutorial for kaldi. It goes into a bit more depth than I expect to need, since it covers training a network to map speech to text using custom sound files, which is not what I intend to do. I want to use a pre-trained network built from a sizeable sample of speakers with plenty to say.

This is a list of trained models that can be used with kaldi to transform speech to text, rather than create a new model.

ogallagher commented 4 years ago

Here’s a hopefully straightforward tutorial for using a trained model with Kaldi, specifically aspire, which was trained on the Fisher English dataset with some added noise.

This is a post explaining why one might choose to use the aspire model with kaldi for speech-to-text conversion (basically it has a low error rate), and how to extend that model to support domain-specific words. I hope not to have to extend the vocabulary.

ogallagher commented 4 years ago

DeepSpeech uses sox (Sound eXchange) to resample .wav files to 16 kHz, along with a number of Python libraries. Here are instructions for using the different bindings, which should cover somewhere how to compile them into Windows- and macOS-compatible command-line programs in C++.
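
Since terry runs external tools from Java, here is a minimal sketch of how that resampling step could be wrapped by shelling out to sox via ProcessBuilder. The file names and the 16 kHz / mono / 16-bit target are assumptions based on what deepspeech expects, not confirmed project code.

import java.io.IOException;

public class Resample {
    // Resample a wav file to 16 kHz, mono, 16-bit by shelling out to sox.
    // Assumes sox is installed and available on the PATH.
    public static void resampleTo16k(String inPath, String outPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "sox", inPath,    // input wav
                "-r", "16000",    // target sample rate expected by deepspeech
                "-c", "1",        // mono
                "-b", "16",       // 16-bit samples
                outPath);         // resampled output wav
        pb.inheritIO();           // show sox warnings/errors in the console

        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("sox exited with code " + exitCode);
        }
    }

    public static void main(String[] args) throws Exception {
        // hypothetical file names, for illustration only
        resampleTo16k("audio/owen_raw.wav", "audio/owen_16k.wav");
    }
}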

ogallagher commented 4 years ago

I’ve been able to get through the deepspeech installation instructions for the macOS CLI program, and can now run wav audio files through deepspeech and print a most-likely transcription to the console with a command like:

./deepspeech \
    --model models/output_graph.pbmm \
    --lm models/lm.binary \
    --trie models/trie \
    --stream 320 \
    --audio audio/owen_not_accurate.wav

Getting kaldi to work with the trained aspire model is proving to be more difficult so far.

ogallagher commented 4 years ago

I was finally able to produce an example transcription with the pretrained kaldi aspire model and some test audio. However, the accuracy seems much worse than the deepspeech model’s, the process was more difficult, and it involves more steps that read and write to the filesystem, which makes it slower and harder to automate.

Therefore, between these two options, for my use case, I pick deepspeech.

ogallagher commented 4 years ago

Running the macOS deepspeech CLI command from Java works.
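
For reference, a minimal sketch of that approach using ProcessBuilder and the same flags as the command above (minus --stream, for simplicity); the class and method names are illustrative, not actual terry code.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class DeepSpeechRunner {
    // Run the deepspeech CLI on a 16 kHz wav file and return whatever it prints
    // to stdout, which should be the most-likely transcription.
    // Assumes the deepspeech binary and model files are laid out as in the
    // command shown earlier in this thread.
    public static String transcribe(String wavPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "./deepspeech",
                "--model", "models/output_graph.pbmm",
                "--lm", "models/lm.binary",
                "--trie", "models/trie",
                "--audio", wavPath);
        // let TensorFlow logging on stderr pass through to the console
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);

        Process process = pb.start();
        StringBuilder transcription = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                transcription.append(line).append('\n');
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("deepspeech exited with code " + exitCode);
        }
        return transcription.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transcribe("audio/owen_not_accurate.wav"));
    }
}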