synesthesiam / voice2json

Command-line tools for speech and intent recognition on Linux
MIT License

voice2json project gone quiet ? any project forum discussion available anymore ? #68

Open OSS542 opened 2 years ago

OSS542 commented 2 years ago

I've just managed to incorporate voice2json into a voice control interface I built for my Linux system. That interface has such capabilities as multiple configurable activation words, conversation capability with configurable inactivity timeouts, and so on. It is capable of using either pocketsphinx or vosk, and to that I have added voice2json as an optional layer. In experimenting with it, I have a number of observations to make regarding voice2json. What is the proper forum for such discussion ? It seems as though the rhasspy.org forum has gone quiet as the last activity there was last December. I also note that the last commit to voice2json was in July of last year, and was wondering if the voice2json project has been abandoned.

synesthesiam commented 2 years ago

Hi @OSS542, the project has definitely not been abandoned! I've since changed jobs, and have been focusing more on Larynx TTS. The voice2json forum has been pretty quiet; I don't have any other place besides here to really discuss things :/

Going forward, my plan is to do some significant upgrades to voice2json with the goal of it forming the core of Rhasspy's next major version. Specifically, I want to create a plugin system so different STT/TTS/NLU engines can be more easily created/maintained by contributors. Then, these same plugins can be used with Rhasspy or other open source voice assistants.

I'd be interested to hear your observations, since it may be time to refactor some of the voice2json commands. I've been working on a hybrid speech to text system that can recognize pre-trained commands (like voice2json does now), but falls back to open transcription when that fails. This is faster to train and more accurate (in my testing) than the mixed-mode training. Any thoughts?
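The fallback idea described above can be sketched roughly as follows. This is only an illustration of the control flow, not the actual implementation: `recognize_command` and `transcribe_open` are hypothetical callables standing in for a closed-vocabulary recognizer and an open transcription engine, and the convention that the first returns `None` on failure is an assumption.

```python
from typing import Callable, Optional, Tuple


def hybrid_transcribe(
    audio: bytes,
    recognize_command: Callable[[bytes], Optional[str]],
    transcribe_open: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Try the pre-trained command recognizer first, then fall back.

    Both callables are hypothetical stand-ins (assumptions, not real
    voice2json APIs). recognize_command is assumed to return None when
    the audio does not match any trained command.
    """
    command = recognize_command(audio)
    if command is not None:
        # The closed-vocabulary recognizer matched a trained sentence.
        return ("command", command)
    # No trained command matched: fall back to open transcription.
    return ("open", transcribe_open(audio))


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any speech engine.
    known = {b"lights-on": "turn on the light"}

    def fake_commands(audio: bytes) -> Optional[str]:
        return known.get(audio)

    def fake_open(audio: bytes) -> str:
        return "free-form text"

    print(hybrid_transcribe(b"lights-on", fake_commands, fake_open))
    print(hybrid_transcribe(b"something-else", fake_commands, fake_open))
```

Returning a label alongside the text lets downstream code decide whether intent recognition is worth attempting on the result.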

OSS542 commented 2 years ago

A few things that I have noticed in using voice2json integrated with my voice recognition system :

The language used for the sentences.ini specification file is very flexible, and very easy and pleasant to work with once one understands exactly what is going on. Some fairly extensive and versatile vocabularies can be readily constructed using it.

It is slightly slower than using pocketsphinx directly (i.e. without voice2json), as might be expected.

That said, I have often noted significant latency in its responses (5-10 sec), particularly on the first invocation of a particular command, or with rarely used commands. It is often considerably faster for more frequently used commands (possibly due to memory caching by Linux?).

Voice2json is generally somewhat less accurate than pocketsphinx directly, though still effective in a low noise environment. I am wondering if this might be due to the use of a larger dictionary with voice2json, as opposed to the smaller generated dictionary and language model files that can be produced for pocketsphinx via http://www.speech.cs.cmu.edu/cgi-bin/tools/lmtool/run. Voice2json does seem to be significantly more sensitive to background noise than using pocketsphinx directly. The significant latency noted above often appears to occur when such noise is present (e.g. white or fan noise). Elimination of the noise source appears to result in expedited processing (flushing?) of pending unrecognized voice commands immediately afterwards.
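One way to check whether the first-invocation latency mentioned above is a cold-start effect is to time the same command several times in a row and compare the first run against the rest. The sketch below is a generic timing helper and makes no assumptions about voice2json itself; the command in the demo is only a placeholder (it runs the Python interpreter) and would need to be replaced with whatever command line invokes the recognizer on a test recording.

```python
import subprocess
import sys
import time
from typing import List, Sequence


def time_command(cmd: Sequence[str], runs: int = 5) -> List[float]:
    """Run cmd `runs` times and return the wall-clock seconds of each run.

    The first entry approximates a cold start; later entries benefit
    from whatever the operating system has cached in the meantime.
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard output so that printing does not affect the timing.
        subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Placeholder command (an assumption): substitute the real
    # recognizer invocation here.
    placeholder = [sys.executable, "-c", "pass"]
    for i, seconds in enumerate(time_command(placeholder, runs=3), start=1):
        print(f"run {i}: {seconds:.3f} s")
```

If only the first run is slow, caching is a plausible explanation; if the latency tracks background noise instead, the timings should stay high across repeated runs on a noisy recording.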

I'd be very interested in trying out the open transcription when that is ready. Please keep me posted.

synesthesiam commented 2 years ago

Out of curiosity, why are you using pocketsphinx instead of the Kaldi models?

OSS542 commented 2 years ago

I tried using Kaldi via Vosk before I started working with voice2json. Vosk was always considerably slower, requiring 2-3 seconds to respond to anything, even with the smallest models. Vosk did have excellent dictation capabilities, but the models required for good dictation were very large (about 1 or 1.5 GB) and slow to load. I have not tried using Kaldi with voice2json.

OSS542 commented 2 years ago

I should also note that Kaldi uses CUDA but not OpenCL. CUDA is for NVIDIA GPUs only as far as I know. Being unable to use a GPU might tend to slow things down considerably.