Implement Vosk/Kaldi engine (ASR)

PeterBowman commented 1 year ago

PocketSphinx, the speech recognition engine we have been using for a while, is not giving accurate results. I am going to implement Vosk, an offline, lightweight engine based on Kaldi. I believe (not quite sure) that PocketSphinx tries to recognize individual phonemes encoded in a dictionary (hence we could define our own and include only the commands we need, e.g. the waiter demo used just four), whereas Kaldi performs some kind of inference based on a language model. Vosk supports more than 20 languages; for Spanish there is a small model (~50 MB) and a larger one (~1.4 GB).
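
For small command sets like the one mentioned above, Vosk's Python API accepts an optional third argument to `KaldiRecognizer`: a JSON list of phrases that restricts what can be recognized. As far as I know this only works with the small models. A minimal sketch, where `build_grammar`, `make_recognizer` and the defaults are hypothetical names chosen for illustration, not code from this repository:

```python
import json


def build_grammar(commands):
    """Serialize a list of command phrases into a Vosk grammar string.

    "[unk]" is appended so that out-of-vocabulary speech can be reported
    as unknown instead of being forced onto one of the commands.
    """
    phrases = sorted({c.strip().lower() for c in commands if c.strip()})
    return json.dumps(phrases + ["[unk]"], ensure_ascii=False)


def make_recognizer(commands, lang="es", samplerate=16000):
    # Imported lazily so build_grammar() works without Vosk installed.
    from vosk import Model, KaldiRecognizer

    # Assumption: Model(lang=...) fetches/loads a default model for that
    # language; pass a local model path instead if that is not desired.
    model = Model(lang=lang)
    return KaldiRecognizer(model, samplerate, build_grammar(commands))
```

Duplicates and stray whitespace are removed before serializing, so the same list can be reused as the set of valid commands on the application side.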

This is currently WIP. I have already implemented our IDL-based interface in Python so that it uses live microphone data: https://github.com/roboticslab-uc3m/speech/commit/2dd945939c7142d374912b4a5a5dee130724872c (heavily inspired by this sample app). My aim is to enhance speechRecognition.py so that it selects the desired backend (either PocketSphinx or Vosk/Kaldi). I would also like to replace the current microphone muting/unmuting implementation (through direct calls via ALSA) with Python's sounddevice package (using PortAudio under the hood), for simplicity.
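
For reference, the general shape of a live-microphone loop with Vosk and sounddevice is roughly the following. This is only a sketch of the usual pattern and not the code in the linked commit; `transcribe`, `listen` and the default parameters are hypothetical names and values.

```python
import json
import queue


def transcribe(recognizer, blocks):
    """Yield final transcriptions for an iterable of raw 16-bit PCM blocks.

    `recognizer` is any object exposing Vosk's AcceptWaveform()/Result() pair.
    """
    for block in blocks:
        # AcceptWaveform() returns True once the engine considers the
        # current utterance finished; Result() then returns a JSON string.
        if recognizer.AcceptWaveform(block):
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                yield text


def listen(lang="es", samplerate=16000, blocksize=8000):
    # Imported lazily: both are optional third-party packages.
    import sounddevice as sd
    from vosk import Model, KaldiRecognizer

    audio = queue.Queue()

    def callback(indata, frames, time, status):
        # Runs on the PortAudio thread, so it only hands the data over.
        audio.put(bytes(indata))

    # Assumption: Model(lang=...) resolves a default model for the language.
    recognizer = KaldiRecognizer(Model(lang=lang), samplerate)
    with sd.RawInputStream(samplerate=samplerate, blocksize=blocksize,
                           dtype="int16", channels=1, callback=callback):
        # iter(audio.get, None) blocks on the queue until interrupted.
        for text in transcribe(recognizer, iter(audio.get, None)):
            print(text)
```

Keeping `transcribe` independent of the audio source means the same loop can be fed from a WAV file or a test stub, which also makes it easier to swap the recognizer behind it.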

PeterBowman commented 1 year ago

Out of the scope of this issue, but @jgvictores suggested a few alternative backends we could implement some day:

https://github.com/mozilla/DeepSpeech (sadly, it has looked pretty much stale for a couple of years)
https://github.com/flashlight/flashlight (see their asr module)
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/v1.15.0/asr/intro.html

PeterBowman commented 1 year ago

> I would also like to replace the current microphone muting/unmuting implementation (through direct calls via ALSA) with Python's sounddevice package (using PortAudio under the hood), for simplicity.

Done at https://github.com/roboticslab-uc3m/speech/commit/9bbf69f9afd1238f016b6c138ed174b02d08a92e: the current PocketSphinx backend has been reimplemented. It no longer depends on ALSA, GTK, GI, GStreamer... Muting and unmuting does nothing on the hardware side; it just pauses acquisition from the raw input microphone stream.
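
One way to get this kind of software-only mute is to keep the stream open and simply discard incoming blocks while muted. The sketch below illustrates that idea under stated assumptions; it is not necessarily how the commit does it, and `MutableMicrophone` is a hypothetical name.

```python
import queue


class MutableMicrophone:
    """Software mute: the device keeps capturing, but blocks are dropped."""

    def __init__(self):
        self.blocks = queue.Queue()
        self._muted = False

    def set_mute(self, muted):
        # Nothing is sent to the hardware; only the flag changes.
        self._muted = bool(muted)

    def callback(self, indata, frames, time, status):
        # Same signature as a sounddevice raw stream callback.
        if not self._muted:
            self.blocks.put(bytes(indata))

    def open(self, samplerate=16000, blocksize=8000):
        # Imported lazily so the class can be exercised without PortAudio.
        import sounddevice as sd

        return sd.RawInputStream(samplerate=samplerate, blocksize=blocksize,
                                 dtype="int16", channels=1,
                                 callback=self.callback)
```

A consumer would read from `blocks` as usual; while muted, the queue simply stops growing, so the recognizer sees no new audio.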

PeterBowman commented 1 year ago

Done at https://github.com/roboticslab-uc3m/speech/commit/4114c565e15959b8010a0054046d7f4353ad3b79.

PeterBowman commented 1 year ago

See https://github.com/roboticslab-uc3m/ros-vosk-asr for a catkin-enabled ROS 1 package.

roboticslab-uc3m / speech

Implement Vosk/Kaldi engine (ASR) #32