[ros_speech_recognition] Add vosk engine

nakane11 commented 1 year ago

This PR adds the Vosk engine to ros_speech_recognition. More engine APIs became available from SpeechRecognition==3.9.0 onward. Vosk is an offline speech recognition toolkit that supports Japanese.

Sample (tested with PR2)

$ wget https://alphacephei.com/vosk/models/vosk-model-small-ja-0.22.zip -P /tmp
$ unzip /tmp/vosk-model-small-ja-0.22.zip -d /tmp

<launch>
  <arg name="audio_topic" default="/audio" doc="Name of audio topic captured from microphone" />
  <arg name="voice_topic" default="/speech_to_text" doc="Name of text topic of recognized speech" />
  <arg name="n_channel" default="1" doc="Number of channels of audio topic and microphone. '$ pactl list short sinks' to check your hardware" />
  <arg name="depth" default="16" doc="Bit depth of audio topic and microphone. '$ pactl list short sinks' to check your hardware" />
  <arg name="sample_rate" default="16000" doc="Frame rate of audio topic and microphone. '$ pactl list short sinks' to check your hardware"/>
  <arg name="device" default="" doc="Card and device number of microphone (e.g. hw:0,0). You can check the card number and device number with '$ arecord -l', then use hw:[card number],[device number]" />
  <arg name="engine" default="Vosk" doc="Speech to text engine. Google, GoogleCloud, Sphinx, Wit, Bing, Houndify, IBM, Vosk" />
  <arg name="language" default="en-US" doc="Speech to text language. For Japanese, set ja" />
  <arg name="continuous" default="true" doc="If false, /speech_recognition service is published. If true, /speech_to_text topic is published." />
  <arg name="auto_start" default="true" doc="Whether speech_recognition starts automatically or not. This parameter works when continuous is true" />

  <arg name="self_cancellation" default="true" doc="If true, do not recognize audio while the robot is speaking." />
  <arg name="tts_tolerance" default="1.0" doc="Tolerance in seconds when deciding whether the robot is speaking" />
  <arg name="tts_action_names" default="['sound_play']" doc="TTS action names. Output from these action servers is ignored by speech recognition" />

  <node name="speech_recognition"
        pkg="ros_speech_recognition" type="speech_recognition_node.py"
        respawn="true"
        output="screen">
    <rosparam subst_value="true">
      audio_topic: $(arg audio_topic)
      voice_topic: $(arg voice_topic)
      n_channel: $(arg n_channel)
      depth: $(arg depth)
      sample_rate: $(arg sample_rate)
      engine: $(arg engine)
      language: $(arg language)
      continuous: $(arg continuous)
      auto_start: $(arg auto_start)
      self_cancellation: $(arg self_cancellation)
      tts_tolerance: $(arg tts_tolerance)
      tts_action_names: $(arg tts_action_names)
      vosk_model_path: /tmp/vosk-model-small-ja-0.22
    </rosparam>
  </node>

</launch>

nakane11 commented 1 year ago

Actual speech is rarely misrecognized, but silence sometimes produces spurious results. To avoid that, I use a filter like https://github.com/nakane11/navigation_pr2/blob/d57d4253b36c96a0679753cda7a254dd8e333c1d/node_scripts/filter_vosk.py
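
For readers who do not want to open the link, here is a minimal sketch of what such a filter could look like. It is not taken from filter_vosk.py: the function name, the filler list, and the length threshold are all assumptions made for illustration. The idea is simply to drop transcripts that are empty or consist only of a short filler sound.

```python
# Hypothetical post-filter for recognized text. The names and thresholds here
# are illustrative assumptions, not the contents of the linked filter_vosk.py.
FILLER_WORDS = {"ああ", "あ", "えー", "huh", "the"}


def filter_transcripts(transcripts, fillers=FILLER_WORDS, min_chars=2):
    """Return only the transcripts that look like real speech."""
    kept = []
    for text in transcripts:
        # Ignore the spaces that may be inserted between tokens.
        normalized = text.replace(" ", "").strip()
        if len(normalized) < min_chars:
            continue  # empty or too short to be meaningful
        if normalized in fillers:
            continue  # filler sound that may come from background noise
        kept.append(text)
    return kept
```

In a ROS node this function would sit between the recognizer output and the publisher, so that downstream subscribers only see the filtered candidates.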

mqcmd196 commented 1 year ago

How about downloading the model when this node is launched for the first time?

nakane11 commented 1 year ago

Thank you!
It's a good idea. I'll update it.

nakane11 commented 1 year ago

@mqcmd196 If vosk_model_path is specified, that model is loaded with priority. If vosk_model_path is not set and language is en-US or ja, the models already downloaded to trained_data are used.
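
The priority rule described above can be sketched roughly as follows. This is an illustration rather than the code in the PR: `resolve_model_path` and `DEFAULT_MODELS` are hypothetical names, and the English model directory name is an assumption (only the Japanese one appears in this thread).

```python
import os

# Assumed mapping from language code to model directory name.
# The Japanese name appears in this thread; the English one is an assumption.
DEFAULT_MODELS = {
    "ja": "vosk-model-small-ja-0.22",
    "en-US": "vosk-model-small-en-us-0.15",
}


def resolve_model_path(vosk_model_path, language, data_dir):
    """Pick the model directory: explicit path first, bundled model second."""
    if vosk_model_path:
        # An explicitly configured path always wins.
        return vosk_model_path
    model_name = DEFAULT_MODELS.get(language)
    if model_name is None:
        # Fail early with a readable message instead of passing None onward.
        raise ValueError(
            "no bundled Vosk model for language %r; set vosk_model_path" % language
        )
    return os.path.join(data_dir, model_name)
```

Raising for unsupported languages is a design choice made in this sketch so that a missing model is reported once at startup rather than on every recognition request.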

tkmtnt7000 commented 1 year ago

With the current implementation, I get the following error when I don't set ~vosk_model_path manually, as the README says.

tsukamoto@tsukamoto-desktop-ryzen ~/ros/fetch_ws/src/jsk-ros-pkg/jsk_3rdparty/ros_speech_recognition/src (vosk *%) 
$ ROSCONSOLE_FORMAT='[${severity}] [${time}] [${node}]: [${message}]' roslaunch ros_speech_recognition speech_recognition.launch engine:=Vosk launch_sound_play:=false language:=ja
... logging to /home/tsukamoto/.ros/log/d506cd58-f38e-11ed-9492-937962485673/roslaunch-tsukamoto-desktop-ryzen-1067313.log
Checking log directory for disk usage. This may take a while.
Press Ctrl-C to interrupt
WARNING: disk usage in log directory [/home/tsukamoto/.ros/log] is over 1GB.
It's recommended that you use the 'rosclean' command.

started roslaunch server http://tsukamoto-desktop-ryzen:43425/

SUMMARY
========

PARAMETERS
 * /audio_capture/channels: 1
 * /audio_capture/depth: 16
 * /audio_capture/device: 
 * /audio_capture/format: wave
 * /audio_capture/sample_rate: 16000
 * /rosdistro: noetic
 * /rosversion: 1.16.0
 * /speech_recognition/audio_topic: /audio
 * /speech_recognition/auto_start: True
 * /speech_recognition/continuous: True
 * /speech_recognition/depth: 16
 * /speech_recognition/enable_sound_effect: False
 * /speech_recognition/engine: Vosk
 * /speech_recognition/language: ja
 * /speech_recognition/n_channel: 1
 * /speech_recognition/sample_rate: 16000
 * /speech_recognition/self_cancellation: True
 * /speech_recognition/tts_action_names: ['sound_play']
 * /speech_recognition/tts_tolerance: 1.0
 * /speech_recognition/voice_topic: /speech_to_text

NODES
  /
    audio_capture (audio_capture/audio_capture)
    speech_recognition (ros_speech_recognition/speech_recognition_node.py)
    speech_recognition_candidates_to_string (ros_speech_recognition/speech_recognition_candidates_to_string.py)

ROS_MASTER_URI=http://localhost:11311

process[audio_capture-1]: started with pid [1067327]
process[speech_recognition-2]: started with pid [1067328]
process[speech_recognition_candidates_to_string-3]: started with pid [1067333]
[WARN] [1684207609.205358] [/speech_recognition_candidates_to_string]: [[/speech_recognition_candidates_to_string] subscribes topics only with child subscribers. Set '~always_subscribe' as True to have it subscribe always.]
[ERROR] [1684207609.253494] [/speech_recognition]: [action 'sound_play' is not initialized.]
[INFO] [1684207609.270212] [/speech_recognition]: [Enabled continuous mode]
[INFO] [1684207609.270874] [/speech_recognition]: [Auto start: True]
[INFO] [1684207609.753036] [/speech_recognition]: [Set minimum energy threshold to 358.4727721173039]
[WARN] [1684207613.506773] [/speech_recognition]: [data_path: /home/tsukamoto/ros/fetch_ws/src/jsk-ros-pkg/jsk_3rdparty/ros_speech_recognition/trained_data]
[WARN] [1684207613.507372] [/speech_recognition]: [model_path_before: None]
[INFO] [1684207613.507909] [/speech_recognition]: [Loading model from /home/tsukamoto/ros/fetch_ws/src/jsk-ros-pkg/jsk_3rdparty/ros_speech_recognition/trained_data/vosk-model-small-ja-0.22]
LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=13 max-active=7000 lattice-beam=4
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /home/tsukamoto/ros/fetch_ws/src/jsk-ros-pkg/jsk_3rdparty/ros_speech_recognition/trained_data/vosk-model-small-ja-0.22/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /home/tsukamoto/ros/fetch_ws/src/jsk-ros-pkg/jsk_3rdparty/ros_speech_recognition/trained_data/vosk-model-small-ja-0.22/graph/HCLr.fst /home/tsukamoto/ros/fetch_ws/src/jsk-ros-pkg/jsk_3rdparty/ros_speech_recognition/trained_data/vosk-model-small-ja-0.22/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:308) Loading winfo /home/tsukamoto/ros/fetch_ws/src/jsk-ros-pkg/jsk_3rdparty/ros_speech_recognition/trained_data/vosk-model-small-ja-0.22/graph/phones/word_boundary.int
[INFO] [1684207614.227814] [/speech_recognition]: [Result: b'\xe3\x81\x82\xe3\x81\x82']
[INFO] [1684207619.698366] [/speech_recognition]: [Loading model from None]
lang None does not exist
[ERROR] [1684207622.279914] [/speech_recognition]: [Unexpected error: (<class 'SystemExit'>, SystemExit(1), <traceback object at 0x7f722adf6b40>)]
Exception ignored in: <function Model.__del__ at 0x7f722cb3faf0>
Traceback (most recent call last):
  File "/home/tsukamoto/ros/fetch_ws/devel/.private/ros_speech_recognition/share/ros_speech_recognition/venv/lib/python3.8/site-packages/vosk/__init__.py", line 60, in __del__
    _c.vosk_model_free(self._handle)
AttributeError: 'Model' object has no attribute '_handle'
[INFO] [1684207634.369576] [/speech_recognition]: [Loading model from None]
lang None does not exist
[ERROR] [1684207635.364335] [/speech_recognition]: [Unexpected error: (<class 'SystemExit'>, SystemExit(1), <traceback object at 0x7f722b0571c0>)]
Exception ignored in: <function Model.__del__ at 0x7f722cb3faf0>
Traceback (most recent call last):
  File "/home/tsukamoto/ros/fetch_ws/devel/.private/ros_speech_recognition/share/ros_speech_recognition/venv/lib/python3.8/site-packages/vosk/__init__.py", line 60, in __del__
    _c.vosk_model_free(self._handle)
AttributeError: 'Model' object has no attribute '_handle'
[INFO] [1684207638.883466] [/speech_recognition]: [Loading model from None]
lang None does not exist
[ERROR] [1684207640.547415] [/speech_recognition]: [Unexpected error: (<class 'SystemExit'>, SystemExit(1), <traceback object at 0x7f7229de8e00>)]
Exception ignored in: <function Model.__del__ at 0x7f722cb3faf0>
Traceback (most recent call last):
  File "/home/tsukamoto/ros/fetch_ws/devel/.private/ros_speech_recognition/share/ros_speech_recognition/venv/lib/python3.8/site-packages/vosk/__init__.py", line 60, in __del__
    _c.vosk_model_free(self._handle)
AttributeError: 'Model' object has no attribute '_handle'
^C[speech_recognition_candidates_to_string-3] killing on exit
[speech_recognition-2] killing on exit
[audio_capture-1] killing on exit
[speech_recognition-2] escalating to SIGTERM
[speech_recognition-2] escalating to SIGKILL
Shutdown errors:
 * process[speech_recognition-2, pid 1067328]: required SIGKILL. May still be running.
shutting down processing monitor...
... shutting down processing monitor complete
done
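
In the log above, the first recognition loads the model from trained_data, while later ones try to load from None and exit. I have not confirmed the cause in the PR code. One defensive pattern that would avoid this class of failure is to load the model once and reject a None path explicitly. The sketch below assumes a hypothetical `ModelCache` helper; in the real node the `loader` would presumably be the Vosk model constructor, but here it is any callable so the example stays self-contained.

```python
class ModelCache:
    """Load a model once per path and reuse it for later requests."""

    def __init__(self, loader):
        # loader: callable taking a path and returning a model object
        # (assumed to be the Vosk model constructor in the real node).
        self._loader = loader
        self._path = None
        self._model = None

    def get(self, model_path):
        if model_path is None:
            # Refuse early instead of letting the loader guess a default.
            raise ValueError("model_path must not be None")
        if self._model is None or model_path != self._path:
            self._model = self._loader(model_path)
            self._path = model_path
        return self._model
```

With this pattern the path is resolved once at startup, every recognition call goes through `get`, and a missing path surfaces as a normal exception that the node can log.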

tkmtnt7000 commented 1 year ago

The test fails on the venv check for locked dependencies.

[ros_speech_recognition:results] Full test results for 'test_results/ros_speech_recognition/venv_check-ros_speech_recognition-requirements.xml'
[ros_speech_recognition:results] -------------------------------------------------
[ros_speech_recognition:results] <?xml version='1.0' encoding='utf-8'?>
[ros_speech_recognition:results] <testsuite name="venv_check" tests="1" failures="1" errors="0"><testcase name="check_locked" classname="catkin_virtualenv.Venv"><failure message="/home/tsukamoto/ros/fetch_ws/src/jsk-ros-pkg/jsk_3rdparty/ros_speech_recognition/requirements.txt is not fully locked">Consider defining INPUT_REQUIREMENTS to have catkin_virtualenv generate a lock file for this package.
[ros_speech_recognition:results] See https://github.com/locusrobotics/catkin_virtualenv/blob/master/README.md#locking-dependencies.
[ros_speech_recognition:results] The following changes would fully lock /home/tsukamoto/ros/fetch_ws/src/jsk-ros-pkg/jsk_3rdparty/ros_speech_recognition/requirements.txt:
[ros_speech_recognition:results] --- 
[ros_speech_recognition:results] 
[ros_speech_recognition:results] +++ 
[ros_speech_recognition:results] 
[ros_speech_recognition:results] @@ -1,2 +1,12 @@
[ros_speech_recognition:results] 
[ros_speech_recognition:results] +certifi==2023.5.7
[ros_speech_recognition:results] +cffi==1.15.1
[ros_speech_recognition:results] +charset-normalizer==3.1.0
[ros_speech_recognition:results] +idna==3.4
[ros_speech_recognition:results] +pycparser==2.21
[ros_speech_recognition:results] +requests==2.30.0
[ros_speech_recognition:results]  speechrecognition==3.9.0
[ros_speech_recognition:results] +srt==3.5.3
[ros_speech_recognition:results] +tqdm==4.65.0
[ros_speech_recognition:results] +urllib3==2.0.2
[ros_speech_recognition:results]  vosk==0.3.45
[ros_speech_recognition:results] +websockets==11.0.3</failure></testcase></testsuite>
[ros_speech_recognition:results] -------------------------------------------------
[ros_speech_recognition:results] test_results/ros_speech_recognition/venv_check-ros_speech_recognition-requirements.xml: 1 tests, 0 errors, 1 failures, 0 skipped
[ros_speech_recognition:results] Summary: 4 tests, 0 errors, 1 failures, 0 sk

You should either set CHECK_VENV to FALSE in CMakeLists.txt or add the other dependencies suggested by the test to requirements.txt.

nakane11 commented 1 year ago

@tkmtnt7000 Thank you. I set CHECK_VENV as FALSE.

k-okada commented 1 year ago

closed via #474

jsk-ros-pkg / jsk_3rdparty

[ros_speech_recognition] Add vosk engine #462
