rhasspy / hassio-addons

Add-ons for Home Assistant's Hass.IO
MIT License

vosk expansion_rules don't work #43

Open h3ndrik opened 1 year ago

h3ndrik commented 1 year ago

Once I add an expansion_rules section to my de.yaml, the Vosk add-on crashes on use (when speech gets sent).

sentences:
  - schalte das licht ein
  - (schalte|mach) das licht (im|in der|in dem) {area} (ein|an|aus)
lists:
  area:
    values:
      - wohnzimmer
      - küche
      - flur
      - badezimmer
expansion_rules:
  artikel: [der|die|das]

(Works fine if I delete the expansion_rules paragraph.)

Debug Log of the VOSK Add-on:

[22:37:18] INFO: Successfully sent discovery information to Home Assistant.
s6-rc: info: service discovery successfully started
s6-rc: info: service legacy-services: starting
s6-rc: info: service legacy-services successfully started
ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='Task-8' coro=<AsyncEventHandler.run() done, defined at /usr/local/lib/python3.11/dist-packages/wyoming/server.py:28> exception=AttributeError("'list' object has no attribute 'strip'")>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/wyoming/server.py", line 35, in run
    if not (await self.handle_event(event)):
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/wyoming_vosk/__main__.py", line 282, in handle_event
    text = self._fix_transcript(original_text)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/wyoming_vosk/__main__.py", line 327, in _fix_transcript
    lang_config = load_sentences_for_language(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/wyoming_vosk/sentences.py", line 107, in load_sentences_for_language
    expansion_rules[rule_name] = hassil.parse_sentence(rule_text)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/hassil/parse_expression.py", line 189, in parse_sentence
    text = text.strip()
           ^^^^^^^^^^
AttributeError: 'list' object has no attribute 'strip'

synesthesiam commented 1 year ago

You may need to quote [der|die|das] in the YAML. It's probably interpreting it as a list.

expansion_rules:
  artikel: "[der|die|das]"
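
For context on the traceback above: YAML reads an unquoted `[der|die|das]` as a flow sequence with a single element, so the loader hands a Python list to `hassil.parse_sentence`, which expects a string. Below is a minimal sketch of a guard that would turn the crash into a readable message. The function name `check_expansion_rules` is hypothetical and not part of wyoming-vosk, and the two dicts stand in for what a YAML loader would return.

```python
def check_expansion_rules(rules: dict) -> dict:
    """Return the rules unchanged if every value is a string, else raise."""
    for name, text in rules.items():
        if not isinstance(text, str):
            raise TypeError(
                f"expansion rule {name!r} must be a string, got "
                f"{type(text).__name__}; try quoting it in the YAML"
            )
    return rules


# Stand-in for what a YAML loader yields for   artikel: [der|die|das]
unquoted = {"artikel": ["der|die|das"]}
# Stand-in for what it yields for              artikel: "[der|die|das]"
quoted = {"artikel": "[der|die|das]"}

check_expansion_rules(quoted)  # passes silently

try:
    check_expansion_rules(unquoted)
except TypeError as err:
    message = str(err)
    print(message)
```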

h3ndrik commented 1 year ago

Ah, nice. Works now. Thank you very much. You might want to add that to the example in "vosk/DOCS.md" here and in the "README.md" of wyoming-vosk.

I've tried it exactly like it's written there and that also didn't work. I'm going to leave this issue open in case you want to update the documentation. But it's solved for me, feel free to close this issue.

synesthesiam commented 1 year ago

Thanks! I'll update the example and the docs.

Any thoughts on the add-on itself? Can you share what your use case is maybe? I haven't promoted it at all yet. I'm thinking of making a tutorial video.

h3ndrik commented 1 year ago

I'm a fan of your work. I've been using Rhasspy and Romkabouter's ESP32 Satellite before. Nothing properly productive, mainly tinkering around. I'm currently digging into how this one works.

My general thoughts are: Wow boy is there much work to do on the microcontroller side... Silence detection and VAD only kinda work with the ADF (which unfortunately isn't open source, which isn't great at all) and the media_player component and some others don't work with the ESP-IDF requirement. openWakeWord works but it tears down the pipeline and fires on_start and on_end every few seconds. And I'm still debugging stuff so my hard disk gets filled with recordings of silence and random stuff. I'd like this to be easier for someone who is new to this stuff. But... I even managed to train my own wake-word. That's awesome.

Perspectively I'd like something like all the cool signal processing that's available in the big voice assistants. Being able to play music and subtract the output from the microphone so we can simultaneously listen to music and instruct it to stop. Have microphone arrays and far-field voice control, beam-forming and speaker recognition available. I suppose you've at one point seen how the Amazon bugging devices work, the signal processing really adds to the real-world usability. But we're still missing the absolute basics here. (I always preferred projects like ESP8266Audio to the ESP-ADF because it's free software. But there is no signal processing available and it's mainly for outputting sound.)

Whisper is a bit slow on my old server. And I really liked the idea of constraining the STT to the predefined sentences. (I'm currently porting the stuff from my Rhasspy add-on config, but I'm still struggling with ESPHome instead.) It immediately makes it blazing fast and does away with problems like a preposition not being transcribed correctly. For a wider audience it would be great if the sentences came from what HA is able to understand (automatically). But it doesn't seem like this is the main concern at this point. (And I've played with VOSK before. It's really easy to write a few shell scripts or small Python scripts to integrate it into your own small projects. I had tied it into an Asterisk telephony server at some point.)

I think I'm going to file more bug reports once I get to dig into the VOSK add-on. Currently the in/out replace doesn't work for me. It always gives me the fixed sentence back, but without the replacement being done. And I'd need that for words which aren't in the keywords file (like the 'loo mo ss' example), only with German compound words that get spaces inserted in between.

My main use-case would be a voice assistant for the kitchen that can play music, set timers, tell you a joke and announce the weather in the morning, the delay on public transport, birthdays and appointments of the day. And add things to the shopping list. I take it for granted that I can also turn on and off some lights in the house. (I'd scatter around a few more ESP32s to announce things in other rooms and play music, once it becomes useful.) And the last thing, I'm fooling around with LLMs (Artificial intelligence). An AI agent could give the house a proper personality and be tied into HA to control everything like a ship's computer on Star Trek does. That's maybe something to consider after the year after the Year of the Voice.

synesthesiam commented 1 year ago

Thanks for the feedback @h3ndrik! I've updated the Vosk add-on to (hopefully) fix the in/out replace issue.

My general thoughts are: Wow boy is there much work to do on the microcontroller side

Agreed. Hardware is so varied and moving so fast that it's hard to make progress. With Espressif especially, they keep deprecating boards by the time I get something working on them :smile:

Perspectively I'd like something like all the cool signal processing that's available in the big voice assistants.

I got an Echo Dot for testing and wow, it can hear you through just about anything. I don't know that we'll ever get there, honestly. Maybe if the big players give up fully on voice and sell their tech to someone willing to make chips that the rest of us can use.

For a wider audience it would be great if the sentences came from what HA is able to understand (automatically).

This is the plan, actually. I need an API on the Home Assistant side to get the entities and areas that have been exposed to Assist. With that, I can just plug those lists into the default intents and generate the possible sentences.

I'm fooling around with LLMs (Artificial intelligence).

They're getting faster and faster, so I'm hopeful that next year we'll be able to run a local LLM and use it with Home Assistant. I'm seeing more experiments where they constrain the LLM to produce JSON, for example. That would let you interface it to HA much more easily, and still produce interesting responses (inside the JSON).
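
As a rough illustration of why JSON output is easier to interface: once the reply is JSON, plain parsing plus a few checks is enough to separate the action from the spoken response. The field names (`intent`, `name`, `response`), the allowed intent subset, and the function `parse_llm_reply` are assumptions made up for this sketch, not an existing Home Assistant API.

```python
import json

# Assumed subset of intents the sketch is willing to act on.
ALLOWED_INTENTS = {"HassTurnOn", "HassTurnOff"}


def parse_llm_reply(raw: str) -> dict:
    """Parse a JSON reply and check it against the hypothetical schema."""
    reply = json.loads(raw)
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    if reply.get("intent") not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {reply.get('intent')!r}")
    if not isinstance(reply.get("name"), str):
        raise ValueError("missing entity name")
    # The free-form text lives inside the JSON, next to the structured part.
    reply.setdefault("response", "")
    return reply


raw_reply = (
    '{"intent": "HassTurnOn", "name": "Wohnzimmerlicht", '
    '"response": "Das Licht ist an."}'
)
parsed = parse_llm_reply(raw_reply)
print(parsed["intent"], parsed["name"])
```

Anything that fails the checks can be rejected before it ever reaches the smart home, which is the main appeal of constraining the output format.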

Thanks again for testing and following my work!

h3ndrik commented 1 year ago

fix the in/out replace issue

Thank you very much. Can confirm it works and I've closed that issue.

Espressif [...] keep deprecating boards by the time I get something working on them

Hehe. I still have some older ESP32 boards (not by Espressif) in my drawer. Mainly because I like to start hobby projects and don't finish them. But sometimes I pull out something like the old TAudio board, which I'm currently testing this on.

[signal processing] I don't know that we'll ever get there, honestly. Maybe if the big players give up fully on voice and sell their tech to someone willing to make chips that the rest of us can use.

Sadly I don't know much about signal processing. I've searched the internet for libraries and algorithms for noise suppression, echo cancellation and voice stuff. Seems there isn't anything good available to tinkerers like me. Mostly companies selling their proprietary solutions and DSPs. I'd like to get some microphone array board, but it would need to come with the signal processing already implemented. (And in a way that allows me to poke around.) ESPHome just supports the basics regarding audio. I'd like to see more implemented there. I've opened PR https://github.com/esphome/esphome/pull/5613 to hopefully learn a bit and have a place to start.

[...] next year we'll be able to run a local LLM

Things are still moving crazy fast. I run Home Assistant on an (old) server, so I'm not as constrained as someone with a single-board computer would be. The server doesn't have a GPU, but I can run llama.cpp in a different virtual machine and I'm willing to connect it to the smart home at some point. I'm aware of llama.cpp's feature to constrain it to some grammar, like outputting JSON.

In my opinion smaller models like Mistral 7B are surprisingly capable and still fast on a regular computer. And they know a lot of things. Probably enough to be able to interact with me. I think with models the size of Microsoft's phi-1 (but tuned for this use-case) we could have it run on a single-board computer.

I'm still not completely sold on the idea of having LLMs and smart assistants in my life. They're nice, but on the other hand I can already do lots of stuff the way it is.

h3ndrik commented 1 year ago

Wow, the expansion rules expand fast. I've added the HA intent sentences to turn on and off devices, lights and set brightness and color. With optional articles, prepositions and areas. (to the Vosk sentences)

Now it says "Loading /share/vosk/sentences/de.yaml" for a minute and then the Vosk Addon kills the async event handler ;-)

It stopped displaying the list when it got to a 4 or 5 digit length... Both limiting sentences and correcting them doesn't deal with that amount.

I don't know enough about Vosk to make any recommendations here. But it seems ingesting that sentences file at runtime doesn't scale anywhere close to real-world usage.

I've gone back to Faster-Whisper, but it always gets most of the sentence right and one character or word wrong. ("Schalte das Wohnzimmerlicht ein" -> "Schalte das Wohnzimmer nicht ein" ("Don't turn on the living room")) Meh.
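
One way to picture the "correcting" idea for a near-miss like this is to snap the transcript to the closest sentence from a known list. The sketch below uses Python's standard `difflib` purely as a stand-in; it is not how wyoming-vosk does its correction, and the sentence list, the cutoff of 0.8, and the function name `snap_to_known` are assumptions for illustration.

```python
import difflib

# Hypothetical set of sentences the assistant is expected to understand.
KNOWN_SENTENCES = [
    "schalte das wohnzimmerlicht ein",
    "schalte das wohnzimmerlicht aus",
    "schalte das flurlicht ein",
]


def snap_to_known(transcript: str, cutoff: float = 0.8):
    """Return the closest known sentence, or None if nothing is close enough."""
    matches = difflib.get_close_matches(
        transcript.lower(), KNOWN_SENTENCES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


fixed = snap_to_known("Schalte das Wohnzimmer nicht ein")
print(fixed)
```

The cutoff matters: set too low, unrelated speech gets forced onto a command; set too high, small transcription slips are no longer repaired.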

synesthesiam commented 1 year ago

Can you post the YAML here so I can benchmark it?

synesthesiam commented 1 year ago

Update: I've switched to using an sqlite database to store the sentences, and only giving vosk the available words. On a Raspberry Pi 4, it only takes 1.34 seconds to generate 22,786 sentences, and 0.01 seconds to load the recognizer.
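
A minimal sketch of that idea, assuming a single-column table and a toy template: expand the sentences once, store them in SQLite for lookups, and keep only the set of distinct words for the recognizer. The schema, the template, and the helper `is_known` are made up for illustration and are not the add-on's actual implementation.

```python
import itertools
import sqlite3

verbs = ["schalte", "mach"]
areas = ["wohnzimmer", "flur", "badezimmer"]
states = ["ein", "an", "aus"]

# Expand one template into every concrete sentence.
sentences = [
    f"{verb} das licht im {area} {state}"
    for verb, area, state in itertools.product(verbs, areas, states)
]

# Store the sentences in a database (in memory here, a file in practice).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (text TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT OR IGNORE INTO sentences (text) VALUES (?)",
    [(s,) for s in sentences],
)
conn.commit()

# The recognizer only needs the vocabulary, not every sentence.
vocabulary = sorted({word for s in sentences for word in s.split()})


def is_known(text: str) -> bool:
    """Check a transcript against the stored sentences."""
    row = conn.execute(
        "SELECT 1 FROM sentences WHERE text = ?", (text,)
    ).fetchone()
    return row is not None


print(len(sentences), "sentences,", len(vocabulary), "distinct words")
```

The vocabulary grows far more slowly than the sentence count, which is why handing the recognizer only the words keeps loading fast.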

h3ndrik commented 10 months ago

Well, I can still make it hang for a few minutes if I try something like the following (setting brightness in percent). After that, some async worker will generate an error message, but at least it seems to generate the sqlite database for the next pipeline run.

sentences:
# light_HassLightSet
  - "<setzen> [<artikel>] Helligkeit von <name> auf {brightness} [Prozent] [ein]"
  - "[<artikel>] Helligkeit von <name> auf {brightness} [Prozent] <setzen>"
  - "dimme [[<artikel>] Helligkeit [von|vom] [<artikel>]] <name> [auf|zu] {brightness} [Prozent]"
  - "<name> [auf|zu] {brightness} [Prozent] dimmen"
#  - in: "dimme <name>"
#    out: "Setze Helligkeit von <name> auf 25"
lists:
  device:
    values:
      - in: fernseher
        out: Wohnzimmer TV
      - in: licht
        out: Deckenlicht Wohnzimmer
      - in: wohnzimmer licht
        out: Wohnzimmerlicht
      - in: deko licht
        out: Dekolicht
      - in: flur licht
        out: Flurlicht
      - in: licht am esstisch
        out: Esstischbeleuchtung
      - in: küchen beleuchtung
        out: Küchenbeleuchtung
      - in: licht in der küche
        out: Küchenbeleuchtung
  brightness:
    values:
      - in: ein
        out: 1
      - in: eins
        out: 1
      - in: fünf
        out: 5
      - in: zehn
        out: 10
      - in: fünfzehn
        out: 15
      - in: zwanzig
        out: 20
      - in: fünfundzwanzig
        out: 25
      - in: dreißig
        out: 30
      - in: vierzig
        out: 40
      - in: fünfzig
        out: 50
      - in: sechzig
        out: 60
      - in: siebzig
        out: 70
      - in: fünfundsiebzig
        out: 75
      - in: achtzig
        out: 80
      - in: fünfundachtzig
        out: 85
      - in: neunzig
        out: 90
      - in: fünfundneunzig
        out: 95
      - in: neunundneunzig
        out: 99
      - in: hundert
        out: 100
  color:
    values:
      - in: "wei(ß|ss)"
        out: "white"
      - in: "schwarz"
        out: "black"
      - in: "rot"
        out: "red"
      - in: "orange"
        out: "orange"
      - in: "gelb"
        out: "yellow"
      - in: "grün"
        out: "green"
      - in: "blau"
        out: "blue"
      - in: "violett"
        out: "purple"
      - in: "lila"
        out: "purple"
      - in: "braun"
        out: "brown"

expansion_rules:
  artikel_bestimmt: "(der|die|das|dem|der|den|des)"
  artikel_unbestimmt: "(ein|eine|eines|einer|einem|einen)"
  artikel: "(<artikel_bestimmt>|<artikel_unbestimmt>)"
  name: "[<artikel>] {device}"
  setzen: "(setz[e|en]|stell[e|en]|einstellen|änder[e|n]|veränder[e|n])"
  licht: "[<artikel>] (Licht|Lampe|Beleuchtung)"
  brightness: "{brightness} [Prozent]"
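
A back-of-the-envelope count suggests why this file is slow to ingest. The sketch below assumes the usual template semantics (parts in sequence multiply, `(a|b)` alternatives add, `[x]` adds one empty choice) and counts raw expansions as posted, without removing duplicates. The helper names `seq`, `alt`, and `opt` are made up for this illustration.

```python
from math import prod


def seq(*counts: int) -> int:
    """Parts in sequence multiply."""
    return prod(counts)


def alt(*counts: int) -> int:
    """Alternatives (a|b) add up."""
    return sum(counts)


def opt(count: int) -> int:
    """An optional [x] adds the empty choice."""
    return count + 1


devices, brightness_values = 8, 19   # entries in the two lists above
artikel = alt(7, 6)                  # artikel_bestimmt + artikel_unbestimmt
name = seq(opt(artikel), devices)    # "[<artikel>] {device}"
setzen = alt(3, 3, 1, 3, 3)          # e.g. setz[e|en] -> setz, setze, setzen
von_vom = auf_zu = opt(alt(1, 1))    # "[von|vom]" and "[auf|zu]"

templates = {
    "<setzen> ... [ein]": seq(
        setzen, opt(artikel), name, brightness_values, opt(1), opt(1)
    ),
    "... <setzen>": seq(opt(artikel), name, brightness_values, opt(1), setzen),
    "dimme ...": seq(
        opt(seq(opt(artikel), von_vom, opt(artikel))),
        name, auf_zu, brightness_values, opt(1),
    ),
    "... dimmen": seq(name, auf_zu, brightness_values, opt(1)),
}
total = sum(templates.values())

for label, count in templates.items():
    print(f"{label}: {count:,}")
print(f"total: {total:,}")
```

Under these assumptions the four templates expand to roughly 9.9 million raw sentences, with about three quarters of them coming from the nested optionals in the `dimme` template.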