toverainc / willow

Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative
https://heywillow.io/
Apache License 2.0
2.61k stars 96 forks source link

TTS Audio Issues #159

Closed nikito closed 1 year ago

nikito commented 1 year ago

When getting certain TTS responses, I either see errors on the console, or the output is truncated. Looking at code I see in hass.c the home assistant TTS response seems to be limited to 64 characters (not sure if that is due to a hardware limitation?) which I believe is the reason for the truncated responses, however the other random errors I am uncertain of as the text is well below 64 characters. Here's a snapshot of an example that breaks: image EDIT: For this example I think it may be due to the whitespace the response is generating, seems the TTS doesn't like all the spaces there. Will try to see if clearing that up help (thinking the Hass templating logic is putting a bunch of newlines there or something due to the way my template is structured). EDIT2: I tested the tts a bit more, and I think it broke on this particular request but the spaces may not have been the reason; I tried the same request and added more spaces and it worked fine. Also seems that the failed attempt is also cached so it always will fail for that link. Will still try to fix the whitespace but not sure if there's a deeper issue there. 😆 EDIT3 (sorry for all the edits, don't want to flood the thread with unnecessary comments as I dig further): Fixing the whitespace seems to have fixed the above example. Haven't found any other examples with the problem so far, so it may have been related to the extraneous whitespace.

Spoke too soon, seems the whitespace fix didn't really work. This example failed, and doesn't have extra whitespace: I (06:40:33.248) WILLOW: Using WIS TTS URL 'https://willow-inference-server:19000/api/tts?speaker=CLB&text=Backup status is nominal. Core systems are nominal.' E (06:40:34.649) HTTP_STREAM: Invalid HTTP stream, status code = -1 E (06:40:34.650) AUDIO_ELEMENT: [IN_http] AEL_STATUS_ERROR_OPEN,-1 E (55363525) ESP_AUDIO_TASK: Got element[IN_http] error:ESP_ERR_AUDIO_OPEN, stop pipeline[0] E (55363534) AUDIO_FORMAT: failed to get head buf data(line 175) E (55363540) ESP_DECODER: Audio type detect error (line 147) I (06:40:34.678) WILLOW: WIS TTS playback finished I (06:40:44.680) WILLOW: Wake LCD timeout, turning off LCD

Here's another example that has never had extraneous whitespace that also failed: image

kristiankielhofner commented 1 year ago

Can you try building Willow with higher HASS_SPEECH_MAX_LEN defined? It uses strncpy but it should be fine with higher values.

In terms of the additional spaces, that's generally "not good" and I can't think of a reason why someone would want text strings with extra whitespace passed to the model. This would be easy enough to strip on the WIS side.

In terms of caching, we do need to implement better cache control. What we've been doing ourselves is just rm -rf nginx/cache but nginx supports the ability to define routes for cache control that allow for network requests to purge cache items. However, this is only available with a commercial subscription :(. I've seen some config hacks floating around to kind of attempt this with the open source version of nginx but they're not pretty...

There is also the ability to set max_length with SpeechT5 on the WIS side. The default is 600 tokens but mapping characters/words to tokens is tricky. We could investigate increasing that value as well.

nikito commented 1 year ago

Ok I'll try to bump the value and rebuild/test. The whitespace I managed to get rid of, had some jinja2 templates where I forgot to add the trim characters to strip it between template conditionals, so that's no big deal. Still not sure why that last example I posted failed though, pretty normal string with no extraneous whitespace.

Is the nginx cache in the WIS container? or is it in the nginx one?

nikito commented 1 year ago

So I bumped the size up to 128, which is better but still can't handle some larger replies. May try 256 just to see what happens. I tried a smaller version of my weather request, which just gives temperature and feelslike temperature, but it failed with the weird errors I showed above: image

kristiankielhofner commented 1 year ago

The nginx cache is in nginx :). What's really nice about the nginx cache is that WIS doesn't even see the TTS request and we can leverage all of the performance made available by the nginx cache and nginx generally.

In terms of these weird errors, can you provide the output from WIS with LOG_LEVEL=debug set in .env ?

I'll be doing another deep dive on TTS next week and may be looking for some additional testing...

nikito commented 1 year ago

So funny thing, I bumped up the hass reply size to 1024, and did my weather question. Worked perfectly with TTS lol. Likewise, I tried the issue above and that also worked fine. But then I tried asking "What lights are on" and got this in my monitor console: image

Looking on WIS, I see this: image

Looking at this, seems I am getting permission denied somewhere? Not sure why it'd do that sometimes and not others, very odd 😮

kristiankielhofner commented 1 year ago

What happens when you curl, wget, or try to go to that WIS TTS url in a browser?

Also - can you possibly provide this output as text instead of screenshots? It's slow and error-prone for me to read these URLs and output and try to reproduce locally.

nikito commented 1 year ago

So, I went ahead and deleted the nginx cache, and now it seems fine so far. I tried all the above examples and every one worked perfectly. So maybe my cache was corrupted somehow?

Also seems the 1024 value for the speech array and HASS reply is good for my examples, though not sure if we should bump this up to a larger value in case larger replies come along (thinking in the future if people start using the chatbot, it may generate long TTS replies)? I can experiment with bumping it up further if you want?

kristiankielhofner commented 1 year ago

Nginx cache - WEIRD. We haven't seen any issues with it and it's very highly tested and robust so I'm surprised by this...

In terms of the value for the replies, we should probably attempt to generally align it with the SpeechT5 model configuration max_length. I'd also like to run this by @stintel to make sure there aren't any issues that I may not be considering.

nikito commented 1 year ago

Entirely possible in my tinkering/building/rebuilding I may have messed something up 😆 I also was running as root originally because I was lazy, and recently went ahead and added my normal user to dialout and docker group so I could do the commands without sudo/root. Maybe that caused these permissions issues I was seeing.

kristiankielhofner commented 1 year ago

We should probably add a purge-cache arg to utils or something that just blows away the cache (for a variety of reasons).

stintel commented 1 year ago

In terms of the value for the replies, we should probably attempt to generally align it with the SpeechT5 model configuration max_length. I'd also like to run this by @stintel to make sure there aren't any issues that I may not be considering.

I chose 64 solely on the responses I got testing intents locally. None of them came even close to 63 characters so I figured 64 would be enough. Guess that wasn't a good decision. As for aligning with SpeechT5 ... Maybe we should change it to a pointer and use calloc instead?

stintel commented 1 year ago

Can you try https://github.com/toverainc/willow/tree/issue/159?

nikito commented 1 year ago

Can you try https://github.com/toverainc/willow/tree/issue/159? Will do, will report back!

nikito commented 1 year ago

So when I went to build after configure, I get this error: image EDIT: used screenshot as the page formatting was making it unreadable 🤣

stintel commented 1 year ago

I doubt that's related to the issue/159 branch. Try removing the build directory. If that doesn't work, ./utils.sh destroy and start from scratch.

nikito commented 1 year ago

Yeah just tried that thinking the same thing and it worked. Will continue and let you know how it goes, thanks!

nikito commented 1 year ago

Tested with my longest output intent and it worked just fine! 😃

stintel commented 1 year ago

Tested with my longest output intent and it worked just fine!

How does the display handle that long text?

nikito commented 1 year ago

Ran a few tests, it seems fine except that it truncates the last character of the string. So for instance if the last word is "detection" the display showed "detectio". Likewise if the last part is "degrees." I get "degrees" (the period is truncated). Otherwise no other anomalies that I noticed. 😊 Edited as auto correct corrected my truncated text 😂

On Wed, Jun 14, 2023, 7:36 PM Stijn Tintel @.***> wrote:

Tested with my longest output intent and it worked just fine!

How does the display handle that long text?

— Reply to this email directly, view it on GitHub https://github.com/toverainc/willow/issues/159#issuecomment-1592126116, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFTDKNGOK2SWNM3NHPI7D3XLJDHDANCNFSM6AAAAAAZAQOBMA . You are receiving this because you authored the thread.Message ID: @.***>

kristiankielhofner commented 1 year ago

This is great feedback (it doesn't crash - yeah)!

Generally when it comes to the display it's more-or-less lowest priority (not only in terms of task priority but also functionality). The general use case is assumed to be far-field voice where text on the very small display can't be read from distances of more than a few feet away anyway. Not to mention the relatively slow rate of horizontal scroll, etc. Speaking for myself personally I question the use-case of sitting in front of the device and waiting for the text to scroll from the output of a command - perhaps other than users with hearing issues, which is mostly why it is there in the first place (plus it's kind of cool).

nikito commented 1 year ago

Agree, I actually had to keep poking the screen to wake it up to see all the text, especially on the paragraph-long weather output 🤣

stintel commented 1 year ago

Ran a few tests, it seems fine except that it truncates the last character of the string.

Classic off-by-one bug!