Yes, they are absolutely related. There's certainly some regex, etc. we could develop to pre-split that, but producing the output that's common in American English speech, where "8:01 AM" becomes "eight o one a m", is tricky. Additionally, we'd need to account for languages/locales that don't use AM and PM, or represent them differently. Frankly I don't know a good way to do that off the top of my head...
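Something like this rough sketch is the kind of pre-split pass I mean (assuming the third-party num2words package; English-only rules, ignoring the locale problem entirely):

```python
import re

from num2words import num2words  # assumed dependency; any number-to-words library works


def expand_time(match: re.Match) -> str:
    """Expand a clock time like '8:01 AM' into spoken American English."""
    hours, minutes, meridiem = int(match.group(1)), int(match.group(2)), match.group(3)
    spoken_hours = num2words(hours)
    if minutes == 0:
        spoken_minutes = "o'clock"
    elif minutes < 10:
        spoken_minutes = f"oh {num2words(minutes)}"  # 8:01 -> "eight oh one"
    else:
        spoken_minutes = num2words(minutes)
    spoken_meridiem = " ".join(meridiem.upper())  # "AM" -> "A M"
    return f"{spoken_hours} {spoken_minutes} {spoken_meridiem}"


TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})\s*([APap][Mm])\b")

print(TIME_RE.sub(expand_time, "The time is 8:01 AM."))
# -> "The time is eight oh one A M."
```

Even this toy version shows how fast it sprawls: dates, ordinals, 24-hour locales, and "a.m." with periods would all need their own rules.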
Yeah, agreed, it is a tricky problem. It may also just be a limitation of the TTS in use. I know when I try this with Coqui or Piper they both handle numbers and these time values better (they still struggle with AM/PM, but they appear to say "eight o' one" and such with the syntax shown). But from a linguistic approach the regex or NLP for that would definitely get complex, and would probably require its own library/module just to handle it 😮
Yeah, this certainly brings back the issue regarding alternate TTS engines. The issue there is that in the (ample) time I've spent with them, I've become increasingly convinced they're just not suitable for Willow as-is. As noted before, the dependencies are VERY messy and the performance is lackluster (we'd still have caching, so there's that).
At this point TTS is our biggest fundamental issue, and these issues just keep flowing in relative to the rest of Willow/WIS functionality. We certainly appreciate them, so keep them coming! My concern with Coqui, etc. is that we'd likely just swap one set of issues for an equivalent set of different ones. The open source models and frameworks for TTS generally seem to be very far behind their counterparts in the rest of the "model world", and none of them seem to meet our overall goal of providing a truly commercial-quality voice user interface.
So, as frustrating as it currently is, SpeechT5 is likely (in the end) still our best option, and it's clear that it will need significant pre-processing of text before it's provided to the processor itself. If you look at sentencepiece you can start to understand the fundamental challenges that all of the text models and architectures have...
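To see the problem concretely, here's a quick sketch using the Hugging Face processor (assuming the transformers package and the microsoft/speecht5_tts checkpoint); if digits are indeed absent from the tokenizer's vocabulary, as they appear to be, they round-trip as unknown tokens:

```python
from transformers import SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

text = "The time is 8:01 AM."
ids = processor.tokenizer(text)["input_ids"]

# Decoding the ids back to text shows what the model actually "sees";
# characters outside the vocabulary (the digits here) surface as unknown tokens.
print(processor.tokenizer.decode(ids))
```

That's why pre-processing has to happen before tokenization: by the time the text reaches the model, the numbers are already gone.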
If it's of any interest to you, for laughs I was able to get Willow to work with Coqui running as an independent server, and I get pretty good response times. I did have to modify the Willow code to add DEFAULT_ESP_WAV_DECODER_CONFIG() since Coqui outputs WAV, but otherwise it works perfectly. Of course this isn't utilizing the nginx cache, but it's still very performant. 😄
That's great feedback!
Generally my goal (considering the level of effort required) is not only clean dependencies and code, but an actual performance improvement. Of course, comparable performance is the absolute minimum.
VITS-based models are clearly (to me at least) the future and I've been working through Coqui and others to get VITS models working with ONNX, CUDA, and TRT (the hard one) which not only leads to easier packaging, dependency management, and cleaner distribution but higher performance as well.
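As a rough illustration of why the ONNX path is attractive, here's a minimal onnxruntime sketch (the model path, input names, and scale values are assumptions; exported VITS graphs, e.g. Piper-style exports, differ in their exact input signatures):

```python
import numpy as np
import onnxruntime as ort

# Provider order expresses preference: TensorRT first, then CUDA, then CPU fallback.
session = ort.InferenceSession(
    "vits.onnx",  # hypothetical path to an exported VITS model
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Phoneme/token ids would come from the model's own text front end;
# these input names follow Piper-style exports and may differ per export.
token_ids = np.array([[1, 14, 29, 7, 52, 3]], dtype=np.int64)
inputs = {
    "input": token_ids,
    "input_lengths": np.array([token_ids.shape[1]], dtype=np.int64),
    "scales": np.array([0.667, 1.0, 0.8], dtype=np.float32),  # noise, length, noise_w
}

audio = session.run(None, inputs)[0]  # raw float waveform
```

One runtime, one file, and the provider list handles the CPU/CUDA/TRT split instead of a pile of framework dependencies.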
In terms of the WAV decoder on the device, we're really trying to make FLAC the standard for audio decoding (ESP ADF doesn't currently support FLAC encoding), as opposed to building in additional libraries, bloating the firmware image, and doing weird parsing of Content-Type and other conditionals to determine the decoder.
Yeah, the only reason I added WAV is that the Coqui docker image API doesn't output FLAC, unfortunately. The model I'm playing with is the one we discussed before (the VITS model trained on the Jenny dataset). Agreed that if it can be exported to ONNX/CUDA/TRT it would probably be much faster and easier to implement. 😃
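(Side note: another way around the WAV-only output, without touching the firmware, would be transcoding on the server side. A rough sketch, assuming the stock Coqui TTS server on its default port 5002 plus the requests and soundfile packages:)

```python
import io

import requests
import soundfile as sf  # libsndfile handles both WAV and FLAC

# The Coqui TTS server exposes a simple GET endpoint that returns WAV audio.
resp = requests.get(
    "http://localhost:5002/api/tts",
    params={"text": "The time is eight oh one A M."},
    timeout=30,
)
resp.raise_for_status()

# Re-encode the WAV response as FLAC so the device-side FLAC decoder can stay as-is.
audio, sample_rate = sf.read(io.BytesIO(resp.content))
sf.write("speech.flac", audio, sample_rate)
```

soundfile infers the FLAC container from the file extension, so the device would never need to see WAV at all.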
Sorry for hijacking this, but how do you get Willow to speak the response? The only thing I've managed to get it to speak is "ok".
Ah, I see this only works with Home Assistant, is that correct? There's no way to play audio with the REST server?
https://github.com/toverainc/willow/pull/225
@nikito I would be interested in how Coqui would be configured for this. Maybe you have notes, or would be willing to create them?
On TTS latency I'm not very concerned, as it's the audio feedback only. STT and execution latency would, I'd guess, be unaffected.
On the current TTS, it seems some workaround is intended but may not actually be applied correctly, as I get these in the logs:
Got request for speaker CLB with text: The current time is 10:03 AM on May 31, 2024.
TTS: Text contains numbers, converting to words
TTS: Text after number substitution: ['The current time is 10:03 AM on May 31, 2024.']
So it seems the intention was to convert this, like "10" -> "ten" or "one zero", but it didn't happen...?
Yeah, SpeechT5 isn't great with numbers and such, so in the split_arch branch we make Coqui the new default, with XTTS as another option for those who want to tinker. There are no notes on it yet as it's still a development branch, but if you take a look in utils.sh there's a method added called build-xtts, and once you run that it tells you how to turn on XTTS. Otherwise, building and deploying from that branch will automatically use Coqui.
For example, "The time is 8:01 AM.", it will speak "The time is am", where am is the literal word "am" as in "I am". The time component itself is entirely skipped, I think that is related to the number handling issue we saw before mentioned in #148 .