jishi / node-sonos-http-api

An HTTP API bridge for Sonos easing automation. Hostable on any node.js capable device, like a raspberry pi or similar.
http://jishi.github.io/node-sonos-http-api/
MIT License

SSML support for TTS (Amazon Polly) #517

Closed AdmPicard closed 3 years ago

AdmPicard commented 7 years ago

Hi,

I am really enjoying your API in conjunction with Alexa and other Smart Home devices and regularly use it for notifications and small announcements as well. While I am using Amazon Polly Vicki as a nice and fluent German voice, I am missing (or did not find) the ability to use SSML syntax (or even lexicons).

Since the pronunciation of some words could be better, and to control breaks and similar things more precisely, SSML would be a great help. One could also use the new whispered effect, although it sounds a little creepy. Is this even possible right now, or would it be possible to implement this with a reasonable amount of effort?

Thank you very much and have a nice day! Daniel

jishi commented 7 years ago

Hi! Yes you are right that SSML is not supported. I tried keeping it simple, and also, for dynamic content, it would be hard to produce any improvements with SSML unless the source also produces SSML.

Lexicons however, are context based, and would be a nice addition when producing dynamic phrases. However, Lexicons must also be "created" or provided, don't remember fully, which also introduces more complexity.

It also sounds like you are trying to produce "static" clips using the TTS system. If that is the case, you could also consider creating your clips manually, and then use the clip/clipall command to play them.

Implementing lexicons and SSML support is a bit tricky today because of the GET only schema, which makes it hard to allow complex text and extra parameters. I have some ideas on how to introduce POST commands (or query parameters) as well, which would add greater flexibility and make it easier to adapt the parameters based on which TTS provider you are currently using.
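To illustrate why a GET-only schema is awkward for complex text, here is a minimal Python sketch that packs a phrase into the URL path. The base URL (`http://localhost:5005`) and the helper name `say_url` are assumptions made for illustration, and the SSML string only demonstrates the encoding problem: as stated above, the API does not interpret SSML.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:5005"  # assumption: host and port where the API is running


def say_url(room: str, text: str, base_url: str = BASE_URL) -> str:
    """Build a GET URL for the say command (hypothetical helper)."""
    # safe="" also encodes "/", which would otherwise split the text
    # into additional path segments.
    return f"{base_url}/{quote(room, safe='')}/say/{quote(text, safe='')}"


plain = say_url("Bathroom", "hello world")
ssml = say_url("Bathroom", '<speak>Hello <break time="500ms"/> world</speak>')

print(plain)
print(ssml)
```

Every angle bracket, quote and slash in the markup has to be percent-encoded, and any additional parameter (voice, lexicon name) would need yet another positional path segment, which is the limitation a POST body or query parameters would avoid.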

AdmPicard commented 7 years ago

Thanks for your quick response! Keeping it simple is a good point.

I am mixing static clips and dynamically generated TTS snippets. For instance, I play a sound clip followed by a TTS message such as the subject of an e-mail. It is already a great help that the API waits for the clip/say command to finish, which makes it easy to concatenate different elements.

For example I may generate this TTS phrase:

the software <string> contains <int> bugs and <int> usability issues

Now the German voice is not able to pronounce "bug" correctly, which is why I have to adjust it via SSML phonemes or with a lexicon, which I have uploaded in the Polly console. As an alternative, I could split a sentence like this one into multiple parts (dynamic TTS + static TTS clip for "bugs" + dynamic remaining part) and play them one after the other. I might give this a try, or could multiple clip/say commands cause trouble?

I already noticed that resuming the previously active stream does not work after consecutive clip/say commands (it remains in the STOPPED state). But restarting it manually with a simple PLAY is not an issue at all.
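The splitting idea could be sketched in Python roughly as follows. This is only an illustration under assumptions: the base URL, the room name `Office`, the clip file name `bugs.mp3` and both function names are hypothetical, and the URL layout simply follows the say and clip commands mentioned in this thread.

```python
from typing import List
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:5005"  # assumption: host and port where the API is running
ROOM = "Office"                     # hypothetical room name
BUG_CLIP = "bugs.mp3"               # hypothetical pre-recorded clip for the word "bugs"


def build_requests(software: str, bugs: int, issues: int) -> List[str]:
    """Split one phrase into dynamic say parts around a static clip."""
    segments = [
        ("say", f"the software {software} contains {bugs}"),
        ("clip", BUG_CLIP),
        ("say", f"and {issues} usability issues"),
    ]
    return [
        f"{BASE_URL}/{quote(ROOM, safe='')}/{action}/{quote(value, safe='')}"
        for action, value in segments
    ]


def send_in_order(urls: List[str]) -> None:
    """Issue the requests one after another, waiting for each response."""
    for url in urls:
        with urlopen(url) as response:
            response.read()


for url in build_requests("Foo Bar", 3, 5):
    print(url)
```

Calling `send_in_order(build_requests(...))` relies on the behaviour described above, namely that the API waits for each clip/say command to finish before the response returns.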

Looking forward to the possible POST progress and thanks for your help and work!

AdmPicard commented 7 years ago

Hi again,

After more experiments on how to structure notification texts and sounds, I am experiencing growing lags / gaps between consecutive TTS snippets - the longer the message, the longer the gap (quite naturally). Do you think that some kind of prestaging of TTS files would be possible?

Something like passing a TTS phrase to the API and receiving the file name or its hash as a response (after the download has finished), so that I can play it via clip/clipall afterwards? In this case I would be able to generate all TTS files before I start to play them, which would guarantee gapless playback.

I know that the issue is very specific, but I just want to bring it up for discussion. Thank you!

JsChiSurf commented 7 years ago

Not to speak on behalf of jishi, but each TTS request you run is cached in the /static/tts folder under a hash that, I'm assuming, is calculated from the text that gets converted to the corresponding wav file.

In this manner, if the same text is requested in the future, it uses the cached file rather than generating a new file to say the same thing. I have a handful of things that I announce quite frequently, such as 'the garage door is now open', 'the garage door is now closed', etc. As such, for any repetitive announcements, when I brought the server up, I manually ran each of those TTS requests so as to cache them on the server for future use. For any new, random TTS phrases (like weather, quotes of the day, etc.) that are always different, I run a nightly script that cleans out of the /static/tts folder any wav files I know will never be used again (i.e., they are not in my known list of frequently used cached files).
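A nightly cleanup of this kind could look roughly like the Python sketch below. It is not the script referred to above: the function name, the list of file suffixes and the dry-run default are assumptions, and the keep list is expected to contain the cached file names you want to preserve.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def prune_tts_cache(
    cache_dir: Path,
    keep: Iterable[str],
    suffixes: Tuple[str, ...] = (".wav", ".mp3"),  # assumption: audio file types in the cache
    dry_run: bool = True,
) -> List[Path]:
    """Delete cached audio files whose names are not in the keep list.

    Returns the files that were removed (or would be removed in a dry run).
    """
    keep_names = set(keep)
    removed = []
    for path in sorted(cache_dir.iterdir()):
        if not path.is_file() or path.suffix not in suffixes:
            continue  # leave subfolders and non-audio files alone
        if path.name in keep_names:
            continue  # frequently used announcement, keep it cached
        if not dry_run:
            path.unlink()
        removed.append(path)
    return removed
```

Running it first with `dry_run=True` and inspecting the returned list is a cheap way to check the keep list before anything is deleted; a cron entry could then call it with `dry_run=False`.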

I'm thinking you could pre-cache (even if the text is always different), by running your say/sayall commands, but with a volume of 0 (/Bathroom/say/hello world/0). This way, they are NOT heard, but are cached, so that the next time you actually want them to run and be heard, they will be cached and run quicker, rather than having to wait for the conversion to occur.

This is certainly a workaround, but something to consider.
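The volume-0 workaround can be sketched in Python as a two-pass routine. The base URL, the function names and the injectable `fetch` parameter are assumptions for illustration; the URL shape follows the `/Bathroom/say/hello world/0` example above. The sketch also assumes the cache is keyed by the text only, so a phrase cached via one room is reused when played in another.

```python
from typing import Callable, Iterable, Optional
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:5005"  # assumption: host and port where the API is running


def say_url(room: str, text: str, volume: Optional[int] = None) -> str:
    """Build a say URL, optionally with a trailing volume segment."""
    url = f"{BASE_URL}/{quote(room, safe='')}/say/{quote(text, safe='')}"
    return url if volume is None else f"{url}/{volume}"


def http_get(url: str) -> None:
    """Send the request and wait until the API has answered."""
    with urlopen(url) as response:
        response.read()


def announce(
    phrases: Iterable[str],
    cache_room: str,
    play_room: str,
    volume: int,
    fetch: Callable[[str], None] = http_get,
) -> None:
    phrases = list(phrases)
    # Pass 1: say everything at volume 0 so the audio files get generated and cached.
    for phrase in phrases:
        fetch(say_url(cache_room, phrase, 0))
    # Pass 2: say everything again audibly; the files should now come from the cache.
    for phrase in phrases:
        fetch(say_url(play_room, phrase, volume))
```

For example, `announce(["hello world"], "Bathroom", "Kitchen", 40)` would first run the phrase silently in the Bathroom and then play it in the Kitchen at volume 40. Note that the silent pass still occupies the caching speaker for the full duration of each phrase.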

AdmPicard commented 7 years ago

@JsChiSurf Nice idea!

> I'm thinking you could pre-cache (even if the text is always different), by running your say/sayall commands, but with a volume of 0 (/Bathroom/say/hello world/0).

This would solve the problem of waiting for new TTS requests to finish. I had also thought about this before, but just didn't think of setting the volume to zero to generate the audio files without actually playing them. I could use a less frequently used speaker for caching the files, because of course it will interrupt any other playback - unfortunately I cannot generate the snippets during the night, as I need them for notifications, which should be on time.

I will definitely give it a try, as this can be implemented fairly easily! I just have to play my clips twice - silently for caching them and aloud for fluent playback. Thank you for pointing this out!

Update

I just tried it now and it works fine as expected. No delays and fluent playback. Just the additional delay due to fully playing the announcement silently before the actual output and the blocked speaker are not ideal. But right now I am fine with this workaround!

jishi commented 7 years ago

I'm not really following what issue you guys are trying to mitigate by caching the responses. Is there some silent delay while generating/downloading the clip? I thought I waited for the download to finish before I pause current playback. That might be a bug, if that's the case.

AdmPicard commented 7 years ago

> Is there some silent delay while generating/downloading the clip? I thought I waited for the download to finish before I pause current playback. That might be a bug, if that's the case.

No, there isn't silence before playing the file - it is waiting for the download to finish. There just can be gaps when playing several TTS files right after another or after a clip command.

About my situation: I am using a Raspberry Pi to regularly execute a Python script, which calls your API to issue say and clip commands. My TTS phrases can now get a little longer (10 - 40 seconds; a kind of short briefing). If I issued a clip command first and a say command afterwards, I regularly experienced delays before the TTS file was played, ranging from one second to several seconds.

So I heard my notification sound, but the TTS started with a notable delay. If music was running before, I heard my clip, music starts to play again for a short duration and then the TTS starts. The delay probably depends on the length of the TTS clip.

With the approach of @JsChiSurf I was able to circumvent this issue. Now even longer TTS clips play directly after my clip command, as they are already present locally. But as this workaround blocks one of my speakers prior to the actual notification, a regular way of prestaging TTS files would be nice: for instance, a command that downloads a TTS snippet without playing it. In that case it would be directly available for the next regular say command.

JsChiSurf commented 7 years ago

@AdmPicard glad this is working for you as well.

@jishi We are both using this approach to address the same "issue" as described above. It is certainly a workaround that works, but a more elegant / official option would be great too.

jishi commented 7 years ago

Aha, so it's the combination of clip + say that causes this. I understand the problem now. I'll see if I can come up with some way of mitigating this.