morganney / tts-react

Convert text to speech using React.
https://morganney.github.io/tts-react/
MIT License

How to combine with Microsoft Azure speech API? #46

Closed 55Cancri closed 11 months ago

55Cancri commented 1 year ago

The Microsoft AI voices are the best in the industry. They sound natural and have better cadence than the native Web Speech API. However, I am not able to highlight the currently playing word with microsoft-cognitiveservices-speech-sdk. How can I combine microsoft-cognitiveservices-speech-sdk with your package?

morganney commented 1 year ago

Thanks for your question. This package is meant to be used on the web, in a browser, backed by native browser APIs. The microsoft-cognitiveservices-speech-sdk clearly does not fall into this category.

Can you provide some more detail on how exactly you would want to merge this package with an external service and custom API? From the outset, I can tell it would require a significant chunk of work to adapt the custom MS API to the native browser Web Speech API. There is also the security issue of handling tokens/keys that you probably don't want to expose in the browser, so a backend would be required (this is most likely a deal breaker for integration with this package).

Check out the Web SDK example: https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/js/browser

naeem-hassan commented 1 year ago

> The Microsoft AI voices are the best in the industry. They sound natural and have better cadence than the native Web Speech API. However, I am not able to highlight the currently playing word with microsoft-cognitiveservices-speech-sdk. How can I combine microsoft-cognitiveservices-speech-sdk with your package?

@morganney

I'm also looking for this. I have a backend server where I generate the voice with microsoft-cognitiveservices-speech-sdk: I save the speech file on the server, upload it to S3, and send the resulting link back to the client. What I want is for each word to be highlighted as the audio plays. I've looked at the Storybook, but I didn't find it very informative, and a couple of things mentioned there confused me. I've also been looking into AWS Polly, since the Storybook mentions you can use it to get the audio data, but I couldn't find anything helpful there either. Hope you got the point.

morganney commented 1 year ago

@naeem-hassan If the backend using microsoft-cognitiveservices-speech-sdk can return data matching the TTSAudioData interface, then you should be able to use the fetchAudioData prop to get what you want. To highlight the spoken word, the backend needs to return marks in the same format used by AWS Polly Speech Marks. There is an example story; check the source code in the repo.
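A rough sketch of what that could look like on the client, assuming a hypothetical backend endpoint (`/api/tts`) that runs the Microsoft SDK server-side and responds with audio plus Polly-style marks. The `TTSAudioData` and `PollySpeechMark` shapes below are approximations; check the types exported by tts-react for the authoritative definitions:

```typescript
// Approximate shapes of tts-react's TTSAudioData and PollySpeechMark
// (verify against the package's exported types).
interface PollySpeechMark {
  type: 'word'
  time: number   // offset into the audio, in milliseconds
  start: number  // character offset where the word starts in the text
  end: number    // character offset where the word ends
  value: string  // the word itself
}

interface TTSAudioData {
  audio: string  // URL (e.g. the S3 link) or data URL of the synthesized audio
  marks?: PollySpeechMark[]
}

// Narrow an untyped backend response into TTSAudioData.
function toTtsAudioData(json: unknown): TTSAudioData {
  const data = json as Partial<TTSAudioData>
  if (typeof data.audio !== 'string') {
    throw new Error('Backend response is missing the audio field')
  }
  return { audio: data.audio, marks: data.marks ?? [] }
}

// Sketch of a fetchAudioData implementation against the hypothetical
// /api/tts endpoint described above.
async function fetchAudioData(text: string): Promise<TTSAudioData> {
  const res = await fetch('/api/tts', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text })
  })
  if (!res.ok) throw new Error(`TTS backend error: ${res.status}`)
  return toTtsAudioData(await res.json())
}
```

You would then pass a function like this to the component's `fetchAudioData` prop so tts-react can play the audio and use the marks to highlight the words as they are spoken.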

naeem-hassan commented 1 year ago

@morganney I don't think microsoft-cognitiveservices-speech-sdk returns the marks, or is there an API or parameter to get them? Is there a free API to get the marks, since AWS Polly is pretty expensive for me right now?

morganney commented 1 year ago

A little bit of googling shows word boundaries should be supported:
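The SDK's SpeechSynthesizer fires a `wordBoundary` event during synthesis, and its event args report the audio offset in 100-nanosecond ticks. A minimal sketch, assuming that event shape, of converting those events into Polly-style speech marks on the backend (the `WordBoundaryEvent` interface below only models the fields used here):

```typescript
// Minimal shape of the SDK's word-boundary event args
// (only the fields used here; verify against the SDK's types).
interface WordBoundaryEvent {
  audioOffset: number  // offset into the audio, in 100-nanosecond ticks
  textOffset: number   // character offset of the word in the input text
  wordLength: number   // length of the word in characters
  text: string         // the word being spoken
}

interface PollySpeechMark {
  type: 'word'
  time: number
  start: number
  end: number
  value: string
}

// Convert one word-boundary event into a Polly-style speech mark.
// audioOffset is in 100 ns ticks, so divide by 10,000 to get milliseconds.
function toSpeechMark(e: WordBoundaryEvent): PollySpeechMark {
  return {
    type: 'word',
    time: Math.round(e.audioOffset / 10_000),
    start: e.textOffset,
    end: e.textOffset + e.wordLength,
    value: e.text
  }
}

// Server-side wiring sketch: collect marks while synthesizing, then
// return them alongside the audio URL.
// const marks: PollySpeechMark[] = []
// synthesizer.wordBoundary = (_sender, e) => marks.push(toSpeechMark(e))
```

With the marks collected this way, the backend response would match what tts-react expects for word highlighting via `fetchAudioData`.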