Add the ability to receive the Word timing metadata of a TTS stream

chubbard commented 9 years ago

We currently use TTS for our product, but the quality of this is much better than anything else out there. We'd like to use this, but our product requires that we know the timings of when words (or phonemes) are said. One of the TTS products we currently use gives us this data and we use it. It'd be nice if this would export that metadata along with the audio file so we could use this service.

We currently parse this from the metadata in ID3v2 tags, where it is stored in Comment sections. We currently get something like this:

timed_phonemes word start_in_ms end_in_ms amplitude word ... ...

Not married to the format, but just to illustrate what type of information we'd need.
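As a rough illustration of the kind of data described above, here is a minimal Python sketch that parses such a string. It assumes the fields are whitespace-separated, that each word is a single token, and that amplitude is a decimal number; the names `WordTiming` and `parse_timed_phonemes` are hypothetical and not part of any existing service.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    word: str
    start_ms: int
    end_ms: int
    amplitude: float


def parse_timed_phonemes(comment: str) -> list[WordTiming]:
    """Parse 'timed_phonemes word start end amplitude word ...' into records."""
    tokens = comment.split()
    if not tokens or tokens[0] != "timed_phonemes":
        raise ValueError("missing 'timed_phonemes' prefix")
    fields = tokens[1:]
    # Assumption: every word contributes exactly four fields.
    if len(fields) % 4 != 0:
        raise ValueError("expected four fields per word")
    return [
        WordTiming(fields[i], int(fields[i + 1]), int(fields[i + 2]), float(fields[i + 3]))
        for i in range(0, len(fields), 4)
    ]
```

For example, `parse_timed_phonemes("timed_phonemes hello 0 250 0.8")` yields one record for "hello" spanning 0 to 250 ms.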

daniel-bolanos commented 9 years ago

Hello chubbard. This is very useful feedback, I'm glad you like the quality of our TTS service. Providing the timing information for the words should not be difficult technically. I'm going to check with the person who manages the TTS effort and will get back to you.

daniel-bolanos commented 9 years ago

Hi Chubbard, I asked about the timing information for TTS and it is technically doable. Actually there are plans to implement this feature in the future. However I cannot give you a specific release date.

chubbard commented 9 years ago

You've already done the hard part, so I knew this would not be hard. :-) This would be a requirement for us before we accepted Watson TTS as our TTS, but I'm encouraged by how fast you got back to us. I'll be waiting to hear when it'll be ready. Thank you.

jsstylos commented 9 years ago

Hi @chubbard, for your use case does the timing information have to be at the beginning of (or interspersed with) the audio, or would timing information that followed the audio be sufficient?

chubbard commented 9 years ago

Hi,

I'm not exactly sure what you mean with respect to "interspersed with audio". I'll try and describe how the solution we use works now in hopes that will answer it.

We send the phrase we want converted to audio, and the service sends us back an MP3 file. From that MP3 file we pull the timings out of a single comment tag in the ID3v2 tag attached to the file, so it's really just one embedded text string that contains the start and end times of every word spoken in the audio. Logically the data could be a completely separate file from the audio, but packaging it in a metadata comment is convenient because a single call to the service retrieves both the audio and the word timings. We do this in real time, which makes us sensitive to extra third-party service calls, so I don't want to make one call for the audio and then another for the word timings. If a single network call retrieved both some other way, I'd say that would be fine too.
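To make the "single comment tag" idea concrete, here is a hedged Python sketch that pulls comment text out of an ID3v2 tag using only the standard library. It is not the code we run: it assumes an ID3v2.3 tag at the start of the bytes passed in, no extended header, no unsynchronisation, and ISO-8859-1 comment text. The function names are made up for illustration.

```python
import struct


def synchsafe_to_int(raw: bytes) -> int:
    """Decode a synchsafe integer (7 usable bits per byte)."""
    value = 0
    for byte in raw:
        value = (value << 7) | (byte & 0x7F)
    return value


def read_id3v2_comments(data: bytes) -> list[str]:
    """Return the text of every COMM frame in an ID3v2.3 tag at the start of data."""
    if len(data) < 10 or data[:3] != b"ID3":
        return []
    if data[3] != 3:
        raise ValueError("this sketch only handles ID3v2.3")
    pos, end = 10, 10 + synchsafe_to_int(data[6:10])
    comments = []
    while pos + 10 <= end:
        frame_id = data[pos:pos + 4]
        if frame_id == b"\x00\x00\x00\x00":  # reached padding
            break
        # In ID3v2.3 the frame size is a plain 32-bit big-endian integer.
        (frame_size,) = struct.unpack(">I", data[pos + 4:pos + 8])
        body = data[pos + 10:pos + 10 + frame_size]
        pos += 10 + frame_size
        if frame_id != b"COMM" or len(body) < 4 or body[0] != 0:
            continue  # skip other frames and non-ISO-8859-1 comments
        # Layout: encoding (1 byte), language (3 bytes), description, NUL, text.
        _description, _, text = body[4:].partition(b"\x00")
        comments.append(text.decode("latin-1"))
    return comments
```

The returned string is what we would then split into per-word timings; a production reader would more likely use an existing ID3 library than hand-rolled parsing like this.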

Technically the ID3v2 tag is always at the end of the MP3 file it's embedded in, but I wouldn't say that is interspersed with the audio. Again, that depends on your definition of interspersed: embedded within the file once (at the beginning, the end, etc.) or multiple times alongside each audio frame.

Does that answer your question?

jsstylos commented 9 years ago

Mostly I was curious whether the metadata needed to precede the audio in your use case or not. I was under the impression that most ID3v2 tags were at the beginning of MP3 files, but if you're OK with getting the information at the end of the file, then that answers my question and gives us more flexibility in API design. Thanks.

chubbard commented 9 years ago

My team is very excited to try this out in our application. Anything you can share with me would be greatly appreciated.

daniel-bolanos commented 9 years ago

Hi @chubbard, unfortunately I have no news regarding this feature. The current status is that there are plans to implement this feature in the future. However I cannot give you a specific release date. In case you can answer this question, would your application produce a significant amount of TTS traffic if we had that feature?

chubbard commented 9 years ago

I'm not sure how to answer this. From my perspective it is quite a significant amount of TTS traffic. We are an education product with close to 90,000 problems and 1500 instructional slide shows that we would use TTS on. I'm not sure I can quantify it any more than that, but we already spend several thousand dollars a year on TTS. We are a startup, so we haven't branched out to multiple subjects yet, but we'd expect to use it for those as well. We are also English-only right now, but we have plans to branch out to other languages, so we expect our usage to grow.

I sure hope that is significant in your book because we really want to use your TTS solution.

kbrice commented 9 years ago

Daniel, I work with @chubbard (I'm the CEO of the company), and we're nearing a decision point on selecting a technology for our TTS solution. Candidly, we're very positive about your technology, and would like very much to find a way to implement it. Our option is to move to recorded voice. Is there someone within your organization I can speak with to determine how we might plan some of these features that you've already identified as on your roadmap? Thx

daniel-bolanos commented 9 years ago

Hello @kbrice, yes. Our management is already aware of your needs; I have kept them up to date, most recently a couple of days ago. I think it is a great idea that you talk to them directly. I will reach out to them so you can initiate the conversations. Thank you for your interest in the TTS service.

kbrice commented 9 years ago

Sounds great--thanks for your help, Daniel!

daniel-bolanos commented 9 years ago

Hello @kbrice, please send me an email to dbolano@us.ibm.com, I'm setting up a call with our Project Manager so we can discuss this asap.

thank you

germanattanasio commented 8 years ago

@daniel-bolanos should we close this?

kbrice commented 8 years ago

Guys, I think the timing on the technology and our request to them is well understood. The question still on the table is that of marketing. Who can connect me with someone in the marketing group for Watson to discuss GPA being a case study for the TTS, if you will?

Thanks, KB

Kevin Brice President and CEO GPA Learn, home of LoveMath +1.404.428.2274

daniel-bolanos commented 8 years ago

Hi @kbrice, let me forward that to Aadhar, our project manager.

Dani

kbrice commented 8 years ago

Thanks much. Hope all is well. KB

rawalbaig commented 8 years ago

Hi Daniel!

Great service :) I'd like to know whether there is any way to control the speed of the voice and its expressions with SSML, for example: 1) speaking a line fast, slow, medium, etc.; 2) expressions like Joy, Excited, Hopeless, Tense, Meaningful, etc.

Thanks Rawal Baig

jzhang300 commented 8 years ago

@rawalbaig There is a Web Audio API on the client side if you want to manipulate audio data: http://www.html5rocks.com/en/tutorials/webaudio/intro/.

nfriedly commented 8 years ago

Hey, I know I'm late to the party, but here's a completely ridiculous idea that might just work, at least as a stopgap measure: the Watson Speech to Text service can output word timings, so why not run the audio from TTS through STT, grab the word timings, and then bundle it all together?

I'll try and set up a quick demo to see how well it works.
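The "grab the word timings" step could look roughly like the Python sketch below. It makes no network calls. It assumes a recognition response shaped like `{"results": [{"alternatives": [{"timestamps": [[word, start_s, end_s], ...]}]}]}`, which is modelled on what Speech to Text returns when timestamps are requested; treat that shape, and the function name `word_timings_from_stt`, as assumptions rather than a documented contract.

```python
def word_timings_from_stt(response: dict) -> list[dict]:
    """Flatten an STT-style response into word timings in milliseconds."""
    timings = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # Assumption: the first alternative is the best transcript and its
        # timestamps are [word, start_seconds, end_seconds] triples.
        for word, start_s, end_s in alternatives[0].get("timestamps", []):
            timings.append({
                "word": word,
                "start_ms": round(start_s * 1000),
                "end_ms": round(end_s * 1000),
            })
    return timings
```

The resulting list can then be bundled with the TTS audio in whatever container suits the client. Note that the words come from recognition, so they may not match the original input text exactly.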

daniel-bolanos commented 8 years ago

Hello @rawalbaig , yes, we just released an expressive voice for TTS, it is great. Please check the TTS demo:

https://text-to-speech-demo.mybluemix.net/

This is the kind of markup that you can pass to the voice:

<speak>
I have been assigned to handle your order status request.
<express-as type="Apology">I am sorry to inform you that the items you requested are back-ordered. We apologize for the inconvenience.</express-as>
<express-as type="Uncertainty">We don't know when those items will become available. Maybe next week but we are not sure at this time.</express-as>
<express-as type="GoodNews">Because we want you to be a happy customer, management has decided to give you a 50% discount!</express-as>
</speak>
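If you build this markup programmatically, a small helper along these lines may be useful. This is only a sketch: the helper names `express_as` and `speak` are hypothetical, and the set of expression types is limited to the three that appear in this thread.

```python
from xml.sax.saxutils import escape, quoteattr

# Expression types mentioned in this thread; the service may support others.
EXPRESSIONS = {"Apology", "Uncertainty", "GoodNews"}


def express_as(text: str, expression: str) -> str:
    """Wrap text in an <express-as> element, escaping XML special characters."""
    if expression not in EXPRESSIONS:
        raise ValueError(f"unknown expression type: {expression}")
    return f"<express-as type={quoteattr(expression)}>{escape(text)}</express-as>"


def speak(*parts: str) -> str:
    """Join already-escaped parts into a single <speak> document."""
    return "<speak>" + "".join(parts) + "</speak>"
```

Plain text outside an `<express-as>` element should be passed through `escape` before it is handed to `speak`, so that characters such as `&` do not break the markup.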

rawalbaig commented 8 years ago

@jzhang300 Many thanks for your kind information.

@daniel-bolanos Thanks :) Actually, I have checked it already. It currently supports only three expressions: Apology, Uncertainty and GoodNews. When will more expressions be available?

But there is no option to control the voice speed :(

jzhang300 commented 8 years ago

@rawalbaig Take a look at this: https://developer.mozilla.org/en-US/docs/Web/API/AudioBufferSourceNode/playbackRate

Apparently the AudioBufferSourceNode interface in the Web Audio API has a playbackRate property that can be controlled. I think that's what you're looking for :)

nfriedly commented 8 years ago

BTW, I now have a basic demo that provides both audio and word timings by combining TTS and STT:

Demo: http://watson-tts-timing.mybluemix.net/
Code: https://github.com/nfriedly/tts-timing

I went back and forth a bit about whether to process on the server (better quality, allows the audio to be captured and saved) or on the client (simpler, can leverage existing code for HTML output), and ended up implementing both. The server-side is a little bit hacky but works as long as only one person uses it at a time :p

@chubbard @kbrice Please try it out and let me know if something like that would suffice until we get it implemented properly in the TTS service.

kbrice commented 8 years ago

Thanks, Nathan. We'll check it out and get back to the group. KB

kbrice commented 8 years ago

Daniel, can we grab just a few minutes for a quick phone call? I'm trying to understand where your Watson technology is presently. I like the emphasis capability you have. Is that available in male voices as well?

I would also like to reconnect since our conversation with Aadhar. We discussed possibly some marketing motion. Thanks! KB

watson-developer-cloud / text-to-speech-nodejs

Add the ability to receive the Word timing metadata of a TTS stream #10