watson-developer-cloud / swift-sdk

:iphone: The Watson Swift SDK enables developers to quickly add Watson Cognitive Computing services to their Swift applications.
https://watson-developer-cloud.github.io/swift-sdk/
Apache License 2.0

[speech-to-text] Add support for continuous listening #147

Closed glennrfisher closed 8 years ago

glennrfisher commented 8 years ago

Finish implementing and testing the SpeechToText.startListening function.

mitulgolakiya commented 8 years ago

+1

dladouceur commented 8 years ago

Hi, this is Dave Ladouceur from Neon Mobile. We were using your old SDK, have moved to this one, and need continuous listening for our production release this month. Is there any way we can help? We have Siri-like capability for small and medium businesses and need continuous listening for a similar user experience.

Do you have a date in mind for this issue? We are developing a Cordova plugin and need continuous listening with progress callbacks and final data, as per the existing APIs.

We would need the same for Android, and we are developing our own for Windows Mobile, since I do not believe that is on your roadmap?

rfdickerson commented 8 years ago

Hi @dladouceur, we have continuous listening on our roadmap for beta release 0.2.0. The roadblock we are facing is that the Ogg library will not successfully packetize our compressed PCM audio. The following function call always returns nil:

 let newData = ogg.writePacket(compressed,
                    frameSize: Int32(WATSON_AUDIO_FRAME_SIZE))

Using the compressed PCM + Ogg format has reduced audio sizes to roughly one fifth of the original, which is helpful for connections without WiFi.

Our plan for getting continuous listening working, while we figure out the issue with the Ogg packets, is to simply stream the captured PCM data through the WebSocket. We will use the callback delegate you have found for returning progress.
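To make that interim plan concrete, here is a minimal sketch of streaming raw PCM chunks over the WebSocket with the documented start/stop actions. It uses URLSessionWebSocketTask rather than the SDK's own networking layer, and the endpoint URL, sample rate, and helper names are illustrative assumptions, not the SDK's implementation:

    import Foundation

    // Open a WebSocket to the Speech to Text recognize endpoint (URL is illustrative).
    let url = URL(string: "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize")!
    let socket = URLSession.shared.webSocketTask(with: url)
    socket.resume()

    // Tell the service to start a recognition request for raw PCM audio.
    let start = "{\"action\": \"start\", \"content-type\": \"audio/l16;rate=16000\", \"interim_results\": true}"
    socket.send(.string(start)) { error in
        if let error = error { print("start failed: \(error)") }
    }

    // Stream each captured PCM buffer as a binary WebSocket message.
    func send(pcmChunk: Data) {
        socket.send(.data(pcmChunk)) { error in
            if let error = error { print("chunk failed: \(error)") }
        }
    }

    // When capture ends, ask the service to finalize the transcription.
    func stopListening() {
        socket.send(.string("{\"action\": \"stop\"}")) { _ in }
    }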

dsato80 commented 8 years ago

+1

I also want to use continuous listening and would like to request some additional functionality. Is an issue the right place to put my requests?

glennrfisher commented 8 years ago

@dsato80: Yup! We have good visibility across our team for Github issues. If you'd like to request any additional functionality, please feel free to open an issue. We'll start a conversation and add it to our backlog. Looking forward to hearing your thoughts!

Feel free to submit pull requests, as well, if you'd like to dive into the code.

glennrfisher commented 8 years ago

Copying @dsato80's excellent bug note from #146 for future reference here:

I found a bug in https://github.com/watson-developer-cloud/ios-sdk/blob/master/WatsonDeveloperCloud/SpeechToText/SpeechToText.swift#L126-L140

if var audioState = audioState { — this statement copies the self.audioState struct value into a temporary audioState variable. It causes an EXC_BAD_ACCESS error because the temporary variable is released when startListening returns.
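
To illustrate the value-type behavior described in that note, here is a simplified sketch; the AudioState struct and Recorder class are hypothetical stand-ins, not the SDK's actual types:

    struct AudioState {
        var isRecording = false
    }

    final class Recorder {
        var audioState: AudioState?

        func startListening() {
            // `if var` binds a *copy* of the struct; mutations apply only to the
            // local copy, which goes away when startListening returns.
            if var audioState = audioState {
                audioState.isRecording = true
                // One fix: write the mutated copy back to the stored property
                // (or mutate self.audioState directly).
                self.audioState = audioState
            }
        }
    }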

dsato80 commented 8 years ago

It would be very helpful if I could prepare the WebSocket connection before startListening, for example with SpeechToText.connect(completionHandler). Currently the WebSocket connection takes a few seconds, which is critical for a speech dialog system built on STT.

It would also be helpful to be able to send a no-op message to prevent a service timeout. https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/speech-to-text/using.shtml#timeouts

When using the WebSocket protocol, you can send a message with no-op as the action to keep the session alive.
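
As a sketch of the kind of API being requested here: the connect(completionHandler:) method and keep-alive timer below are hypothetical (they do not exist in the SDK today), and the endpoint URL and interval are illustrative:

    import Foundation

    final class SpeechToTextSession {
        private var socket: URLSessionWebSocketTask?
        private var keepAliveTimer: Timer?

        // Open the WebSocket ahead of time so startListening does not pay the
        // multi-second connection cost.
        func connect(completionHandler: @escaping () -> Void) {
            let url = URL(string: "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize")!
            socket = URLSession.shared.webSocketTask(with: url)
            socket?.resume()
            completionHandler()
        }

        // Periodically send the documented no-op action to keep the session alive.
        func startKeepAlive(interval: TimeInterval = 20) {
            keepAliveTimer = Timer.scheduledTimer(withTimeInterval: interval, repeats: true) { [weak self] _ in
                self?.socket?.send(.string("{\"action\": \"no-op\"}")) { _ in }
            }
        }
    }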

dladouceur commented 8 years ago

This example requires a client "stop", and there is no way that we know of to easily detect silence on iOS. So how can I know when to send "stop"? Please advise how to get the final packet without sending "stop" or waiting 30 seconds.

In the third example, the client again sends audio that contains the string Name the Mayflower. As with the first example, the client sends the audio in two chunks in PCM format. This time, the client asks the server to send interim results for the transcription.

Client>> {"action": "start", "content-type": "audio/l16;rate=22050", "interim_results": true}
Server<< {"state":"listening"}
Client>> <audio data chunk>
Server<< {"results": [{"alternatives": [{"transcript": "name "}],"final": false}],"result_index": 0}
Server<< {"results": [{"alternatives": [{"transcript": "name may "}],"final": false}],"result_index": 0}
Client>> <audio data chunk>
Client>> {"action": "stop"}    <--------------------------
Server<< {"results": [{"alternatives": [{"transcript": "name may flour "}],"final": false}],"result_index": 0}
Server<< {"results": [{"alternatives": [{"transcript": "name the mayflower "}],"final": true}],"result_index": 0}
Server<< {"state":"listening"}

dsato80 commented 8 years ago

I think this is just an example of the STT API. You can send "stop" when you get the final result; the service detects the end of the phrase. We would need to dig into the raw audio data to detect silence on iOS. #152

By default, the service stops transcription at the first pause, which is denoted by a half-second of non-speech (typically silence), or when the stream terminates. This is referred to as an end of speech (EOS) incident.

Client>> {"action": "start", "content-type": "audio/l16;rate=22050", "interim_results": true} Server<< {"state":"listening"} Client>> Server<< {"results": [{"alternatives": [{"transcript": "name "}],"final": false}],"result_index": 0} Server<< {"results": [{"alternatives": [{"transcript": "name may "}],"final": false}],"result_index": 0} Client>> Server<< {"results": [{"alternatives": [{"transcript": "name may flour "}],"final": false}],"result_index": 0} Server<< {"results": [{"alternatives": [{"transcript": "name the mayflower "}],"final": true}],"result_index": 0} Client>> {"action": "stop"} <-------------------------- Server<< {"state":"listening"}

dsato80 commented 8 years ago

Hi @mitulgolakiya,

The data buffer is allocated by the system and might be freed after being added to the queue, so I think you need to copy the buffer data before queuing it.

self.stt.appendData(data.copy() as! NSData)

The code includes your credentials, so I recommend changing them.
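
To illustrate the buffer-copy point, here is a rough sketch of a capture callback that copies the system-owned buffer before queuing it; the AudioAppending protocol and CaptureHandler class are hypothetical stand-ins for the objects in the snippet above:

    import Foundation

    // Stand-in for the SpeechToText object that accepts streamed audio data.
    protocol AudioAppending {
        func appendData(_ data: NSData)
    }

    final class CaptureHandler {
        let stt: AudioAppending
        init(stt: AudioAppending) { self.stt = stt }

        // Called with a system-owned buffer that may be reused or freed
        // after this callback returns.
        func didCaptureAudio(_ data: NSData) {
            // Hand a copy to the streaming queue rather than the original reference.
            self.stt.appendData(data.copy() as! NSData)
        }
    }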

mitulgolakiya commented 8 years ago

@dsato80 This solution is working, and now we are getting the final flag as true with properly recognized text. Thanks.

glennrfisher commented 8 years ago

We've been asked to redesign the API for SpeechToText and will be looking at it next week. We will probably write the service from scratch, but will draw from existing code and tests in the process.

We expect this to be a breaking change that will require client code to be modified. We'd love to get all of your feedback, though, after we push the updates!

glennrfisher commented 8 years ago

Closed in favor of #168.