watson-developer-cloud / swift-sdk

:iphone: The Watson Swift SDK enables developers to quickly add Watson Cognitive Computing services to their Swift applications.
https://watson-developer-cloud.github.io/swift-sdk/
Apache License 2.0

[Speech to Text] Swift Watson SDK can no longer detect end of speech #632

Closed riverbaymark closed 7 years ago

riverbaymark commented 7 years ago

So I have an application that relies on recognizing when the user stops speaking to trigger the service to convert the speech to text. It was working fine until, one day during a demo, it did not respond (I am using the WebSocket connection). I could not figure out what had happened until I found this:

"The continuous parameter has been removed from all methods that initiate recognition requests. The service now transcribes an entire audio stream until it ends or times out, whichever occurs first; this is equivalent to setting the former continuous parameter to true. By default, the service previously stopped transcription at the first half-second of non-speech (typically silence) if the parameter was omitted or set to false.

Existing applications that set the parameter to true will see no change in behavior. Applications that set the parameter to false or that relied on the default behavior will likely see a change. If a request specifies the parameter, the service now responds as it does to any unknown parameter: by returning a warning message with its response:

"warnings": [ "Unknown arguments: continuous." ] The request succeeds despite the warning, and an existing session or WebSocket connection is not affected.

IBM made this change in response to overwhelming feedback from the developer community that specifying continuous=false added little value and could reduce overall transcription accuracy."

My application relied on setting continuous to false. Is there a workaround planned to still be able to detect the end of speech, or is it going to require stepping back in time and writing boilerplate code around the microphone levels? I would argue this change added huge negative value for anyone trying to develop real-time conversation applications. It basically renders my application useless at present, and it will do the same for anyone who tries to use the service in a similar fashion. Why not just default to continuous = true?

Thanks,

Mark

glennrfisher commented 7 years ago

Hi @riverbaymark! Sorry for our silence here; much of our team has been on vacation or traveling for conferences recently. I apologize for the delay in getting back to you.

I agree that the Speech to Text team probably should have only changed the default for continuous to true, while still allowing people to specify continuous = false. I will follow up with some of the folks on the team and try to learn more. I'll link them to this issue as an example of a use case for continuous = false.

This is also an example of poor version control, since the service should not introduce breaking changes without requiring a modification to your app. Many of the other services use version dates, so your application always behaves the same as long as you do not change the version date. But Speech to Text does not use version dates, so this change affected everyone.
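For comparison, here's roughly what a version date looks like with another service in the Swift SDK. This is just a sketch; the exact module name and initializer parameters may differ between SDK releases, and the credentials and date below are placeholders:

import ConversationV1

// As long as this version date stays the same, the service keeps responding
// the way it did on that date, even if breaking changes are introduced later.
let conversation = Conversation(
    username: "your-username",
    password: "your-password",
    version: "2017-05-26"
)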

I'll let you know what I find out from the Speech to Text team, and whether we can convince them to reintroduce the continuous parameter.

In the meantime, it's possible to write a workaround in your application. There are a couple of ways to do this: using the microphone volume or using the transcription results.

Using the microphone volume, you could detect when the volume gets quieter, which may indicate an end-of-speech event. (Although if your user is in a crowded room or somewhere with a lot of background noise, that may not be the case.) You can access the microphone volume using the session management functionality of the SDK, particularly the onPowerData callback. This callback is invoked every 0.025s with the average decibel power of the microphone. This line in the speech-to-text-swift sample is an example of how to print out the decibel rating. You would need to write some code to distinguish the volume of speech from the volume of the background noise.
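Here's a rough sketch of that approach, assuming the SpeechToTextSession API from the session management documentation linked above. The threshold and timeout values are arbitrary and would need tuning for your microphone and environment:

import Foundation
import SpeechToTextV1

let session = SpeechToTextSession(username: "your-username", password: "your-password")

// Treat anything quieter than this as silence, and stop after one second of it.
// Both values are arbitrary and should be tuned for your environment.
let silenceThreshold: Float = -35.0
let silenceTimeout: TimeInterval = 1.0
var lastSpeechDetected = Date()

session.onResults = { results in print(results.bestTranscript) }
session.onPowerData = { decibels in
    if decibels > silenceThreshold {
        // Loud enough to count as speech; reset the silence clock.
        lastSpeechDetected = Date()
    } else if Date().timeIntervalSince(lastSpeechDetected) > silenceTimeout {
        // Quiet for longer than the timeout; treat it as end of speech.
        session.stopMicrophone()
        session.stopRequest()
        session.disconnect()
    }
}

// Stream microphone audio to the service.
var settings = RecognitionSettings(contentType: .opus)
settings.interimResults = true
session.connect()
session.startRequest(settings: settings)
session.startMicrophone()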

Alternatively, you could use the transcription results themselves. For example, you might assume that the user has stopped speaking if you have interimResults = true but have not received any transcription results in the last second. (This may have issues in a loud environment too, if Speech to Text starts transcribing background conversations.) This StackOverflow answer provides some details on using a timer to detect the end of speech based on transcription results.
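Here's a rough sketch of the timer-based approach. It assumes the same speechToText instance and failure callback as the example later in this thread, and the one-second timeout is arbitrary:

import Foundation
import SpeechToTextV1

var settings = RecognitionSettings(contentType: .opus)
settings.interimResults = true

// Restarted on every transcription result; if it fires, no results have
// arrived for one second, so we assume the user has stopped speaking.
var silenceTimer: Timer?

speechToText.recognizeMicrophone(settings: settings, failure: failure) { results in
    print(results.bestTranscript)
    // Schedule the timer on the main run loop so that it actually fires.
    DispatchQueue.main.async {
        silenceTimer?.invalidate()
        silenceTimer = Timer.scheduledTimer(withTimeInterval: 1.0, repeats: false) { _ in
            // One second without new results: stop streaming microphone audio.
            self.speechToText.stopRecognizeMicrophone()
        }
    }
}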

Sorry for all of the trouble with your application. I hope we can report some good news to you soon!

riverbaymark commented 7 years ago

Hi Glenn,

Thanks for your response and all your suggestions. I recently figured out how to recognize the end of speech with Apple's SFSpeechRecognizer by using final results and a resetting timer, and it sounds like I could do the same with Watson Speech to Text using your last example. It was just so easy to set continuous = false to accomplish this, as you can well see. Basically, I am using speech recognition to form the text to send to Watson Conversation. I think the IBM version may be better since you can actually create a corpus to enhance the recognition. At WWDC this year it seemed like there was a lot of improvement to Siri, but I haven't put her to a good test yet :)

Thanks again and I’m a fan of your code examples and your work with the Watson Swift SDK.

Mark


glennrfisher commented 7 years ago

Thanks, Mark! I appreciate the kind words.

I hope that using the timer and checking for final results will work for your app. In the meantime, I've followed up with the Speech to Text team and will let you know what I hear. Thanks!

glennrfisher commented 7 years ago

I received a great response from the Speech to Text team. They said that the functionality of continuous = false is completely contained in the functionality of continuous = true (which I believe is the rationale for removing the parameter: it duplicates functionality).

To detect an end of speech event, as if continuous = false, look for a result with final = true. The service sends a final = true transcription result when it detects about a half-second pause in speech. After receiving a final transcription result, you can end the audio stream and close the connection. That's what the service did internally when continuous = false.

There's one catch that I'm worried about. The bestTranscript property is constructed in the SDK by concatenating all of the transcription results. But you may only want the transcriptions up to and including the first final result, so that any transcriptions after the first final result are ignored.

(In other words, the processing and networking delay might be long enough that your user continues speaking after the end of a phrase, so you receive additional transcriptions after the final result but before your application has stopped the microphone.)

I'm not sure if that will happen—I haven't had a chance to test it. But I will include a code example in case you run into that issue with the bestTranscript property.

Here's a code example to try:

// start transcribing microphone audio
speechToText.recognizeMicrophone(settings: settings, failure: failure) { results in

    // print all transcription results
    print(results.bestTranscript)

    // print all transcription results up to
    // and including the first `final` result
    var transcription = ""
    for result in results.results {
        if let transcript = result.alternatives.first?.transcript {
            transcription += transcript
        }
        if result.final {
            break
        }
    }
    print(transcription)

    // detect end of speech event
    for result in results.results {
        if result.final {
            // stop transcribing microphone audio
            self.speechToText.stopRecognizeMicrophone()
        }
    }
}

As you experiment with your application, it may be helpful to refer to our documentation on the SpeechRecognitionResults and SpeechRecognitionResult structs.