watson-developer-cloud / swift-sdk

:iphone: The Watson Swift SDK enables developers to quickly add Watson Cognitive Computing services to their Swift applications.
https://watson-developer-cloud.github.io/swift-sdk/
Apache License 2.0

Insufficient data for audio stream #461

Closed troyibm closed 7 years ago

troyibm commented 8 years ago

Testing out the 0.8.0 STT approaches, I found that with the SpeechToText.recognizeMicrophone()/stopRecognizeMicrophone() approach and the Network Link Conditioner set to the 3G preset, I would get these errors:

(Note: I've also seen this error when using the "Session Management and Advanced Features" approach, too.)

Error Domain=com.ibm.watson.developer-cloud.SpeechToTextV1 Code=0 "Stream was 66 bytes but needs to be at least 100 bytes." UserInfo={NSLocalizedFailureReason=Stream was 66 bytes but needs to be at least 100 bytes.}
Error Domain=WebSocket Code=1011 "see the previous message for the error details." UserInfo={NSLocalizedDescription=see the previous message for the error details.}

Sep 28, 2016, 9:31 AM: transcribe voice
Error Domain=com.ibm.watson.developer-cloud.SpeechToTextV1 Code=0 "Stream was 70 bytes but needs to be at least 100 bytes." UserInfo={NSLocalizedFailureReason=Stream was 70 bytes but needs to be at least 100 bytes.}
Error Domain=WebSocket Code=1011 "see the previous message for the error details." UserInfo={NSLocalizedDescription=see the previous message for the error details.}

This problem is intermittent, but once it starts, I continue to get the error.

glennrfisher commented 8 years ago

Sorry for the delay, @troyibm. We usually try to respond sooner but have been very busy here.

This is an interesting error and I'm not quite sure what is causing it, but I hope to find some time this week or next to try reproducing this bug and finding a fix.

Are you using Opus compression? I wonder if the bug comes from this line, which forces the Ogg/Opus bitstream to be flushed each time there is new data from the microphone. Perhaps the bitstream, even after adding additional microphone data, doesn't contain more than 100 bytes?
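To illustrate the hypothesis above: if every microphone callback forces a flush, a quiet or very short callback can emit a chunk smaller than the server's 100-byte minimum. A minimal sketch of the alternative, accumulating encoded bytes until the minimum is reached before sending, might look like this (the `AudioChunkBuffer` type and threshold handling are illustrative, not part of the SDK):

```swift
import Foundation

// Illustrative sketch: accumulate encoded audio and only emit a chunk
// once it reaches the assumed 100-byte server minimum.
final class AudioChunkBuffer {
    private var buffer = Data()
    private let minimumChunkSize: Int

    init(minimumChunkSize: Int = 100) {
        self.minimumChunkSize = minimumChunkSize
    }

    /// Appends newly encoded audio and returns a chunk to send,
    /// or nil if the buffer hasn't reached the minimum size yet.
    func append(_ encoded: Data) -> Data? {
        buffer.append(encoded)
        guard buffer.count >= minimumChunkSize else { return nil }
        let chunk = buffer
        buffer.removeAll(keepingCapacity: true)
        return chunk
    }

    /// Flushes whatever remains (e.g. when the user stops recording).
    func flush() -> Data? {
        guard !buffer.isEmpty else { return nil }
        let chunk = buffer
        buffer.removeAll(keepingCapacity: true)
        return chunk
    }
}
```

With this approach, a 66-byte callback would return nil from `append`, and the bytes would ride along with the next callback's data instead of being flushed on their own.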

troyibm commented 8 years ago

Yes, we are using opus compression.

troyibm commented 8 years ago

@glennrfisher any solution to this problem? We still have it and are basically telling the user "something went wrong, please speak again" :(

vherrin commented 8 years ago

@troyibm - Is this still a problem? We will close this issue if it is no longer occurring.

glennrfisher commented 8 years ago

If you do run into this problem again, @troyibm, can you try the no-force-flush branch? It's possible that this issue is caused by flushing the Ogg stream; perhaps there are fewer than 100 bytes of compressed microphone data.

Here's how to update your Cartfile to try the no-force-flush branch:

github "watson-developer-cloud/ios-sdk" "no-force-flush"

esilky commented 8 years ago

I run into this error quite often when testing the application. I typically have logging on and can provide any information that the application is logging. What information is most important to you? The log output is only available on screen, so I will need to take and post screenshots.

glennrfisher commented 8 years ago

@esilky: Did you try the no-force-flush branch?

We would need enough information to recreate the problem, since we haven't seen it when testing with our simple-chat-objective-c application.

Is there a repository you could provide us access to? Ideally we would be able to run your application locally to see the problems you're running into.

Here are some questions I have. Feel free to add more information (from the logs, for example).

mina03 commented 8 years ago

@glennrfisher We haven't tried the no-force-flush branch as yet.

Following are the answers to your questions:

esilky commented 8 years ago

I spent about 2 hours asking questions yesterday. This error occurred roughly 1 out of 10 questions. As mentioned above, it didn't seem to depend on the question.

glennrfisher commented 8 years ago

Thanks for the information, @mina03 and @esilky! That gives me a better idea of what to investigate further.

If you have the time, feel free to give the no-force-flush branch a try. But since you said you ran into this error with both Opus and WAV formats, I don't think that will fix the problem. (The no-force-flush branch changes the configuration for how we construct an Ogg/Opus stream.)

I will try to squeeze in time to take another look at this issue today.

glennrfisher commented 8 years ago

I'm still not able to replicate the issue.

I created a quick application to try it out. It just has one button that starts/stops transcribing. I tried multiple executions of the application with 20 transcriptions in each and did not run into any problems. Here is a video to demonstrate how I was testing the application: SpeechToTextTest.zip

Here's the main code of the application, from ViewController.swift:

import UIKit
import SpeechToTextV1

class ViewController: UIViewController {

    private let speechToText = SpeechToText(username: "...", password: "...")
    private let failure = { (error: NSError) in print("*** error: " + error.description) }

    override func viewDidLoad() {
        super.viewDidLoad()
    }

    @IBAction func didPressTranscribe(sender: UIButton) {

        if sender.currentTitle == "Start Transcribing" {

            var settings = RecognitionSettings(contentType: .Opus)
            settings.interimResults = true
            settings.continuous = true
            settings.inactivityTimeout = -1

            speechToText.recognizeMicrophone(settings, compress: true, failure: failure) { results in
                print("*** best: " + results.bestTranscript)
                print("*** results: " + results.results.description)
            }

            sender.setTitle("Stop Transcribing", forState: .Normal)

        } else {

            speechToText.stopRecognizeMicrophone()

            sender.setTitle("Start Transcribing", forState: .Normal)

        }
    }
}

Here are all of the statements printed to the console:

*** best:  they 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "they ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test number one 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test number one ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test number one of one 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test number one of one ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test number one of Watson 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test number one of Watson ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test number one of Watson's speech 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test number one of Watson\'s speech ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test number one of Watson's speech to text 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test number one of Watson\'s speech to text ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test number one of Watson's speech to text 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test number one of Watson\'s speech to text ", confidence: Optional(0.83200000000000007), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** error: Error Domain=com.ibm.watson.developer-cloud.SpeechToTextV1 Code=0 "Stream was 70 bytes but needs to be at least 100 bytes." UserInfo={NSLocalizedFailureReason=Stream was 70 bytes but needs to be at least 100 bytes.}
*** error: Error Domain=WebSocket Code=1011 "see the previous message for the error details." UserInfo={NSLocalizedDescription=see the previous message for the error details.}
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test three 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test three ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test three 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test three ", confidence: Optional(0.84800000000000009), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is just for 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is just for ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is just for 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is just for ", confidence: Optional(0.87700000000000011), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  the 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "the ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test five 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test five ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test five 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test five ", confidence: Optional(0.93900000000000006), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is tests 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is tests ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test six 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test six ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test six 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test six ", confidence: Optional(0.76300000000000001), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is tests 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is tests ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is tests of 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is tests of ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is tests seven 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is tests seven ", confidence: Optional(0.75700000000000012), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is to 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is to ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test eight 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test eight ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test eight 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test eight ", confidence: Optional(0.88400000000000001), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test now 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test now ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test nine 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test nine ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test nine 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test nine ", confidence: Optional(0.7380000000000001), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is just 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is just ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is just ten 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is just ten ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is just ten 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is just ten ", confidence: Optional(0.68600000000000005), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is to 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is to ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test a lot 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test a lot ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test eleven 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test eleven ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test eleven 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test eleven ", confidence: Optional(0.91600000000000004), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test well 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test well ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test wells 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test wells ", confidence: Optional(0.45200000000000001), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  the system 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "the system ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test thirty 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test thirty ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test thirteen 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test thirteen ", confidence: Optional(0.89100000000000001), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test for two 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test for two ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test fourteen 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test fourteen ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test fourteen 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test fourteen ", confidence: Optional(0.84200000000000008), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test fifty 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test fifty ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is just fifteen 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is just fifteen ", confidence: Optional(0.82600000000000007), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is tests 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is tests ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test sixteen 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test sixteen ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test sixteen 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test sixteen ", confidence: Optional(0.83800000000000008), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is just 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is just ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is just seventy 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is just seventy ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is just seventeen 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is just seventeen ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is just seventeen 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is just seventeen ", confidence: Optional(0.85400000000000009), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is to 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is to ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test eight 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test eight ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test eighteen 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test eighteen ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is test eighteen 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is test eighteen ", confidence: Optional(0.8660000000000001), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is to 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is to ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is because ninety 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is because ninety ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is because nineteen 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is because nineteen ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  this is because nineteen 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "this is because nineteen ", confidence: Optional(0.81700000000000006), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  in the 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "in the ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  and this is just 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "and this is just ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  and this is just twenty 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: false, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "and this is just twenty ", confidence: nil, timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]
*** best:  and this is just twenty 
*** results: [SpeechToTextV1.SpeechRecognitionResult(final: true, alternatives: [SpeechToTextV1.SpeechRecognitionAlternative(transcript: "and this is just twenty ", confidence: Optional(0.83600000000000008), timestamps: nil, wordConfidence: nil)], keywordResults: nil, wordAlternatives: nil)]

Any idea what's different between this application and yours?

It's worth noting that in iOS 10, this application was crashing until I added a Privacy - Microphone Usage Description property to the application's Info.plist file.
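For reference, that property corresponds to the `NSMicrophoneUsageDescription` key; a minimal Info.plist entry looks like this (the usage string below is just an example):

```xml
<key>NSMicrophoneUsageDescription</key>
<string>The microphone is used to transcribe your speech.</string>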

esilky commented 8 years ago

@troyibm @mina03 - any idea? I am still seeing this error when I test. Probably 1 in 10-12 questions I ask.

mina03 commented 8 years ago

@glennrfisher

  1. One difference I noticed is you are sending an additional compress param. I shall try with this param.
  2. We are using a long tap: on long tap .begin we start recognition, and on .end we stop recognition. Is it possible that this leads to an abrupt stop of recognition, causing an incorrect byte stream to be sent?
mina03 commented 8 years ago

@glennrfisher Tried with additional compress param. I still get the error.

troyibm commented 8 years ago

@glennrfisher you did run into the problem once in your test on comment https://github.com/watson-developer-cloud/ios-sdk/issues/461#issuecomment-254928172

see that it has: ** error: Error Domain=com.ibm.watson.developer-cloud.SpeechToTextV1 Code=0 "Stream was 70 bytes but needs to be at least 100 bytes." UserInfo={NSLocalizedFailureReason=Stream was 70 bytes but needs to be at least 100 bytes.}

mina03 commented 8 years ago

@glennrfisher I am able to get this error with no-force-flush branch as well.

germanattanasio commented 8 years ago

@daniel-bolanos can you help with this ^^. Seems like when we send the chunks of data in the WebSocket connection we get the error above.

Seems like if you Google the error message you get some links from people having the same issue with other SDKs.

daniel-bolanos commented 8 years ago

"Stream was 70 bytes but needs to be at least 100 bytes" <- can you just send a bigger binary message? why 70 bytes only?

glennrfisher commented 8 years ago

@daniel-bolanos I haven't been able to track down the state of the SDK when a <100 byte message is sent. So I'm not sure what circumstance causes such a small stream of data to be sent.

Here's my hunch for what's causing the problem, though:

We encode pcm data to opus so long as there is enough data for a complete frame. If there isn't enough pcm data to construct a complete opus frame, then we cache the remainder until the next microphone interrupt. When the user wants to stop streaming microphone audio, we encode a final opus frame with any remaining pcm cache, where the cache is padded with zeroes to create enough data for a final opus frame.

Since there's a lot of redundant data in such a final frame (i.e. lots of zeros, which can be encoded in just a few bytes using a run-length encoding), the encoded opus frame may actually be less than 100 bytes.

The problem with that hunch, though, is that @mina03 claims to see the same message when using either Opus or PCM encoding...
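As a rough sketch of the caching-and-padding behavior described above (illustrative only: `OpusChunker`, `frameSize`, and the `encodeFrame` callback are assumptions, not the SDK's actual API):

```swift
import Foundation

// Hypothetical sketch of the microphone-to-Opus caching logic.
final class OpusChunker {
    private var cache = Data()  // leftover PCM between microphone callbacks
    private let frameSize: Int  // bytes of PCM per complete Opus frame

    init(frameSize: Int) { self.frameSize = frameSize }

    // Encode as many complete frames as the cached PCM allows;
    // keep any remainder for the next microphone interrupt.
    func append(_ pcm: Data, encodeFrame: (Data) -> Void) {
        cache.append(pcm)
        while cache.count >= frameSize {
            encodeFrame(cache.prefix(frameSize))
            cache.removeFirst(frameSize)
        }
    }

    // On stop: pad the remainder with zeroes to fill one last frame.
    // A mostly-zero frame compresses very well, so the encoded
    // result can end up tiny -- possibly under 100 bytes.
    func finish(encodeFrame: (Data) -> Void) {
        guard !cache.isEmpty else { return }
        cache.append(Data(count: frameSize - cache.count))
        encodeFrame(cache)
        cache.removeAll()
    }
}
```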

daniel-bolanos commented 8 years ago

@glennrfisher, my advice is: if you have a chunk smaller than 100 bytes right before signalling end of audio, just drop it. There can't be much speech in that, so I think it is safe to drop. Have you tried that? How many milliseconds of speech is 100 bytes?

glennrfisher commented 8 years ago

Looking back at the results of our test, it's possible that would have caused some speech to be dropped.

Here's a successful set of responses:

this is 
this is test 
this is test three
final: this is test three

And here is the test that failed:

this is 
this is test 
Error Domain=com.ibm.watson.developer-cloud.SpeechToTextV1 Code=0 "Stream was 70 bytes but needs to be at least 100 bytes." UserInfo={NSLocalizedFailureReason=Stream was 70 bytes but needs to be at least 100 bytes.}
Error Domain=WebSocket Code=1011 "see the previous message for the error details." UserInfo={NSLocalizedDescription=see the previous message for the error details.}

So either: (1) the 70-byte stream included the speech for "two", or (2) an opus frame with the speech for "two" was in-flight, along with a shorter 70-byte frame, and the service responded with an error before processing and returning results for the "two" frame.

Do you have any insights, @daniel-bolanos? We can try modifying the SDK to drop <100 byte frames, but not sure if we have the time to make those changes and tests today.

daniel-bolanos commented 8 years ago

Hi @glennrfisher, I believe dropping the last frame if it is smaller than 100 bytes is safe with regard to recognition performance. What bitrate are you using? If it's 64 kbit/s, 70 bytes is nothing.
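A quick sanity check of that arithmetic, assuming the 64 kbit/s figure above:

```swift
// How much audio do 70 (or 100) bytes represent at 64 kbit/s?
let bitsPerSecond = 64_000.0
let bytesPerSecond = bitsPerSecond / 8.0             // 8,000 bytes per second
let msFor70Bytes = 70.0 / bytesPerSecond * 1000.0    // ≈ 8.75 ms
let msFor100Bytes = 100.0 / bytesPerSecond * 1000.0  // ≈ 12.5 ms
print(msFor70Bytes, msFor100Bytes)
```

So at that bitrate the 100-byte minimum corresponds to only about 12.5 ms of audio.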

glennrfisher commented 8 years ago

We can do that, but I'm not entirely comfortable with it--especially without further testing to see exactly what data is being dropped. It's possible that those 100 bytes contain some critical information about the end of a stream (e.g. it could be an Ogg page with the EOS flag and no audio data--that would be about 28 bytes). Without the final Ogg page, the Ogg/Opus file would be corrupt.

Does the server need to produce an error when sent data that is less than 100 bytes? Are there issues that you think would crop up if the minimum is removed?


daniel-bolanos commented 8 years ago

I hear you Glenn, let me investigate this...

daniel-bolanos commented 8 years ago

Hi @glennrfisher, talking to @jfigura I learned that we definitely support smaller chunks, even 1 byte. This message is triggered only if the whole utterance was less than 100 bytes. Maybe you are accidentally sending end of stream: in the WS interface, an empty blob is treated as end of stream (same as {action: stop}).

glennrfisher commented 7 years ago

Ah, great idea @daniel-bolanos. I ran some quick tests and agree that we're probably sending an empty blob somehow. Maybe there's an edge case where there's not enough opus data to fill an ogg page and we end up trying to send a zero-byte ogg page to the service.

For what it's worth, here are the tests that I tried.

Send audio in 50-byte chunks

This worked just fine, corroborating @jfigura's point that chunks smaller than 100 bytes are supported.

speechtotext-chunks

Send a single 50-byte chunk

This timed out, as expected, since we never close the connection. It wasn't a terribly useful test, but it was helpful to see that the "Stream was xx bytes but needs to be at least 100 bytes" error didn't occur.

speechtotext-timeout

Send a single 50-byte chunk then a 0-byte chunk

This causes the 100-byte error, as expected. The results here also look a lot like what our users are reporting. So I suspect the issue is with a rogue 0-byte chunk, as @daniel-bolanos mentioned.

speechtotext-emptychunk

glennrfisher commented 7 years ago

I just submitted a pull request that ensures we don't send a data payload of zero bytes. I think it should fix this issue--I sat in a conference room and must have started/stopped transcribing about a hundred times and didn't see any errors.
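The guard in question presumably amounts to something like this (a sketch; `writeAudio` and the `send` callback stand in for the SDK's actual WebSocket write path):

```swift
import Foundation

// Never forward a zero-byte payload: on the WebSocket interface the
// service interprets an empty binary message as end-of-stream.
func writeAudio(_ chunk: Data, send: (Data) -> Void) {
    guard !chunk.isEmpty else { return }  // drop rogue 0-byte chunks
    send(chunk)
}
```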

Since I think that pull request fixes this issue, I'm going to go ahead and close it. Since the issue is intermittent, though, there's a chance that I just didn't see it in my testing. So if anyone runs into it again, please feel free to reopen this issue. Thanks!

germanattanasio commented 7 years ago

🍻

germanattanasio commented 7 years ago

selfie-0