watson-developer-cloud / swift-sdk

The Watson Swift SDK enables developers to quickly add Watson Cognitive Computing services to their Swift applications.
https://watson-developer-cloud.github.io/swift-sdk/
Apache License 2.0

[speech-to-text] Redesign SpeechToText implementation #168

Closed glennrfisher closed 8 years ago

glennrfisher commented 8 years ago

Transition the programming model from a delegate pattern to a completion handler pattern.
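For contrast, here is a minimal sketch of the two styles. The delegate protocol and method names in the "before" half are hypothetical, for illustration only; the "after" half matches the API stub below.

// Before: clients implement a delegate protocol and register it with the service.
class MyTranscriber: SpeechToTextDelegate { // hypothetical protocol
    func speechToText(service: SpeechToText, didReceiveResults results: [SpeechToTextResponse]) { }
    func speechToText(service: SpeechToText, didFailWithError error: NSError) { }
}

// After: clients pass closures directly at the call site.
speechToText.transcribe(audio, settings: settings) { responses, error in
    // handle final transcription results or the error in one place
}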

General API Functionality:

Requested Features:

API Stub:

public class SpeechToText {

    public init() { }

    /**
     Transcribe pre-recorded audio data.

     - parameter audio: The pre-recorded audio data.
     - parameter settings: Settings to configure the SpeechToText service.
     - parameter onInterim: A callback function to execute with interim transcription results from
        the SpeechToText service. This callback function will be executed exactly once for each
        interim transcription result produced by the SpeechToText service. Note that the
        SpeechToText `interimResults` setting must be `true` for the service to return interim
        transcription results.
     - parameter completionHandler: A function that will be executed with all final transcription
        results from the SpeechToText service, or an error if an error occurred.
     */
    public func transcribe(
        audio: NSData,
        settings: SpeechToTextConfiguration,
        onInterim: ((SpeechToTextResponse?, NSError?) -> Void)? = nil,
        completionHandler: ([SpeechToTextResponse]?, NSError?) -> Void)
    {
        // 1. Set up SpeechToText with client-specified settings.
        // 2. Send the given audio data to the SpeechToText service.
        // 3. Execute the onInterim function for each interim transcription result.
        // 4. Execute the completionHandler with all final transcription results (or an error).
    }

    /**
     Start the microphone and stream the recording to the SpeechToText service for a live
     transcription. The microphone will stop recording after an end-of-speech event is detected
     by SpeechToText or the stopRecording function is executed.

     - parameter settings: The settings used to configure the SpeechToText service.
     - parameter onInterim: A callback function to execute with interim transcription results from
        the SpeechToText service. This callback function will be executed exactly once for each
        interim transcription result produced by the SpeechToText service. Note that the
        SpeechToText `interimResults` setting must be `true` for the service to return interim
        transcription results.
     - parameter completionHandler: A function that will be executed with all final transcription
        results from the SpeechToText service, or an error if an error occurred.

     - returns: A stopRecording function that can be executed to stop the microphone's recording,
        wait for any remaining transcription results to be returned by the SpeechToText service,
        then execute the completionHandler.
     */
    public func transcribe(
        settings: SpeechToTextConfiguration,
        onInterim: ((SpeechToTextResponse?, NSError?) -> Void)? = nil,
        completionHandler: ([SpeechToTextResponse]?, NSError?) -> Void)
        -> StopRecording
    {
        // 1. Set up SpeechToText with client-specified settings.
        // 2. Start the microphone.
        // 3. Stream microphone audio to the SpeechToText service.
        // 4. Execute the onInterim function for each interim transcription result.
        // 5. Continue until:
        //      a. The client executes the stopRecording function, or
        //      b. The SpeechToText service detects an "end of speech" event, or
        //      c. The SpeechToText service times out (either session timeout or inactivity timeout).
        // 6. Execute the completionHandler with all final transcription results (or an error).
    }
}

// StopRecording is a function that can be executed to forcibly stop the microphone's recording and stop
// sending audio to the Speech to Text service. (We use a typealias to enhance the expressiveness of our API.)
typealias StopRecording = () -> Void
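To illustrate how a client might consume this API, here is a hypothetical usage sketch of the microphone overload. The `SpeechToTextConfiguration` initializer and the shape of the response values are assumptions for illustration, not final API; `interimResults` is the setting mentioned in the doc comments above.

let speechToText = SpeechToText()

// hypothetical configuration; the initializer and field name are assumptions
var settings = SpeechToTextConfiguration()
settings.interimResults = true // required to receive interim transcriptions

// start a live transcription from the microphone
let stopRecording = speechToText.transcribe(
    settings,
    onInterim: { response, error in
        // the response structure is not shown in the stub; print it for illustration
        if let response = response {
            print("interim result: \(response)")
        }
    },
    completionHandler: { responses, error in
        if let error = error {
            print("transcription failed: \(error)")
        } else if let responses = responses {
            print("final results: \(responses)")
        }
    }
)

// later, e.g. when the user taps a stop button:
stopRecording()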
glennrfisher commented 8 years ago

AVCaptureAudioDataOutput and AVCaptureAudioDataOutputSampleBufferDelegate will be very helpful for continuous listening.

I'm copying/pasting this code here for reference, to show an example of a simple application that successfully receives streams of buffer data from the microphone.

(Note that there are no hardware devices available through the simulator, so this code only works when executed on a physical device.)

//
//  ViewController.swift
//  MicrophoneDelegate
//
//  Created by Glenn Fisher on 1/29/16.
//  Copyright © 2016 IBM. All rights reserved.
//

import UIKit
import AVFoundation

class ViewController: UIViewController {

    private var session = AVCaptureSession()

    override func viewDidLoad() {
        super.viewDidLoad()

        // add microphone as input
        let microphoneDevice = AVCaptureDevice.defaultDeviceWithMediaType(AVMediaTypeAudio)
        if let microphoneInput = try? AVCaptureDeviceInput(device: microphoneDevice)
            where session.canAddInput(microphoneInput) {
            session.addInput(microphoneInput)
        }

        // output to sample buffer delegate
        let output = AVCaptureAudioDataOutput()
        let queue = dispatch_queue_create("sample buffer delegate", DISPATCH_QUEUE_SERIAL)
        output.setSampleBufferDelegate(self, queue: queue)
        if session.canAddOutput(output) {
            session.addOutput(output)
        }

        // start microphone capture session
        session.startRunning()
    }

    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // Dispose of any resources that can be recreated.
    }
}

extension ViewController: AVCaptureAudioDataOutputSampleBufferDelegate {

    func captureOutput(
        captureOutput: AVCaptureOutput!,
        didOutputSampleBuffer sampleBuffer: CMSampleBuffer!,
        fromConnection connection: AVCaptureConnection!)
    {
        print("new sample buffer available for processing")
    }
}
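Each sample buffer's raw bytes still need to be extracted before the audio can be streamed to the Speech to Text service. Here is a minimal sketch of that step, assuming the output format is linear PCM; dataFromSampleBuffer is a hypothetical helper, not SDK API:

import CoreMedia
import Foundation

// Extract the raw audio bytes from a sample buffer. (Hypothetical helper.)
func dataFromSampleBuffer(sampleBuffer: CMSampleBuffer) -> NSData? {
    // the audio payload lives in the sample buffer's block buffer
    guard let blockBuffer = CMSampleBufferGetDataBuffer(sampleBuffer) else {
        return nil
    }
    let length = CMBlockBufferGetDataLength(blockBuffer)
    guard let data = NSMutableData(length: length) else {
        return nil
    }
    // copy the contiguous bytes out of the block buffer
    let status = CMBlockBufferCopyDataBytes(blockBuffer, 0, length, data.mutableBytes)
    return status == kCMBlockBufferNoErr ? data : nil
}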
vherrin commented 8 years ago

Thanks Glenn!

dsato80 commented 8 years ago

This is great! I'm looking forward to seeing the next version. By the way, do you have a specific date or week when v0.2.0 is going to be released?

glennrfisher commented 8 years ago

Hi @dsato80. Glad you're excited for it!

I don't have a specific date, but I will be working on this project full-time for a while. I'm personally aiming for the end of this week, although it might take me until next week to work out all of the kinks. I'll be sure to post here when it's ready!

I didn't really want to publish any code yet, but had to do so for testing purposes. I'll be open to accepting suggestions and pull requests after merging the speech-to-text branch into the develop branch. We may push a release to master, as well.

I'm looking forward to getting your feedback once it's ready for consumption!

glennrfisher commented 8 years ago

For my personal reference: this documentation may be helpful when designing an example with metering.

https://developer.apple.com/library/ios/documentation/AVFoundation/Reference/AVCaptureAudioChannel_Class/
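For example, each AVCaptureAudioChannel on the capture connection reports average and peak power levels in decibels, which could drive a simple level meter. A rough sketch, reusing the delegate callback from the earlier example:

func captureOutput(
    captureOutput: AVCaptureOutput!,
    didOutputSampleBuffer sampleBuffer: CMSampleBuffer!,
    fromConnection connection: AVCaptureConnection!)
{
    // each audio channel on the connection reports its power levels
    for channel in connection.audioChannels {
        if let channel = channel as? AVCaptureAudioChannel {
            print("average: \(channel.averagePowerLevel) dB, peak: \(channel.peakHoldLevel) dB")
        }
    }
}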

dsato80 commented 8 years ago

Hi @glennrfisher ,

Yes! I'm currently using the former SDK written in Objective-C with some fixes of my own, but unfortunately it is deprecated, so I want to shift to this new SDK. I will be able to work on this next week. I hope you will finish the merge this week!

siraustin commented 8 years ago

Is there a way to use the AVCaptureAudioDataOutputSampleBufferDelegate continuous listening version with the current 0.1.1 release, or do we need to wait for 0.2.0? Do you have an approximate ETA for 0.2.0?

glennrfisher commented 8 years ago

Hi @siraustin. Unfortunately, it's been a bit delayed, since I was asked to help on a time-critical project last week. I just wrapped that up this afternoon, though, and will be back on the iOS SDK project starting tomorrow.

Sorry to all for the delays. I'm still trying to get this out as quickly as possible without sacrificing code quality.

siraustin commented 8 years ago

Thanks, @glennrfisher -- helping out a group on an iOS project and really wanting to use Watson (continuous) SpeechToText... glad you're back on the project! I hope 0.2.0 comes soon :)

siraustin commented 8 years ago

@glennrfisher is speech-to-text continuous listening known to be working in v0.1.1? I'm having trouble with it and don't want to bang my head too much if it's not known to be working... Thanks! [update: I see the issue documented in #147 -- it's not clear whether it was resolved or closed in favor of the redesign]

glennrfisher commented 8 years ago

@siraustin: Continuous listening is not working in v0.1.1. Sorry for the trouble.

dsato80 commented 8 years ago

Hi @glennrfisher, this is great work!

I've tested it and found a bug in building the URL for the WebSocket connection. Please check it out: https://github.com/watson-developer-cloud/ios-sdk/pull/198

glennrfisher commented 8 years ago

Completed with commit 3fdc2746c754ba246137995e4f66231c3e8e9022.

glennrfisher commented 8 years ago

For those of you following this issue, I want to let you know that there is an example application to test and demonstrate the new Speech to Text implementation.

Please open issues liberally to let us know if you find any bugs! We will fix them quickly.