noopkat / ms-bing-speech-service

NodeJS service wrapper for Microsoft Speech API and Custom Speech Service
MIT License

from docs, sendStream should be on "recognizer" instead of "service" #6

Closed · filljoyner closed this issue 5 years ago

filljoyner commented 6 years ago

Hi! First off, thank you so much for creating this! I've tried to implement Microsoft's Speech to Text service in node several times using their various examples and failed miserably. I really appreciate your efforts.

In the docs, sendStream is shown as part of the service object, but it's actually a method on the base class (BingSpeechService.prototype.sendStream in BingSpeechService.js), so it needs to be called on the recognizer instance.
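
For reference, here's a minimal sketch of the call that works for me (myAudioStream is a stand-in for any readable audio stream):

const SpeechService = require('ms-bing-speech-service');

const recognizer = new SpeechService({
    language: 'en-US',
    subscriptionKey: 'YOUR_SUBSCRIPTION_KEY',
    mode: 'interactive'
});

recognizer.start((error, service) => {
    if (error) return console.error(error);

    service.on('recognition', (message) => console.log(message));

    // sendStream lives on the recognizer instance
    // (BingSpeechService.prototype.sendStream), not on the service object
    recognizer.sendStream(myAudioStream);
});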

Thanks again for the great wrapper! Phil

noopkat commented 6 years ago

Hi @filljoyner,

Oof, great catch. Thank you for raising this; I'll update the docs. I'm curious to hear your opinion on whether this should remain on the base class only. It might be worth also supporting it on the service. Thoughts?
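
If we did support it on the service too, the simplest shape I can think of (just a sketch of the idea, not current code) would be to have start delegate the call back to the recognizer before handing the service object to the callback:

// hypothetical sketch inside BingSpeechService.prototype.start,
// run just before the callback is invoked:
service.sendStream = (stream) => this.sendStream(stream);
callback(null, service);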

I am glad to hear you appreciate this library. It took me a long time to figure out how to go about writing it, and I had to reference a lot of examples and read the protocol spec many times. I struggled with the official JavaScript SDK, so I figured others would too. This Node SDK is still a little rough around the edges, so I welcome feedback if you use it enough to notice bumpy experiences here and there 😄

Thanks again!

filljoyner commented 6 years ago

Hi @noopkat! I'll bet it did! I spent a lot of time with the silly Speech to Text Azure Web Sockets repo and it is really terrible. I'm not sure how anyone writing a modern app could implement that in a constructive way. Microsoft has great cognitive products, but their packages are really bad. I used Watson for a while purely because of the ease of implementation, but MS' Speech to Text translations are significantly better.

Figuring out a package API is always a tricky thing. I can give you a few thoughts, but these may be informed more by my specific use case (building a helper bot in Electron which sends transcribed text to a Laravel API). Here's my current implementation.

const WaveRecorder = window.require('wave-recorder');
const SpeechService = window.require('ms-bing-speech-service');

class SpeechToText {
    constructor(key) {
        this.audioContext = null;
        this.listening = false;
        this.recognizer = new SpeechService({
            language: 'en-US',
            subscriptionKey: key,
            mode: 'interactive'
        });
    }

    listen(callback) {
        // audio stream stuff
        this.audioContext = new AudioContext();

        // legacy prefixed callback-style getUserMedia (available in Electron/Chromium)
        navigator.webkitGetUserMedia({audio: true}, (stream) => {
            // get mic input
            let audioInput = this.audioContext.createMediaStreamSource(stream);

            // create the recorder instance
            this.recorder = WaveRecorder(this.audioContext, {
                channels: 1,
                bitDepth: 16,
                silenceDuration: 1
            });
            audioInput.connect(this.recorder.input);

            this.recognizer.start((error, service) => {
                if (error) {
                    // surface connection failures instead of streaming anyway
                    console.error(error);
                    return;
                }

                console.log('service started');

                service.on('turn.start', () => {
                    this.begin();
                });
                service.on('turn.end', () => {
                    this.end();
                });

                service.on('recognition', (message) => {
                    callback(message, 'recognition');
                });

                service.on('speech.hypothesis', (message) => {
                    callback(message, 'hypothesis');
                });

                // only start streaming once the connection succeeded
                this.recognizer.sendStream(this.recorder);
            });

            // hard stop after 10 seconds of listening
            setTimeout(() => {
                this.end();
            }, 10000);

        }, (response) => {
            // getUserMedia was denied or failed
            console.error(response);
        });

    }

    begin() {
        this.listening = true;
    }

    end(callback) {
        if (this.listening) {
            this.recorder.end();
            this.audioContext.close();
            this.recognizer.stop(callback);
            this.listening = false;
            console.log('stopped');
        }
    }
}

module.exports = SpeechToText;

I use the class like so:

const SpeechToText = require('../../Utilities/SpeechToText');

let stt = new SpeechToText('MY_AZURE_KEY');

stt.listen((message, type) => {
    if (type === 'recognition') {
        // I'm using a Vue event bus here, but any callback could be used
        this.$bus.$emit('SpeechToText-ListenRecognition', message);
    } else {
        // I'm using a Vue event bus here, but any callback could be used
        this.$bus.$emit('SpeechToText-ListenHypothesis', message);
    }
});

I'm not sure about recognizer.start, as it doesn't feel like it's actually starting anything; it's more of a configuration step, since I'm setting up listeners/callbacks inside it. And because sendStream is part of the recognizer, it feels a little weird to reference the recognizer again within that function to call sendStream, though I probably could have done this outside of the anonymous function passed to recognizer.start.

It would be neat if it were all chainable. So maybe something like the following (this was written directly in the GitHub comment, so pardon any errors).

recognizer.boot((error, service) => {
    if(!error) {
        // all the service stuff
    }
}).sendStream(myStreamBuffer);

If there is an error, it could be caught in the boot stage, and sendStream could still be chained but would not actually do anything. That said, this may not really be the best idea, as I'm only using a minimal amount of the package's features.
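
To make that work safely, boot would probably need to remember the connection outcome and queue any sends that arrive before it finishes. Here's a rough sketch of the idea (none of these names exist in the package today):

class ChainableRecognizer {
    constructor(recognizer) {
        this.recognizer = recognizer;
        this.ready = false;
        this.failed = false;
        this.pending = [];
    }

    boot(callback) {
        this.recognizer.start((error, service) => {
            if (error) {
                this.failed = true;
                this.pending = []; // drop anything queued before the failure
            } else {
                this.ready = true;
                // flush sends that were chained before the connection opened
                this.pending.forEach((stream) => this.recognizer.sendStream(stream));
                this.pending = [];
            }
            callback(error, service);
        });
        return this; // enables chaining
    }

    sendStream(stream) {
        if (this.failed) return this; // no-op after a boot error
        if (this.ready) this.recognizer.sendStream(stream);
        else this.pending.push(stream); // queue until connected
        return this;
    }
}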

I hope this in some small way helps. Of course everything above is just an idea. You had a mountain to overcome just to get the thing to work! Well done!

If you're open to it, I'd be happy to offer up some ideas in code in some time. I need to get this bot idea out of my head first! Phil

noopkat commented 6 years ago

This is really great feedback. And thanks for sharing your implementation, as it's really insightful for me to see how it's being used. I completely agree that the chainable scenario feels very ergonomic; I'd never thought of doing it this way. My gut feeling is that most folks are going to be using this for one of two scenarios:

  1. sending static files from the hard drive to transcribe
  2. sending a live stream of speech from WebRTC, ffmpeg, a system-level audio source, etc. (both sketched below)
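
Both of those already map onto the existing calls, roughly like this (a sketch; 'recording.wav' and micStream are stand-ins):

recognizer.start((error, service) => {
    if (error) return console.error(error);

    service.on('recognition', (message) => console.log(message));

    // 1. a static file from disk
    recognizer.sendFile('recording.wav');

    // or 2. a live audio stream (WebRTC, ffmpeg, system audio, etc.)
    recognizer.sendStream(micStream);
});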

I think that your chaining idea suits both use cases. I'm interested in how the event listeners fit in there, but I'll let you get your project done first and let me know if you've still got the energy to talk further implementation details. I'm not opposed to a version 2 bump if the existing interface would be a big fuss to keep support for.

I really needed folks to start using this first to see what made sense for the shape of the public interface, so I'm excited that real usage has happened so soon. I have a similar package for the Microsoft Translator websocket API and have been in touch recently with a colleague who has been using it. I think gathering their feedback will also be valuable, given both packages have a similar architectural design.

Anyway no pressure and thanks again for your insights and kindness 😄 I'd love to hear more about your project once it's at a point you're happy with.

filljoyner commented 6 years ago

Happy to help! Sending files from disk and streaming both sound like the most common use cases. I'm starting to rethink chaining now, though, as both sendFile and sendStream depend on a successful connection to the service. Throwing an error in the attempt to sendFile or sendStream doesn't sound right, since the error should be thrown in start or boot and not carried on to the next step. And it isn't a plain async scenario, as the connection needs to be made first, so a promise-like implementation isn't the right fit here either. That one will take some thinking through.

I listened to a podcast a few years ago where developers were discussing adding a new feature to an existing framework, and I was impressed by the considerable amount of time they spent thinking through the right terminology and feel for the interface before jumping into code. It's an interesting idea, and when I moved over to TDD (for server-side languages) I began to see the value: you write the implementation based on how you want to interact with the class/package/etc., then go back and build out the plumbing to make that implementation work, and iterate until happy.

I'll give it a little more thought while I'm working on the bot and let you know if something comes to mind. Once I get far enough along I'll post the code and it would be great if you could take a look!

noopkat commented 5 years ago

Hi @filljoyner,

I really enjoyed chatting in this issue about architecture 😄

Since then, there's been a new official NodeJS-supported SDK for the unified Microsoft Speech Services (formerly Bing Speech Service). I'd recommend checking that out instead, if you still have need for a library such as this: https://github.com/Azure-Samples/cognitive-services-speech-sdk

Therefore I'll respectfully close this issue and will be deprecating / archiving this repo 🙇‍♀ Thanks again for your contribution.