noopkat / ms-bing-speech-service

NodeJS service wrapper for Microsoft Speech API and Custom Speech Service
MIT License

One more item in the documentation #23

Closed ricardoatsouza closed 5 years ago

ricardoatsouza commented 6 years ago

First of all, thank you very much for this library! It's just amazingly good compared with the MS official one. I started using it today and already got nice results. So, keep up the nice work 👍! As a thank you, I will see if I can find some time to publish a repo with an example of how to use your API. 😄

Anyway, one thing that is missing in the documentation, and would be nice to have, is how to properly use the sendStream function. I managed to figure it out myself (it's not difficult), but it would be nice to have it in the docs with some examples. :)

I created this ticket here to point out a small suggestion in the docs, but this is not an actual issue. Rather, it's actually a big Thank You for creating this library.

Cheers!

BkunS commented 6 years ago

@ricardoatsouza I got the sendFile working, but I'm stuck on getting the sendStream working after I get the stream from getUserMedia. Do you have any working examples for sendStream function? Appreciated!

ricardoatsouza commented 6 years ago

Hi @BkunS. I haven't published it yet. I intend to do it today.

BkunS commented 6 years ago

@ricardoatsouza Cool! Do you mind sharing it?

ricardoatsouza commented 6 years ago

@BkunS Of course I don't mind. Just published it. It's a quite simple example, but should show how to use the library. Let me know if it works for you. :)

Here is the link to it: https://github.com/ricardoatsouza/ms-bing-speech-streaming-example

noopkat commented 6 years ago

Hi everyone!

@ricardoatsouza I am really glad you are enjoying this library! Sorry for such a late response.

Your example repo looks great 👍

I also have an example I created in the docs here, but I totally forgot to merge it to master. I am so sorry 🙈

If you think that example is helpful I'll merge it, otherwise I'd love your feedback!

ricardoatsouza commented 6 years ago

Hi @noopkat!

Thanks for the feedback. I think your example looks good and will be quite helpful for the sendStream part. If you wanna incorporate my repo example as part of yours in the example folder, feel free to do it :)

And, again, thanks for the library. It is super helpful! 👍

noopkat commented 6 years ago

Thanks @ricardoatsouza I’ll fold your example into this repo and will merge in the documentation branch (finally!).

BkunS commented 6 years ago

@noopkat @ricardoatsouza Thanks for the demo, but what I meant by sending a stream is getting the audio stream from getUserMedia() and sending it for recognition on the fly, something like:

getUserMedia({ audio: true })
    .then(stream => {
        // Tried several recorders here.
        recognizer.sendStream(stream);
        /* Got error because 'stream' is mediaStream and recognizer accepts readable Stream.
           I'm confused and stuck on working between different types of stream. */
    });

This is my first time dealing with streams and web audio. What piece did I miss here? I've tried several recorder npm packages, but none of them will convert and pipe the mediaStream to the recognizer as a 'readableStream'.

I saw there's a working example in the official Bing STT repo: https://github.com/Azure-Samples/SpeechToText-WebSockets-Javascript/tree/master/samples/browser

But it uses the SDK directly, so I just wonder how it should be done using this package.

noopkat commented 6 years ago

@BkunS it's unfortunately a little complicated. The official SDK package does a wonderful job of implementing getMediaStream with the websocket stt service in the source code. There's a lot of work to get there.

You have a couple of options:

  1. Use the mediaStream recording API to capture the mediaStream blobs as they are recorded (from the ondataavailable event) and pass those on to a new stream that you create yourself and use with sendStream. This stream can be based on the browser polyfill that either Webpack or Browserify offers if you bundle with one of these tools. If you like, you can instead use the recognizer.sendChunk method with the ondataavailable event to bypass having to create your own stream 😄. The mediaStream recording API doesn't have great browser support, but this polyfill is useful. I have successfully used this approach before.

  2. The official SDK for Bing STT uses a combination of a PCM encoder and createMediaStreamSource, as you can read in the source code. The authors also implemented their own stream interface as well. That's why I commend them on how much work that took! 😄
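
A minimal sketch of the "create your own stream" part of option 1, assuming the bundler provides Node's `stream` module and `Buffer` in the browser. `createAudioBridge` and its method names are hypothetical helpers for illustration, not part of this library; only `sendStream` comes from the library itself.

```javascript
const { PassThrough } = require('stream');

// Hypothetical helper: wraps a PassThrough stream so that recorder
// callbacks can feed it ArrayBuffers one chunk at a time.
function createAudioBridge() {
  const audioStream = new PassThrough();
  return {
    // Node-style readable stream; this is what would be handed to
    // recognizer.sendStream(...) in a real app.
    audioStream,
    // Call from the recorder's ondataavailable handler, once the blob
    // has been read into an ArrayBuffer.
    push(arrayBuffer) {
      audioStream.write(Buffer.from(arrayBuffer));
    },
    // Call when recording stops so the consumer sees the stream end.
    end() {
      audioStream.end();
    },
  };
}

// Usage with stand-in audio bytes instead of real recorder output.
const bridge = createAudioBridge();
bridge.push(new Uint8Array([1, 2, 3]).buffer);
bridge.push(new Uint8Array([4, 5]).buffer);
bridge.end();
```

The point of the bridge is that the recognizer only ever sees an ordinary readable stream, while the browser side keeps dealing in blobs and ArrayBuffers.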

Let me know if this makes sense to you. I'm a little slammed with other work stuff lately but I'll see if I can make the time to sit down and produce a working example + source code.

BkunS commented 6 years ago

@noopkat Sorry for getting back so late. I just had the time to try your options, but I still couldn't get it working: I used msr.ondataavailable as you suggested, and the default mime type doesn't seem to work, so I set it to 'audio/wav' (not sure if this is the right one either). At least it gives different errors depending on the recognizer method:

mediaRecorder.start();
mediaRecorder.ondataavailable = function(blob) {
  // Tried one at a time; each call fails with the error shown.
  recognizer.sendFile(blob);   // Unhandled promise rejection Error: could not send File: not a valid ArrayBuffer
  recognizer.sendStream(blob); // Unhandled promise rejection TypeError: Object doesn't support property or method 'on'
  recognizer.sendChunk(blob);  // Could not start service: TypeError: First argument must be a string, Buffer, ArrayBuffer, Array, or array-like object.
};

Let me know if there's anything that I did wrong.

noopkat commented 6 years ago

@BkunS no worries!

audio/wav is the mime type needed.

Could you please try the following to convert the blob into an array buffer in order to use sendChunk?

mediaRecorder.start();
mediaRecorder.ondataavailable = function(blob) {
  // Read the recorded blob into an ArrayBuffer, which sendChunk accepts.
  var fileReader = new FileReader();
  fileReader.onload = function(event) {
    recognizer.sendChunk(event.target.result);
  };
  fileReader.readAsArrayBuffer(blob);
};

BkunS commented 6 years ago

@noopkat Just tried it, it's finally working! Thank you so much!!!

I noticed that the recognition accuracy is not as high as when sending the whole file (I guess it's because the chunks leave many words' audio incomplete). I tried increasing the buffer size to the max (16384 is the highest in msr's docs). It became much better, but still not as accurate as sending the whole file. Would there be any other way to improve it?

Also, I've noticed that it only works in Firefox, though Firefox still shows this error:

this.telemetry.Metrics.filter(...).pop(...) is undefined
MsBingSpeechService.js:7

In Chrome (67), sending a chunk shows an error like this: failed: Error during WebSocket handshake: Sent non-empty 'Sec-WebSocket-Protocol' header but no response was received

Edge (42) has a different error:

Unable to set property 'End' of undefined or null reference
MsBingSpeechService.js (6,1)

Is this something you could look into?

noopkat commented 6 years ago

Hi @BkunS,

I am so happy to hear that everything started working! We're almost there.

As for the accuracy, yes I have observed this when comparing live speech vs a static file buffer. Upping the chunk size to the maximum is really the only thing I can think of right now to improve it, and you have tried that as you said.
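
For illustration, here is a rough sketch of sending fewer, larger chunks by batching small buffers before forwarding them. `createChunkBatcher`, `send`, and `minBytes` are hypothetical names, and the sketch assumes the chunks are raw audio bytes that can simply be concatenated; if each recorder blob carries its own header, that would need handling first. Whether this improves accuracy beyond raising the recorder's buffer size is untested.

```javascript
// Hypothetical helper: collects small ArrayBuffer chunks and forwards
// them in larger batches. `send` is whatever delivers the audio (for
// example a wrapper around recognizer.sendChunk); `minBytes` is an
// illustrative threshold.
function createChunkBatcher(send, minBytes) {
  let pending = [];
  let pendingBytes = 0;

  // Merge everything collected so far into one buffer and send it.
  function flush() {
    if (pendingBytes === 0) return;
    const merged = new Uint8Array(pendingBytes);
    let offset = 0;
    for (const part of pending) {
      merged.set(part, offset);
      offset += part.byteLength;
    }
    pending = [];
    pendingBytes = 0;
    send(merged.buffer);
  }

  return {
    // Queue one chunk; send as soon as the threshold is reached.
    add(arrayBuffer) {
      pending.push(new Uint8Array(arrayBuffer));
      pendingBytes += arrayBuffer.byteLength;
      if (pendingBytes >= minBytes) flush();
    },
    flush,
  };
}

// Example with a stand-in sender that records the size of each batch.
const sentSizes = [];
const batcher = createChunkBatcher((buf) => sentSizes.push(buf.byteLength), 4);
batcher.add(new Uint8Array([1, 2]).buffer);    // below threshold, held back
batcher.add(new Uint8Array([3, 4, 5]).buffer); // threshold reached, one batch sent
batcher.add(new Uint8Array([6]).buffer);       // held back again
batcher.flush();                               // send the remainder when recording stops
```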

As for the errors, these are ones that I can definitely look into. Firefox is the most compatible as far as I know, but I don't see why I couldn't resolve the errors you documented for me. Thanks for all the details - this helps me a lot!

I'll keep you posted on this, thanks for your patience 🙇‍♀️

noopkat commented 5 years ago

Hi @ricardoatsouza and @BkunS

Thanks for working with me on this issue. I didn't end up being able to resolve the strange browser errors.

Since then, a new official SDK has been released for the unified Microsoft Speech Services (formerly Bing Speech Service), with support for NodeJS (and browser environments!). I'd recommend checking that out instead, if you still have need for a library such as this: https://github.com/Azure-Samples/cognitive-services-speech-sdk

Therefore I'll respectfully close this issue and will be deprecating / archiving this repo 🙇‍♀ Thanks again for your contribution.