microsoft / BotBuilder-RealTimeMediaCalling

BotBuilder-RealTimeMediaCalling extends the BotBuilder to enable bots to engage in Skype audio-video calling. It provides real-time, programmable access to the voice, video, and screen sharing streams of a Skype call. The bot is a direct participant in a Skype 1:1 call.
MIT License

Support for Microsoft.CognitiveServices.Speech #46

Open scberr opened 5 years ago

scberr commented 5 years ago

It's no longer possible to add Bing Speech subscriptions to my Azure subscription as it has been retired in favour of Microsoft.CognitiveServices.Speech.

As a result I can't get the code working. When I plug the new engine in, everything seems to work, but it doesn't ever recognise any speech.

Happy to help sort this out.

adityaramgopal commented 5 years ago

While I might not know the internals of how Microsoft.CognitiveServices.Speech / Bing Speech works, I can try taking a look if you could share how you're sending the audio buffers to Microsoft Cognitive Services speech. Was there an API breaking change when switching from Bing speech to CognitiveServices?

scberr commented 5 years ago

Yes, there is a breaking change: you need to switch to the new C# API (it preserves the REST schema, but has a different endpoint).

It seems to eliminate a lot of the solution code as the API now takes care of more.

I finally got it working late last night, but the code is now very messy. I had to rewrite a few methods and add a heap of extra tracing and debug logic to isolate the cause.

I can clean it up in the next day or so and send it through. Do you want me to do it as a pull request or just paste the changes in here?

adityaramgopal commented 5 years ago

If you can paste code samples of the following it would be great:

  1. How you feed the received audio buffers into the Cognitive Services C# API.
  2. How you subscribe to recognition/session events.
  3. How you stop sending buffers and indicate that you want to stop recognition.

scberr commented 5 years ago

Here's the new StartRecognition method (the version that works). I think the root problem was that I was missing the line `Task.WaitAny(new[] { stopRecognition.Task });`

```csharp
private void StartSpeechRecognitionv2()
{
    var stopRecognition = new TaskCompletionSource<int>();

    var config = SpeechConfig.FromSubscription(Service.Instance.Configuration.SpeechSubscription, "southeastasia");

    _recognitionStream = AudioInputStream.CreatePushStream();

    Task.Run(async () =>
    {
        try
        {
            // Creates a speech recognizer over the push stream.
            using (var recognizer = new SpeechRecognizer(config, AudioConfig.FromStreamInput(_recognitionStream)))
            {
                Log.Info(new CallerInfo(), LogContext.Media, $"[{this.Id}]: Setting up speech recognition");
                recognizer.Recognizing += Recognizer_Recognizing;
                recognizer.SessionStarted += Recognizer_SessionStarted;
                recognizer.SessionStopped += Recognizer_SessionStopped;

                Log.Info(new CallerInfo(), LogContext.Media, $"[{this.Id}]: Starting continuous recognition.");

                await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);

                // Waits for completion.
                // Use Task.WaitAny to keep the task rooted.
                Task.WaitAny(new[] { stopRecognition.Task });

                // Stops recognition.
                await recognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);
            }

            Log.Info(new CallerInfo(), LogContext.Media, $"[{this.Id}]: Speech recognition completed.");
        }
        catch (Exception exception)
        {
            Log.Error(new CallerInfo(), LogContext.Media, $"[{this.Id}]: Speech recognition threw exception {exception}");
        }
    }).ForgetAndLogException(string.Format("Failed to start the SpeechRecognition Task for Id: {0}", Id));
}
```
scberr commented 5 years ago

OnAudioMediaReceived now looks like this:

```csharp
private void OnAudioMediaReceived(object sender, AudioMediaReceivedEventArgs e)
{
    CorrelationId.SetCurrentId(_correlationId);
    Log.Verbose(
        new CallerInfo(),
        LogContext.Media,
        "[{0}] [AudioMediaReceivedEventArgs(Data=<{1}>, Length={2}, Timestamp={3}, AudioFormat={4})]",
        this.Id,
        e.Buffer.Data.ToString(),
        e.Buffer.Length,
        e.Buffer.Timestamp,
        e.Buffer.AudioFormat);

    byte[] buffer = new byte[e.Buffer.Length];
    Marshal.Copy(e.Buffer.Data, buffer, 0, (int)e.Buffer.Length);

    // If recognition has completed with an error/timeout, the underlying stream
    // might have been swapped out and disposed, so ignore the ObjectDisposedException.
    try
    {
        this._recognitionStream.Write(buffer);
    }
    catch (ObjectDisposedException)
    {
        Log.Info(new CallerInfo(), LogContext.Media, $"[{this.Id}]: Write on recognitionStream threw ObjectDisposed");
    }
    catch (Exception ex)
    {
        Log.Error(new CallerInfo(), LogContext.Media, $"[{this.Id}]: Caught an exception while processing the audio buffer {ex}");
    }
    finally
    {
        e.Buffer.Dispose();
    }
}
```

adityaramgopal commented 5 years ago

Good to know it works. Thanks for sharing. As you said, you have to wait for stopRecognition.Task to complete before calling StopContinuousRecognitionAsync. I presume you mark this task as complete when you decide you no longer want to listen to incoming buffers, correct?
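
For other readers, here is a minimal, SDK-free sketch of that pattern, assuming `stopRecognition` is hoisted to a field so that call-teardown code can signal it. The class and member names (`RecognitionController`, `StopSignal`, `StopSpeechRecognition`) are invented for illustration:

```csharp
using System;
using System.Threading.Tasks;

class RecognitionController
{
    // The TaskCompletionSource created when recognition starts,
    // hoisted to a field so other members can complete it.
    private readonly TaskCompletionSource<int> _stopRecognition = new TaskCompletionSource<int>();

    // The recognition loop blocks on this via Task.WaitAny(...).
    public Task StopSignal => _stopRecognition.Task;

    // Called when the call drops or we no longer want incoming buffers.
    // TrySetResult returns false if the task was already completed,
    // so a double-stop is harmless.
    public void StopSpeechRecognition() => _stopRecognition.TrySetResult(0);
}
```

Completing the task releases `Task.WaitAny` in the recognition loop, which then goes on to call `StopContinuousRecognitionAsync`.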

scberr commented 5 years ago

Yes, that will end it. I keep recognising until the call drops.

I'm not sure if it is being cleaned up right on a call end though. Do I need to add anything to .Dispose?

Also (while I don't need to know technical details on how) is it possible to transfer the call? And is it also possible to give the bot a public phone number?

adityaramgopal commented 5 years ago

I believe you might need to Close the AudioInputStream before stopping recognition, but that's something you should check with the Cognitive Services Speech SDK team: https://github.com/Azure-Samples/cognitive-services-speech-sdk
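
A minimal, SDK-free sketch of that teardown order (close the audio stream first, then signal stop). `Stream` here stands in for the SDK's `PushAudioInputStream`, and the class/field names are assumptions, not the bot's actual code:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class CallCleanup : IDisposable
{
    // Stand-ins: in the real bot these would be the PushAudioInputStream
    // and the TaskCompletionSource from StartSpeechRecognitionv2.
    private readonly Stream _recognitionStream = new MemoryStream();
    private readonly TaskCompletionSource<int> _stopRecognition = new TaskCompletionSource<int>();

    public Task StopSignal => _stopRecognition.Task;

    public void Dispose()
    {
        try
        {
            // 1. Close the audio stream first so the recognizer sees
            //    end-of-stream and can drain any buffered audio.
            _recognitionStream.Close();
        }
        catch (ObjectDisposedException)
        {
            // The stream may already have been torn down by a failed recognition.
        }

        // 2. Then release Task.WaitAny in the recognition loop, which
        //    goes on to call StopContinuousRecognitionAsync.
        _stopRecognition.TrySetResult(0);
    }
}
```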