
Issue with Unmixed Audio in Psi and Echo Bots #695

Open · minimaximus opened this issue 8 months ago

minimaximus commented 8 months ago

So I'm trying to create an STT (speech-to-text) solution as part of a Teams bot. The bot joins the meeting and listens to the participants' audio. The audio is set to Unmixed = true so that each speaker gets their own channel. The solution is in C#. I tried this with both EchoBot and PsiBot, with similar results, and I upgraded both solutions to the latest Graph, Skype.Media, and .NET SDKs.
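For context, unmixed audio is enabled on the audio socket, roughly like this (a minimal sketch; Sendrecv and Pcm16K are the settings I believe the samples use):

using Microsoft.Skype.Bots.Media;

// Sketch: with ReceiveUnmixedMeetingAudio = true the bot receives
// per-speaker UnmixedAudioBuffers instead of one mixed stream.
var audioSettings = new AudioSocketSettings
{
    StreamDirections = StreamDirection.Sendrecv,
    SupportedAudioFormat = AudioFormat.Pcm16K,
    ReceiveUnmixedMeetingAudio = true,
};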

I receive the separate audio buffers in real-time, and I send them to a Cognitive Services class to recognize the speech. Every participant gets their own Cognitive Services recognizer class instance.
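The per-participant mapping is just a dictionary keyed by identity. A thread-safe sketch of that pattern (ConcurrentDictionary is my hypothetical variant; the actual code below uses a plain Dictionary, and botConfiguration/logger/callId are the fields from my handler):

using System.Collections.Concurrent;

// Hypothetical thread-safe variant of the per-participant recognizer map.
private readonly ConcurrentDictionary<string, CognitiveServicesService> langServices
    = new ConcurrentDictionary<string, CognitiveServicesService>();

private CognitiveServicesService GetOrCreateService(Identity identity)
{
    // GetOrAdd creates the recognizer instance once per speaker, even if
    // audio callbacks for the same speaker arrive on different threads.
    return langServices.GetOrAdd(
        identity.Id,
        _ => new CognitiveServicesService(identity, botConfiguration, logger, callId));
}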

My problem is that the audio reaches the recognizer and the Recognizing event fires consistently. However, the final Recognized event only fires sporadically, even after long periods of silence. I'm not sure where the issue is; any help is greatly appreciated.

Here's where the audio is sent to langServices:

if (audioFrame.UnmixedAudioBuffers != null)
{
    var tasks = new List<Task>();
    foreach (var buffer in audioFrame.UnmixedAudioBuffers)
    {
        // Copy the unmanaged buffer for this speaker into managed memory.
        var length = buffer.Length;
        var data = new byte[length];
        Marshal.Copy(buffer.Data, data, 0, (int)length);

        // Map the active speaker ID back to a meeting participant.
        var participant = CallHandler.GetParticipantFromMSI(this.callHandler.Call, buffer.ActiveSpeakerId);
        var identity = CallHandler.TryGetParticipantIdentity(participant);
        if (identity != null)
        {
            buffers.Add(identity.Id, (new AudioBuffer(data, audioFormat), audioFrameTimestamp));

            // Send to Cognitive Services to transcribe; one recognizer per participant.
            if (!langServices.ContainsKey(identity))
            {
                langServices.Add(identity, new CognitiveServicesService(identity, this.callHandler.botConfiguration, logger, this.callHandler.Call.Id));
            }

            tasks.Add(langServices[identity].AppendAudioBuffer(data));

            // try a new instance every time!
            //var c = new CognitiveServicesService(identity, this.callHandler.botConfiguration, logger, this.callHandler.Call.Id);
            //tasks.Add(c.AppendAudioBuffer(data));
        }
        else
        {
            this.logger.Warn($"Couldn't find participant for ActiveSpeakerId: {buffer.ActiveSpeakerId}");
        }
    }

    await Task.WhenAll(tasks);
}

Here are the relevant snippets from the CognitiveServicesService class:

public CognitiveServicesService(Identity identity, BotConfiguration settings, IGraphLogger logger, string callId)
{
    _logger = logger;
    _callId = callId;
    _identity = identity;

    _speechConfig = SpeechConfig.FromSubscription(settings.SpeechConfigKey, settings.SpeechConfigRegion);
    _speechConfig.SpeechSynthesisLanguage = settings.BotLanguage;
    _speechConfig.SpeechRecognitionLanguage = settings.BotLanguage;

    //_speechConfig.SetProperty(PropertyId.Speech_SegmentationSilenceTimeoutMs, "1000");
    //_speechConfig.SetProperty(PropertyId.SpeechServiceConnection_InitialSilenceTimeoutMs, "1000");

    var audioConfig = AudioConfig.FromStreamOutput(_audioOutputStream);
}
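Not shown above: _audioInputStream is a push stream created elsewhere in the class, roughly like this (a sketch assuming the Teams PCM format of 16 kHz, 16-bit, mono):

using Microsoft.CognitiveServices.Speech.Audio;

// Sketch of the push stream that feeds the recognizer. The Teams media
// stack delivers PCM at 16 kHz, 16-bit, mono, so the declared format
// must match or the recognizer mis-reads the audio.
private readonly PushAudioInputStream _audioInputStream =
    AudioInputStream.CreatePushStream(AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1));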

public async Task AppendAudioBuffer(byte[] audioBuffer)
{
    //RealtimeTranscriptionHelper.TranscribeAsync(audioBuffer, _speechConfig, _logger);
    if (!_isRunning)
    {
        Start();
        await ProcessSpeech();
    }

    try
    {
        _audioInputStream.Write(audioBuffer);
    }
    catch (Exception e)
    {
        _logger.Log(System.Diagnostics.TraceLevel.Info, e, "Exception happened writing to input stream");
    }
}

private async Task ProcessSpeech()
{
 try
 {

 var stopRecognition = new TaskCompletionSource<int>();

 using (var audioInput = AudioConfig.FromStreamInput(_audioInputStream))
 {
     if (_recognizer == null)
     {
         _logger.Log(System.Diagnostics.TraceLevel.Info, "init recognizer");
         _recognizer = new SpeechRecognizer(_speechConfig, audioInput);

     }
 }

 _recognizer.SpeechStartDetected += async (s, e) =>
 {
     Console.WriteLine($"Speech Start Detected. Offset: {e.Offset}");
 };

 _recognizer.SpeechEndDetected += async (s, e) =>
 {
     Console.WriteLine($"Speech End Detected. Offset: {e.Offset}");
 };

 _recognizer.Recognizing += (s, e) =>
 {
     string msg = $"RECOGNIZING: Text={e.Result.Text}";
     _logger.Log(System.Diagnostics.TraceLevel.Info, msg);
     Console.WriteLine(msg);
 };

 _recognizer.Recognized += async (s, e) =>
 {
     if (e.Result.Reason == ResultReason.RecognizedSpeech)
     {
         if (string.IsNullOrEmpty(e.Result.Text))
             return;

         // We recognized the speech
         var msg = $"'timestamp': '{DateTime.Now}', 'speaker': '{_identity.DisplayName}', 'text': '{ e.Result.Text}'";

         Console.WriteLine(msg);
         _logger.Log(System.Diagnostics.TraceLevel.Info, $"***Recognized***: {msg}");

     }
     else if (e.Result.Reason == ResultReason.NoMatch)
     {
         _logger.Log(System.Diagnostics.TraceLevel.Info, $"NOMATCH: Speech could not be recognized.");
     }
 };

 _recognizer.Canceled += (s, e) =>
 {
     _logger.Log(System.Diagnostics.TraceLevel.Info, $"CANCELED: Reason={e.Reason}");

     if (e.Reason == CancellationReason.Error)
     {
         _logger.Log(System.Diagnostics.TraceLevel.Info, $"CANCELED: ErrorCode={e.ErrorCode}");
         _logger.Log(System.Diagnostics.TraceLevel.Info, $"CANCELED: ErrorDetails={e.ErrorDetails}");
         _logger.Log(System.Diagnostics.TraceLevel.Info, $"CANCELED: Did you update the subscription info?");
     }

     stopRecognition.TrySetResult(0);
 };

 _recognizer.SessionStarted += async (s, e) =>
 {
     _logger.Log(System.Diagnostics.TraceLevel.Info, "\nSession started event.");
 };

 _recognizer.SessionStopped += (s, e) =>
 {
     _logger.Log(System.Diagnostics.TraceLevel.Info, "\nSession stopped event.");
     _logger.Log(System.Diagnostics.TraceLevel.Info, "\nStop recognition.");
     stopRecognition.TrySetResult(0);
 };

 // Starts continuous recognition. Uses StopContinuousRecognitionAsync() to stop recognition.
 await _recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);

 // Waits for completion.
 // Use Task.WaitAny to keep the task rooted.
 Task.WaitAny(new[] { stopRecognition.Task });

 // Stops recognition.
 await _recognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);

 }
 catch (ObjectDisposedException ex)
 {
     _logger.Log(System.Diagnostics.TraceLevel.Error, ex, "The queue processing task object has been disposed.");
 }
 catch (Exception ex)
 {
     // Catch all other exceptions and log them.
     _logger.Log(System.Diagnostics.TraceLevel.Error, ex, "Caught Exception");
 }
}
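One aside on the wait above: Task.WaitAny blocks the calling thread, so AppendAudioBuffer's await ProcessSpeech() doesn't return until recognition stops. A non-blocking sketch of the same shutdown wait (my rearrangement, not the sample's code):

// Awaiting the TaskCompletionSource instead of Task.WaitAny lets the
// audio callback thread keep pumping buffers while recognition runs.
await _recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
await stopRecognition.Task.ConfigureAwait(false);
await _recognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);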

The expected behavior is that the Recognized event fires after a brief period of silence from the speaker on that channel, but in reality the Recognizing event keeps firing until some seemingly random time. This behavior is not observed when unmixed audio is set to false, but that's not the desired setup.
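For reference, the silence windows that control when a phrase is finalized can be tuned on the SpeechConfig (a sketch; the 500 ms values are examples, not recommendations):

// Sketch: how long the service waits in silence before it finalizes a
// phrase and fires Recognized.
_speechConfig.SetProperty(PropertyId.Speech_SegmentationSilenceTimeoutMs, "500");
_speechConfig.SetProperty(PropertyId.SpeechServiceConnection_EndSilenceTimeoutMs, "500");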

1fabi0 commented 8 months ago

Can you try recording the audio? I know some time back there was an issue with the decoder (or something similar) that returned a pattern in the audio data instead of silence, and that could still be the case here. Try recording the audio data to a file and listen for noise where you expect silence; then check for the pattern and replace such packets with a real empty packet (the audio packets also carry a flag indicating whether the packet is silence).
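A minimal sketch of that diagnostic, assuming the Microsoft.Skype.Bots.Media IsSilence flag (DumpAndNormalize is a hypothetical helper name; wire it into the UnmixedAudioBuffers loop above):

using System.IO;
using Microsoft.CognitiveServices.Speech.Audio;

// Hypothetical helper: append each speaker's raw PCM (16 kHz, 16-bit,
// mono) to a file so you can listen for a repeating pattern where you
// expect silence, and push real zeros to the recognizer when the frame
// is flagged as silence. Whether unmixed buffers expose IsSilence the
// same way as mixed frames is an assumption here.
static void DumpAndNormalize(uint speakerId, byte[] data, bool isSilence, PushAudioInputStream recognizerInput)
{
    using (var dump = new FileStream($"speaker-{speakerId}.pcm", FileMode.Append, FileAccess.Write))
    {
        dump.Write(data, 0, data.Length);
    }

    // Forward true zeros instead of whatever pattern the decoder produced.
    recognizerInput.Write(isSilence ? new byte[data.Length] : data);
}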