microsoftgraph / microsoft-graph-comms-samples

Microsoft Graph Communications Samples
MIT License

Transcribe 'UnmixedMeetingAudio' buffer as soon as those are received. #430

Open muradhaider5786 opened 3 years ago

muradhaider5786 commented 3 years ago

This is a question to the community rather than an issue report. I am able to receive unmixed audio in the AudioMediaReceived event handler inside the CallHandler.cs class during a P2P or group call/online meeting. Now I want to generate a transcript from it, and I would like some guidance on the following:

  1. Can this be done at the end of the meeting? I am confused, as it is stated in a number of places that

    "You may not use this SDK to record or otherwise persist media content from calls or meetings that your bot accesses."

In which scenarios does this apply? Is it possible to generate a wave file out of the byte array (UnmixedAudioBuffer data) received and use that for transcription at the end of the meeting?

  2. Since we receive audio frames at a very quick pace (50 frames per second), if I am to generate the transcript asynchronously and in real time, how should I do it? Is it a matter of converting the buffer data to .wav or an in-memory stream and using some cloud speech service/API? There is a speech service provided by Microsoft, as well as a speech-to-text resource provided by Google Cloud. Your guidance is highly appreciated.
1fabi0 commented 3 years ago
  1. Yes, you can save the PCM data, or pack it into WAV files, so you can later upload these to create a transcript.
  2. You have to copy the data with Marshal synchronously and then dispose the buffer; after that you can handle the audio live. For easy tasks like uploading to a speech service, it is pretty simple using Task.Run. The pace of 50 frames per second is something you have to live with in real-time media processing, but I think it is not too rapid a pace, since nearly all computers process much faster than that, so it is very easy to handle.
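The pattern described above (synchronous copy, then asynchronous processing) could look roughly like this. This is a hedged sketch: the handler signature and the `UnmixedAudioBuffer` members (`Data`, `Length`, `ActiveSpeakerId`) are taken from the Graph Communications media SDK, while `PushToSpeechServiceAsync` is a hypothetical method standing in for whatever speech upload the bot performs.

```csharp
// Sketch of an AudioMediaReceived handler: copy out of unmanaged memory
// synchronously, then offload slow work so the media callback is not blocked.
private void OnAudioMediaReceived(object sender, AudioMediaReceivedEventArgs e)
{
    try
    {
        foreach (var unmixed in e.Buffer.UnmixedAudioBuffers ?? Array.Empty<UnmixedAudioBuffer>())
        {
            // Copy before the unmanaged buffer is disposed.
            var managed = new byte[unmixed.Length];
            Marshal.Copy(unmixed.Data, managed, 0, (int)unmixed.Length);

            // Offload the upload; PushToSpeechServiceAsync is a placeholder.
            _ = Task.Run(() => PushToSpeechServiceAsync(unmixed.ActiveSpeakerId, managed));
        }
    }
    finally
    {
        // The SDK requires the buffer to be disposed once handled.
        e.Buffer.Dispose();
    }
}
```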
ssulzer commented 3 years ago

Hi @muradhaider5786 Unless you are developing a Policy Recording bot application (https://docs.microsoft.com/en-us/graph/api/call-updaterecordingstatus?view=graph-rest-1.0&tabs=http), your bot may not persist media, either audio/video or media which has been transformed from audio/video (like a transcript). So creating a recording or saving a transcript is against the license terms unless you have some written permission from Microsoft. cc: @jsandoval-msft, @zhengni-msft

muradhaider5786 commented 3 years ago

Thanks for the response, guys, really appreciated. @ssulzer I have created and assigned a policy to the bot app following this document: https://docs.microsoft.com/en-us/microsoftteams/teams-recording-policy#compliance-recording-policy-assignment-and-provisioning - is this what I need so that media may persist? @1fabi0: I am adding the byte arrays from all received buffers to a generic list of byte arrays, and at the end of the call I pass that list to my methods that use the speech-to-text service (Microsoft Speech Service). But for the meeting organizer I am receiving empty text even though the byte arrays are not empty, and for the other participant I am getting incorrect results.

1fabi0 commented 3 years ago

I think you are mixing up multiple streams, and that is why you get incorrect results. Please make sure you are adding everything up in the correct order and distinguishing the users (streams) correctly, and also that you are transmitting the audio packets in the correct order, etc.

muradhaider5786 commented 3 years ago

Thanks for the quick response. I add the byte arrays to a list as soon as they are received in the event handler. I have pasted the code in another question, but just for reference:

byte[] managedByteArray = new byte[e.Buffer.UnmixedAudioBuffers[0].Length];
int length = (int)e.Buffer.UnmixedAudioBuffers[0].Length;
Marshal.Copy(e.Buffer.UnmixedAudioBuffers[0].Data, managedByteArray, 0, length);
this.AudioByteArrayList.Add(managedByteArray);

How can I ensure that the streams do not mix up? Thanks!
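As an aside on the WAV suggestion earlier in the thread: the media platform delivers 16 kHz, 16-bit, mono PCM, and wrapping such a buffer in a minimal WAV container is straightforward. A sketch (pure .NET, no SDK types; the default parameters assume the format above):

```csharp
// Wrap raw PCM in a minimal 44-byte RIFF/WAV header so a speech
// service can consume it as a .wav file.
static byte[] PcmToWav(byte[] pcm, int sampleRate = 16000, short bitsPerSample = 16, short channels = 1)
{
    using var ms = new MemoryStream();
    using var w = new BinaryWriter(ms);
    int byteRate = sampleRate * channels * bitsPerSample / 8;
    short blockAlign = (short)(channels * bitsPerSample / 8);

    w.Write(System.Text.Encoding.ASCII.GetBytes("RIFF"));
    w.Write(36 + pcm.Length);          // RIFF chunk size
    w.Write(System.Text.Encoding.ASCII.GetBytes("WAVEfmt "));
    w.Write(16);                       // fmt chunk size (PCM)
    w.Write((short)1);                 // audio format: 1 = PCM
    w.Write(channels);
    w.Write(sampleRate);
    w.Write(byteRate);
    w.Write(blockAlign);
    w.Write(bitsPerSample);
    w.Write(System.Text.Encoding.ASCII.GetBytes("data"));
    w.Write(pcm.Length);               // data chunk size
    w.Write(pcm);
    return ms.ToArray();
}
```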

1fabi0 commented 3 years ago

The unmixed buffer contains an audio source ID, which is the ID of the audio socket of the user. Also note that for channel meetings, the organizer's audio ends up in the mixed audio buffer.
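One way to apply this, keyed on the buffer's speaker ID so that different participants' streams never interleave. This is a sketch: `UnmixedAudioBuffer` and its `ActiveSpeakerId`, `Data`, and `Length` members come from the Graph Communications media SDK, and the accumulator type is an assumption.

```csharp
// One PCM accumulator per speaker, keyed by the unmixed buffer's speaker ID.
private readonly ConcurrentDictionary<uint, List<byte[]>> _audioBySpeaker = new();

private void Collect(UnmixedAudioBuffer unmixed)
{
    var managed = new byte[unmixed.Length];
    Marshal.Copy(unmixed.Data, managed, 0, (int)unmixed.Length);

    // Frames arrive in order per speaker on the media callback, so a plain
    // List is enough as long as Collect runs on that single callback thread.
    _audioBySpeaker
        .GetOrAdd(unmixed.ActiveSpeakerId, _ => new List<byte[]>())
        .Add(managed);
}
```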

muradhaider5786 commented 3 years ago

Yes, it's in the context of a channel meeting. There are two participants: one is the organizer and one is a (guest) participant. The organizer is put on mute before the bot is added, so the bot just captures the participant's audio. I have confirmed this by looking into the received buffers, as zero buffers are received for the organizer (I am filtering out the silent buffers). Given the list of the participant's byte arrays, speech-to-text isn't returning any text, or returns incorrect text.

1fabi0 commented 3 years ago

It seems to be a problem with your speech-to-text service. Make sure you selected the correct language and accent (en-US vs. en-GB, etc.), and that you are pushing the audio correctly.
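For the Microsoft service specifically, "pushing the audio correctly" usually means declaring the exact PCM format and locale up front. A sketch using the Azure Cognitive Services Speech SDK (the `Microsoft.CognitiveServices.Speech` NuGet package); the key, region, and locale are placeholders:

```csharp
// Continuous recognition over a push stream; the format must match the
// bot's 16 kHz, 16-bit, mono PCM or results will be empty or garbled.
var format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);
var push = AudioInputStream.CreatePushStream(format);
var audioConfig = AudioConfig.FromStreamInput(push);

var speechConfig = SpeechConfig.FromSubscription("<key>", "<region>");
speechConfig.SpeechRecognitionLanguage = "en-US"; // pick the participants' locale

using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);
recognizer.Recognized += (s, e) => Console.WriteLine(e.Result.Text);

await recognizer.StartContinuousRecognitionAsync();
// For each received frame: push.Write(managedByteArray);
// At the end of the call:
// push.Close();
// await recognizer.StopContinuousRecognitionAsync();
```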