microsoft / cognitive-services-speech-sdk-js

Microsoft Azure Cognitive Services Speech SDK for JavaScript

Speaker Diarization with ConversationTranscriber not possible without audio samples #426

Closed matej-svejda closed 3 years ago

matej-svejda commented 3 years ago

I'm trying to transcribe conversations (mostly recordings of online meetings) with multiple participants and word level timestamps.

For this use case I don't have audio samples of the conversation participants. According to the docs this should be possible (https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/conversation-transcription says that "User voice samples are optional"), but I wasn't able to get it to work. I tried just creating a participant and leaving the voice parameter as null:

conversation.addParticipantAsync(sdk.Participant.From("speaker_1", "en-US", null), ...)

But I'm still getting "Unidentified" for the speaker id.

Here is the sample code https://gist.github.com/matej-svejda/cb63560263e80652e1b7d97ffded7053 and here is the input audio as a zip file sample_8.wav.zip.
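For reference, here is a condensed sketch of what the gist does (assumptions on my part: conversation transcription is set up through SpeechTranslationConfig in the SDK version I'm using, and the speaker label is exposed as result.speakerId; the gist itself is the authoritative version):

const fs = require("fs");
const sdk = require("microsoft-cognitiveservices-speech-sdk");

// Conversation transcription setup (subscriptionKey and region are placeholders).
const config = sdk.SpeechTranslationConfig.fromSubscription(subscriptionKey, region);
config.speechRecognitionLanguage = "en-US";

const audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync("sample_8.wav"));
const conversation = sdk.Conversation.createConversationAsync(config, "test_conversation");
const transcriber = new sdk.ConversationTranscriber(audioConfig);

transcriber.joinConversationAsync(conversation, () => {
    // No voice samples available, so the voice signature argument is left as null.
    conversation.addParticipantAsync(sdk.Participant.From("speaker_1", "en-US", null), () => {
        transcriber.transcribed = (_sender, event) => {
            // The speaker id comes back as "Unidentified" instead of Guest_0 / Guest_1.
            console.log(event.result.speakerId, event.result.text);
        };
        transcriber.startTranscribingAsync();
    });
});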

Am I using the API wrong? Or is this a bug?

glharper commented 3 years ago

@matej-svejda Thanks for using the Speech SDK, and for writing this issue up. That voice parameter is actually necessary for identification within the conversation, as it gives the transcription service an existing voice signature to try to identify audio against. Does that make sense?
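For a pre-enrolled speaker you would pass that signature as the third argument to Participant.From, roughly like this (voiceSignatureJson is only illustrative; it would come from the separate voice signature enrollment step, which isn't shown here):

// voiceSignatureJson: signature string obtained from voice signature enrollment (illustrative placeholder).
const participant = sdk.Participant.From("katie@example.com", "en-US", voiceSignatureJson);
conversation.addParticipantAsync(participant);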

matej-svejda commented 3 years ago

@glharper Thank you for your response. Are you sure this is a hard requirement? In the link I posted above it says:

User voice samples are optional. Without this input, the transcription will show different speakers, but shown as "Speaker1", "Speaker2", etc. instead of recognizing as pre-enrolled specific speaker names.

I wonder why it would be explicitly mentioned in the official documentation if it isn't possible. Also, all other transcription APIs I've worked with (AWS Transcribe, Google speech-to-text, Rev.ai) offer the ability to diarize speakers without voice samples. There are even open source libraries that are able to do that without transcribing the text, for example Resemblyzer as described here: https://medium.com/saarthi-ai/who-spoke-when-build-your-own-speaker-diarization-module-from-scratch-e7d725ee279 .

All this to say: I would be very surprised if the cognitive speech services wouldn't support this use case at all.

glharper commented 3 years ago

@matej-svejda There is indeed a gap between our documentation and the actual functionality here, and I completely understand the frustration you're expressing. @HeidiHanZhang Could you comment on this?

HeidiHanZhang commented 3 years ago

Hi @matej-svejda, diarizing different speakers is not the default behavior; you need to turn this feature on by adding one line of code: speechconfig.SetProperty("DifferentiateGuestSpeakers", "True"); We are working on updating our sample code, and it will be published soon. Thanks.
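In the JavaScript SDK the equivalent call should look roughly like this (applied to the config object before the transcriber is created):

// Enable diarization of guest speakers (off by default).
config.setProperty("DifferentiateGuestSpeakers", "True");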

matej-svejda commented 3 years ago

Hi @HeidiHanZhang, thank you for the response. I got some sort of diarization to work by setting "DifferenciateGuestSpeakers" to true (I think you have a typo in your comment, because you spelled "differentiate" with a "t" and I think it should be "differenciate" with a "c"). However, it is very inaccurate for the recording that I posted above, so I'm very much looking forward to the sample code in case I'm missing something.

matej-svejda commented 3 years ago

Hi @HeidiHanZhang, for me it didn't work with "DifferentiateGuestSpeakers", but it did work with "DifferenciateGuestSpeakers" (even though not very well). Also, here https://github.com/microsoft/cognitive-services-speech-sdk-js/blob/e61da5bff3529ca87e075594b684b20654327059/src/common.speech/Transcription/ConversationConnectionConfig.ts#L19 it is defined as "DifferenciateGuestSpeakers". Regarding the accuracy: I also observed that some labels were changed after a while, but it was still pretty inaccurate. Here is the output for the audio file used in my example:

Guest_1 Then I wake up my kids.
Guest_0 And we have breakfast together at about 8:30. Really, yes, we usually have something easy like bread and yogurt and fruit. I like to have coffee every morning. Whether I wake up at 6:00 or at 10, I'm still going to have coffee.
Guest_0 
Guest_1 Do you always eat breakfast every day?
Guest_1 If I don't eat breakfast, I'm so hungry.
Guest_0 What about lunch?
Guest_0 What time do you have lunch?
Guest_0 Lunch is the same every day for me. I always eat lunch at 12:30 PM.
Guest_0 So whenever I wake up, I do some things and then I always.

The correct labeling would be:

Guest_1 Then I wake up my kids.
Guest_1 And we have breakfast together at about 8:30.
Guest_0 Really
Guest_1 Yes, we usually have something easy like bread and yogurt and fruit.
Guest_0 I like to have coffee every morning. Whether I wake up at 6:00 or at 10, I'm still going to have coffee. But I often skip breakfast.
Guest_0 Do you always eat breakfast every day?
Guest_1 If I don't eat breakfast, I'm so hungry.
Guest_1 What about lunch?
Guest_1 What time do you have lunch?
Guest_0 Lunch is the same every day for me. I always eat lunch at 12:30 PM.
Guest_0 So whenever I wake up, I do some things and then I always.

So as you can see, there are quite a few mistakes.

But as I've said, it's still quite possible that I'm using the API wrong, so that's why I'm looking forward to the example you've mentioned.
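In the meantime I'm setting both spellings defensively, since the documentation and the SDK constant disagree (just a workaround sketch on my side, not from any official sample):

// Documented spelling and the spelling actually defined in ConversationConnectionConfig.ts.
config.setProperty("DifferentiateGuestSpeakers", "True");
config.setProperty("DifferenciateGuestSpeakers", "True");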

HeidiHanZhang commented 3 years ago

Hi @matej-svejda, I just talked with our engineering team. For the parameter, "differenciate" is the wrong word, but it seems we made a typo in the code, so it has become confusing; glad you found the one that currently works. As for your accuracy problem, there could be many possibilities, such as the quality of the input audio itself or a defect in the code. Would you mind sending your accuracy question to our email CTSFeatureCrew@microsoft.com so we can help you troubleshoot further? Thanks.