microsoft / psi

Platform for Situated Intelligence
https://github.com/microsoft/psi/wiki

Azure Kinect - Record audio for the Microsoft Cognitive Service - Conversation Transcription #97

Open tteichmeister opened 3 years ago

tteichmeister commented 3 years ago

Hello,

I have a question about PSI and the Azure Kinect DK. I would like to use PSI to record audio directly from the Azure Kinect DK and save it as a wav file. The reason is that we want to use the Azure Kinect device for the conversation transcription service: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/conversation-transcription

The service requires a device that meets the following specification: a 7-mic circular multi-microphone array with a playback reference stream. The Azure Kinect meets this requirement.

As I have seen in the PSI documentation, we could use the AudioCapture component to record the audio (and store it with the WaveFileWriter).

The problem is that I was not able to produce an audio file with the correct channels. As you can see in the example from the conversation transcription service, the audio file has to include 8 channels (7 + 1 silent channel):

var wavfileStream = Helper.OpenWavFile("16kHz16Bits8channelsOfRecordedPCMAudio.wav");

Using PSI (AudioCapture, WaveFileWriter) I was not able to create a corresponding wav file. So I had to use Audacity to record the audio with the Azure Kinect (7 channels) and add another silent channel to it.

Is there a way to achieve the same result with PSI?

Any help would be great!

chitsaw commented 3 years ago

Try using the following when instantiating the AudioCapture component:

new AudioCapture(p, WaveFormat.CreatePcm(16000, 16, 8))

That should produce an audio stream with 8 channels.
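
A minimal sketch of how that format could be wired into a capture-to-wav pipeline (this assumes the Microsoft.Psi and Microsoft.Psi.Audio packages and the WaveFileWriter mentioned above; the output file name is just an example):

using System;
using Microsoft.Psi;
using Microsoft.Psi.Audio;

class Program
{
    static void Main()
    {
        using (var p = Pipeline.Create())
        {
            // Request 16 kHz, 16-bit PCM with 8 channels from the capture device.
            var audio = new AudioCapture(p, WaveFormat.CreatePcm(16000, 16, 8));

            // Persist the captured buffers to a wav file (file name is an example).
            var writer = new WaveFileWriter(p, "AzureKinectCapture.wav");
            audio.PipeTo(writer);

            // Record until a key is pressed; disposing the pipeline finalizes the wav file.
            p.RunAsync();
            Console.ReadKey();
        }
    }
}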

tteichmeister commented 3 years ago

First, thank you very much for your fast response; it worked. The sample you gave me produced a valid audio file for the conversation transcription service.

But I have one more question about the recorded file, if that's ok.

If I compare the results we got from PSI and Audacity, we see some differences.

PSI: (screenshot of recorded channels)

Audacity: (screenshot of recorded channels)

As you can see, when we use Audacity, channels 1-7 contain voice and the last channel is silent. When we use PSI, only channels 1-6 contain voice and the last two channels are silent.

Do you know if there is any reason for that, or am I doing something wrong?

Thanks again.

chitsaw commented 3 years ago

We are seeing the same thing as you. We'll need to investigate further.

tteichmeister commented 3 years ago

Thanks for your reply. It would be great if you could find out something about this issue.

Best regards

danieljlevine commented 2 years ago

Hey there, I'm interested in the same type of thing, but now in 2022. :-)

I don't really want to create .wav files to be processed; I just want to stream the audio from a single (at the moment) Azure Kinect, have it produce the 7 channels from its 7 mics, and somehow inject a blank 8th channel. That way, people speaking in a room could have their conversation transcribed. Is this a solved problem? Currently I'm not using anything from the Azure Kinect DK's SDK; I just plugged the USB device in and am trying to use it with the Cognitive Services - Conversation Transcription quickstart. The quickstart works great with the KatieAndSteve.wav file, so I know I'm generally configured right. And the Azure Kinect works fine with the speech-to-text from microphone quickstart. I just can't figure out how to make conversation transcription accept the Azure Kinect as an input microphone.

I made these simple changes to the conversation transcription quickstart so that it would use the Azure Kinect for input:

// Line 83 in Program.cs
config.SetProperty("DifferentiateGuestSpeakers", "true"); // We only have guests. Not looking to identify who was actually speaking, just to differentiate speakers.

// Line 88:
//using (var audioInput = AudioStreamReader.OpenWavFile(conversationWaveFile))
using (var audioInput = AudioConfig.FromDefaultMicrophoneInput())  // Use microphone instead of .wav file

But it results in this:

CANCELED: Reason=Error
CANCELED: ErrorCode=RuntimeError
CANCELED: ErrorDetails=Exception with an error code: 0x1b (SPXERR_RUNTIME_ERROR) SessionId: 0cba61aa7a08426b82d87e77bd9e03b3

I assume it's because the Azure Kinect doesn't produce the expected input format. :-/

How might one transform the Azure Kinect's audio output to what is required for Conversation Transcription to ingest it?

Perhaps just a blank 8th channel needs to be added to it? If so, is that a solved problem? It would also seem that each channel needs to be PCM, 16-bit, sampled at 16 kHz, monaural.
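
If it does come down to padding a silent channel, a rough sketch of that transformation (purely illustrative; the helper name is made up and this assumes interleaved 16-bit PCM buffers) could be:

using System;

static class AudioChannelPadding
{
    // Hypothetical helper (not a Psi or Speech SDK API): expands interleaved
    // 16-bit PCM from 7 channels to 8 by appending one silent sample per frame.
    public static byte[] PadWithSilentChannel(byte[] sevenChannelPcm)
    {
        const int bytesPerSample = 2;   // 16-bit PCM
        const int inChannels = 7;
        const int outChannels = 8;
        int frameCount = sevenChannelPcm.Length / (bytesPerSample * inChannels);
        var output = new byte[frameCount * bytesPerSample * outChannels];

        for (int frame = 0; frame < frameCount; frame++)
        {
            // Copy the 7 real samples of this frame; the 8th slot stays zero (silence).
            Array.Copy(
                sevenChannelPcm, frame * inChannels * bytesPerSample,
                output, frame * outChannels * bytesPerSample,
                inChannels * bytesPerSample);
        }

        return output;
    }
}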

sandrist commented 2 years ago

Are you attempting to use \psi for this work? We do not currently have a component for Conversation Transcription. I believe the original issue above involved using the \psi Azure Kinect component to create wav files in such a way that they could then be used by the Conversation Transcription service outside of \psi.

I'm not familiar with how Conversation Transcription works exactly, but at first glance, I suspect that there might be a different flavor of AudioConfig.FromDefaultMicrophoneInput() that you need to find, to produce audio in the right format. Or perhaps that method takes in some optional configuration parameters?...
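One possible direction (just a sketch, not something verified against the service) would be to skip the default microphone input entirely and push buffers in the required format through the Speech SDK's stream input:

using Microsoft.CognitiveServices.Speech.Audio;

// Declare the format the service expects (16 kHz, 16-bit, 8 channels) and feed it
// yourself, e.g. with buffers captured elsewhere and padded to 8 channels.
var format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 8);
var pushStream = AudioInputStream.CreatePushStream(format);
var audioInput = AudioConfig.FromStreamInput(pushStream);

// Whenever a capture buffer arrives: pushStream.Write(audioBytes);
// When capture ends: pushStream.Close();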

danieljlevine commented 2 years ago

Good point. Stupid question: what's psi?

danieljlevine commented 2 years ago

I did see this: AudioConfig.FromDefaultMicrophoneInput(AudioProcessingOptions)

It would let me set various processing options that I assume describe what type of microphone is being used. I'm not sure it would make the Azure Kinect produce 7+1 channels, with the +1 being silent (which seems to be what people report it should be here).
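
In case it helps, a speculative sketch of that overload (the flag and geometry values are guesses at what a 7-mic circular array like the Azure Kinect would need; not verified against the Conversation Transcription service):

using Microsoft.CognitiveServices.Speech.Audio;

// Speculative: enable default audio processing and describe the mic array as a
// 7-element circular geometry, then hand the options to the default microphone input.
var processingOptions = AudioProcessingOptions.Create(
    AudioProcessingConstants.AUDIO_INPUT_PROCESSING_ENABLE_DEFAULT,
    PresetMicrophoneArrayGeometry.Circular7);

var audioInput = AudioConfig.FromDefaultMicrophoneInput(processingOptions);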