Azure Kinect - Record audio for the Microsoft Cognitive Service - Conversation Transcription #97

I have a question about PSI and the Azure Kinect DK. I would like to use PSI to record audio directly from the Azure Kinect DK and save it as a WAV file. The reason is that we want to use the Azure Kinect device with the Conversation Transcription service:

The service requires a device that meets the following specification: a 7-mic circular multi-microphone array with a playback reference stream. The Azure Kinect meets these requirements.

As I have seen in the PSI documentation, we could use AudioCapture to record the audio and store it with WaveFileWriter.

The problem is that I was not able to produce an audio file with the correct channels. As you can see in the example from the Conversation Transcription service, the audio file has to include 8 channels (7 + 1 silent channel).

var wavfileStream = Helper.OpenWavFile("16kHz16Bits8channelsOfRecordedPCMAudio.wav");

With PSI (AudioCapture, WaveFileWriter) I was not able to create a corresponding WAV file. So I had to use Audacity to record the audio from the Azure Kinect (7 channels) and add another silent channel to it.

Is there a way to achieve the same result with PSI?

Any help would be great!

Try using the following when instantiating the AudioCapture component:

new AudioCapture(p, WaveFormat.CreatePcm(16000, 16, 8))

That should produce an audio stream with 8 channels.

First, thank you very much for your fast response; it worked. The sample you gave me produced a valid audio file for the Conversation Transcription service.

But I have one more question about the recorded file, if that's OK.

If I compare the results from PSI and Audacity, I see some differences.

PSI: image

Audacity: image

As you can see, with Audacity channels 1-7 contain voice and the last one contains silence. With PSI, only channels 1-6 contain voice and the last two channels are silent.

Do you know if there is any reason for that, or am I doing something wrong?

Thanks again.

We are seeing the same thing as you. We'll need to investigate further.
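For anyone who wants to put numbers on the difference instead of comparing screenshots, here is a minimal sketch that reports the peak amplitude of each channel. It assumes the input is the raw sample data of the recording (the contents of the WAV `data` chunk, header already stripped) as interleaved 16-bit little-endian PCM. `ChannelInspector` and `PeakPerChannel` are hypothetical names used for illustration, not part of \psi.

```csharp
using System;

public static class ChannelInspector
{
    // Returns the peak absolute sample value of each channel.
    // Assumes interleaved 16-bit little-endian PCM with the WAV header removed.
    public static int[] PeakPerChannel(byte[] pcm, int channels)
    {
        if (pcm == null) throw new ArgumentNullException(nameof(pcm));
        if (channels <= 0) throw new ArgumentOutOfRangeException(nameof(channels));

        int frameBytes = channels * 2;        // 2 bytes per 16-bit sample
        int frames = pcm.Length / frameBytes; // a trailing partial frame is ignored
        var peaks = new int[channels];

        for (int f = 0; f < frames; f++)
        {
            for (int c = 0; c < channels; c++)
            {
                int offset = f * frameBytes + c * 2;
                short sample = (short)(pcm[offset] | (pcm[offset + 1] << 8));
                int magnitude = Math.Abs((int)sample);
                if (magnitude > peaks[c])
                {
                    peaks[c] = magnitude;
                }
            }
        }

        return peaks;
    }
}
```

A channel whose peak stays at or near 0 over a recording that contains speech is effectively silent, so running this over both the PSI and the Audacity recording with `channels` set to 8 should show which channel indices carry signal in each.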

Thanks for your reply. It would be great if you could find out something about this issue.

Best regards

Hey there, I'm interested in the same type of thing, but now in 2022. :-)

I don't really want to create .wav files to be processed. I just want to stream the audio from a single (at the moment) Azure Kinect, have the Azure Kinect produce the 7 channels from its 7 mics, and somehow inject a blank 8th channel. That way, people speaking in a room could have their conversation transcribed. Is this a solved problem? Currently I'm not using anything from the Azure Kinect DK's SDK; I just plugged the USB device in and am trying to use it with the Cognitive Services - Conversation Transcription quickstart. The quickstart works great with the KatieAndSteve.wav file, so I know I'm generally configured right. And the Azure Kinect works fine with the speech to text from microphone quickstart. I just can't figure out how to make Conversation Transcription accept the Azure Kinect as an input microphone.

I made these simple changes to the conversation transcription quickstart so that it would use the Azure Kinect for input:

// Line 83 in Program.cs
config.SetProperty("DifferentiateGuestSpeakers", "true"); // We only have guests.  Not looking to identify who it actually was speaking, just differentiate speakers.

// Line 88:
//using (var audioInput = AudioStreamReader.OpenWavFile(conversationWaveFile))
using (var audioInput = AudioConfig.FromDefaultMicrophoneInput())  // Use microphone instead of .wav file

But it results in this:

CANCELED: Reason=Error
CANCELED: ErrorCode=RuntimeError
CANCELED: ErrorDetails=Exception with an error code: 0x1b (SPXERR_RUNTIME_ERROR) SessionId: 0cba61aa7a08426b82d87e77bd9e03b3

I assume it's because the Azure Kinect doesn't produce the expected input format. :-/

How might one transform the Azure Kinect's audio output to what is required for Conversation Transcription to ingest it?

Perhaps just a blank 8th channel needs to be added to it? If so, is that a solved problem? It would also seem that each channel needs to be mono 16-bit PCM sampled at 16 kHz.
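The channel-padding step itself can be sketched in a few lines. This is only an illustration, under the assumption that the 7 microphone channels are already available as interleaved 16-bit PCM at 16 kHz; capturing them from the device is not covered here. `ChannelPadder` and `AppendSilentChannel` are hypothetical names, not part of any SDK.

```csharp
using System;

public static class ChannelPadder
{
    // Appends one silent channel to every frame of interleaved 16-bit PCM.
    // For 7-channel input the result is 8-channel audio whose last channel is all zeros.
    public static byte[] AppendSilentChannel(byte[] pcm, int inputChannels)
    {
        if (pcm == null) throw new ArgumentNullException(nameof(pcm));
        if (inputChannels <= 0) throw new ArgumentOutOfRangeException(nameof(inputChannels));

        int inFrameBytes = inputChannels * 2; // 2 bytes per 16-bit sample
        if (pcm.Length % inFrameBytes != 0)
        {
            throw new ArgumentException("Buffer must contain whole frames.", nameof(pcm));
        }

        int frames = pcm.Length / inFrameBytes;
        int outFrameBytes = inFrameBytes + 2; // room for the extra channel

        // New arrays are zero-initialized, so the added channel is silent by default.
        var output = new byte[frames * outFrameBytes];
        for (int f = 0; f < frames; f++)
        {
            Buffer.BlockCopy(pcm, f * inFrameBytes, output, f * outFrameBytes, inFrameBytes);
        }

        return output;
    }
}
```

The padded buffers could then be written to a Speech SDK push stream created with `AudioStreamFormat.GetWaveFormatPCM(16000, 16, 8)` and wrapped with `AudioConfig.FromStreamInput(...)`. Whether Conversation Transcription accepts live input assembled this way is not confirmed anywhere in this thread.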

Are you attempting to use \psi for this work? We do not currently have a component for Conversation Transcription. I believe the original issue above involved using the \psi Azure Kinect component to create wav files in such a way that they could then be used by the Conversation Transcription service outside of \psi.

I'm not familiar with how Conversation Transcription works exactly, but at first glance, I suspect that there might be a different flavor of AudioConfig.FromDefaultMicrophoneInput() that you need to find, to produce audio in the right format. Or perhaps that method takes in some optional configuration parameters?...

Good point. Stupid question: what’s psi?

I did see this: AudioConfig.FromDefaultMicrophoneInput(AudioProcessingOptions)

It would let me set various processing options that I assume describe what type of microphone is being used. I'm not sure it would make the Azure Kinect suddenly produce 7+1 channels with the +1 being silent (which, from the reports here, seems to be what it should produce).