microsoft / psi

Platform for Situated Intelligence
https://github.com/microsoft/psi/wiki

Includes an audio stream for the mixed reality device's microphone using MediaCapture. #254

Closed · austinbhale closed this 2 years ago

austinbhale commented 2 years ago

StereoKit's microphone is unreliable at times: it can crash unexpectedly inside MiniAudio, the external audio library it uses. For more information on this crash, please see https://github.com/StereoKit/StereoKit/issues/449.

As an alternative, we can turn to Microsoft's MediaCapture API to gain access to the microphone's audio buffer. This pull request adds a new AudioSettings option to PhotoVideoCameraConfiguration.

A general use case would be:

```csharp
// Configure the video stream, with mixed reality capture enabled.
var videoStreamSettings = new PhotoVideoCameraConfiguration.StreamSettings
{
    FrameRate = 15,
    ImageWidth = 896,
    ImageHeight = 504,
    OutputEncodedImage = false,
    OutputEncodedImageCameraView = true,
    MixedRealityCapture = new MixedRealityCaptureVideoEffect(Windows.Media.Capture.MediaStreamType.Photo, 0.9f),
};

// Configure the new audio stream.
var audioStreamSettings = new PhotoVideoCameraConfiguration.AudioSettings
{
    OutputAudio = true,
};

var camera = new PhotoVideoCamera(pipe, new PhotoVideoCameraConfiguration
{
    VideoStreamSettings = videoStreamSettings,
    AudioStreamSettings = audioStreamSettings,
});
```

Not included in this pull request, but food for thought: it might be nice to rename the PhotoVideoCamera class to something more versatile, like MediaCapture. That way, it would be less confusing to request only an audio stream from such an instance.

ghost commented 2 years ago

CLA assistant check
All CLA requirements met.

sandrist commented 2 years ago

Thanks for this PR! The code looks good, but I'm wondering if it would be possible to actually separate this out into its own component? Perhaps called something like MediaCaptureMicrophone. It can then be more easily reusable in other places. Or does it require access to shared resources (and coordinated initialization) with the PhotoVideoCamera?

austinbhale commented 2 years ago

Great idea! Initially, I was worried about having multiple MediaCapture instances for the same device, but as long as the target streams don't interfere (one for audio, one for video), it isn't an issue.

Let me know if you can think of any more changes to add to this.

austinbhale commented 2 years ago

Thanks for the review! I implemented both of your ideas in https://github.com/microsoft/psi/pull/254/commits/2b4cc28b9a0f8b0474c97aae00d019b7252a4c65.

In addition to these changes, there may be cases where two audio channels are available and developers want to specify which channel to keep when downsampling to a single-channel format. For this use case, I added an AudioChannelNumber setting that grabs data for the requested channel number. If the requested channel number is out of range, an exception is thrown.

This could be useful for MediaCapture devices with greater physical distance between their microphones. For example, on the HoloLens 2, audio channel one is the top-left microphone and channel two is the top-center microphone.
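As an illustrative sketch of what selecting a single channel entails (this is not the PR's actual code; the helper name ExtractChannel and the interleaving assumption are mine), picking one channel out of an interleaved 32-bit float buffer might look like:

```csharp
// Hypothetical helper: extract a single channel from interleaved
// float PCM audio. Assumes frames are interleaved as
// [ch0, ch1, ch0, ch1, ...] for a 2-channel source.
public static float[] ExtractChannel(float[] interleaved, int channelCount, int channelNumber)
{
    if (channelNumber < 0 || channelNumber >= channelCount)
    {
        // Mirrors the behavior described above: out-of-range channel
        // numbers produce an exception rather than silent fallback.
        throw new ArgumentOutOfRangeException(nameof(channelNumber));
    }

    var frameCount = interleaved.Length / channelCount;
    var result = new float[frameCount];
    for (int i = 0; i < frameCount; i++)
    {
        result[i] = interleaved[(i * channelCount) + channelNumber];
    }

    return result;
}
```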

chitsaw commented 2 years ago

Thanks for your changes! Looking at the latest, it occurred to me that we don't actually need the MicrophoneConfiguration property in MediaCaptureMicrophoneConfiguration: SamplingInterval is never used, and we might not need AudioFormat either, since we currently only use it to determine whether or not to downsample to a single channel.

To make this component not assume a fixed audio format of 48 kHz, 2-channel, 32-bit floating-point PCM audio, we should really construct the AudioFormat that gets posted with the AudioBuffer from the AudioEncodingProperties of the received audio frame. You are already doing this to compute the sampling rate and number of channels for the captured audio.

To get the FormatTag for the WaveFormat, you could infer it from the AudioEncodingProperties.Subtype property, or it's probably fine for now to just assume IEEE 32-bit float PCM, since that is always going to be the case for HoloLens (maybe throw an exception if you see something else that you don't currently handle?). Also, if you're downsampling to 1 channel, then obviously you'd want to set WaveFormat.Channels to 1; otherwise it should equal numAudioChannels.
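A rough sketch of the approach being described (the method name and the WaveFormat.Create factory signature are my assumptions and should be checked against Microsoft.Psi.Audio; the AudioEncodingProperties members are from Windows.Media.MediaProperties):

```csharp
// Sketch: build the WaveFormat to post with each AudioBuffer from the
// received frame's encoding properties instead of assuming
// 48 kHz, 2-channel, 32-bit float PCM.
private static WaveFormat CreateOutputFormat(AudioEncodingProperties props, bool downsampleToSingleChannel)
{
    if (props.Subtype != "Float")
    {
        // Only IEEE float PCM is handled for now (always the case on HoloLens).
        throw new NotSupportedException($"Unsupported audio subtype: {props.Subtype}");
    }

    int channels = downsampleToSingleChannel ? 1 : (int)props.ChannelCount;
    int bytesPerSample = (int)props.BitsPerSample / 8;
    int blockAlign = channels * bytesPerSample;

    // Assumed factory taking the usual wave-header fields; verify the
    // exact signature in Microsoft.Psi.Audio's WaveFormat class.
    return WaveFormat.Create(
        WaveFormatTag.WAVE_FORMAT_IEEE_FLOAT,
        (int)props.SampleRate,
        (int)props.BitsPerSample,
        channels,
        blockAlign,
        (int)props.SampleRate * blockAlign);
}
```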

Hopefully this all makes sense?

austinbhale commented 2 years ago

I agree: the wave format should be generated from the audio source's properties via MediaCapture. I believe I addressed your points in the latest commit.

For inferring the subtype, I created a dictionary that maps the subtype strings to their respective WaveFormatTag values. For the tags that weren't one-to-one translatable (from https://github.com/microsoft/psi/blob/master/Sources/Audio/Microsoft.Psi.Audio/WaveFormatTag.cs), I set them to WAVE_FORMAT_UNKNOWN. Maybe someone more familiar with audio codecs could fill in the remaining mappings in the future?
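For illustration only (the exact subtype strings and tag names should be checked against Windows.Media.MediaProperties and the linked WaveFormatTag.cs; this is not the PR's actual table), such a mapping might look like:

```csharp
// Illustrative mapping from AudioEncodingProperties.Subtype strings to
// \psi WaveFormatTag values; subtypes without a one-to-one counterpart
// fall back to WAVE_FORMAT_UNKNOWN.
private static readonly Dictionary<string, WaveFormatTag> SubtypeToFormatTag =
    new Dictionary<string, WaveFormatTag>(StringComparer.OrdinalIgnoreCase)
    {
        { "PCM", WaveFormatTag.WAVE_FORMAT_PCM },
        { "Float", WaveFormatTag.WAVE_FORMAT_IEEE_FLOAT },
        { "MP3", WaveFormatTag.WAVE_FORMAT_MPEGLAYER3 },
        // No direct WaveFormatTag equivalent:
        { "AAC", WaveFormatTag.WAVE_FORMAT_UNKNOWN },
        { "FLAC", WaveFormatTag.WAVE_FORMAT_UNKNOWN },
        { "ALAC", WaveFormatTag.WAVE_FORMAT_UNKNOWN },
    };
```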