w3c / mediacapture-extensions

Extensions to Media Capture and Streams by the WebRTC Working Group
https://w3c.github.io/mediacapture-extensions/

Add MediaStreamTrack voice activity detection support. #153

Closed: jianjunz closed this 1 week ago

jianjunz commented 2 weeks ago

This change adds support for the voice activity detection (VAD) feature for audio MediaStreamTracks. It is only enabled when the voiceActivityDetection constraint is set to true.

With the voiceactivitydetected event, web applications can show a notification when the user is speaking while the audio track is muted.
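A minimal sketch of how a page might use the API proposed in this PR. The `voiceActivityDetection` constraint and `voiceactivitydetected` event are the names used in the PR, not part of any shipped browser; the notification callback and function names are illustrative.

```javascript
// Pure helper: returns an event handler that notifies only while the
// track is not delivering audio (muted by the UA or disabled by the app).
function makeVoiceActivityHandler(track, showNotification) {
  return () => {
    if (track.muted || !track.enabled) {
      showNotification("You appear to be speaking, but your mic is muted.");
    }
  };
}

// Browser-only wiring, assuming the constraint and event name from this PR.
async function watchForTalkingWhileMuted(showNotification) {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { voiceActivityDetection: true },
  });
  const [track] = stream.getAudioTracks();
  track.addEventListener(
    "voiceactivitydetected",
    makeVoiceActivityHandler(track, showNotification)
  );
  return track;
}
```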

Fixes #145.



youennf commented 2 weeks ago

I wonder whether this should actually be at MediaStreamTrack level. Maybe we do not need a constraint either.

Given this event would fire when the track is muted, the goal would be to unmute the track, which would be done via the MediaSession API. Moving this API to MediaSession makes some sense.

Maybe all we need is a new MediaSession voiceActivity action. Registering this handler would kick in the necessary UA logic to trigger this action.
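A sketch of what that suggestion could look like. The `"voiceactivity"` action name is hypothetical (it is not in the Media Session spec), and today's browsers reject unknown action names, so this only succeeds once a UA implements such an action:

```javascript
// Hypothetical shape of a new MediaSession voiceActivity action.
// Registering the handler is what would opt the page into UA-side
// voice activity detection.
function registerVoiceActivityAction(onVoiceWhileMuted) {
  if (typeof navigator === "undefined" || !("mediaSession" in navigator)) {
    return false; // MediaSession is not available in this environment
  }
  try {
    // Current browsers throw a TypeError for unrecognized action names,
    // so this call only succeeds in a UA that implements the action.
    navigator.mediaSession.setActionHandler("voiceactivity", onVoiceWhileMuted);
    return true;
  } catch {
    return false;
  }
}
```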

@jan-ivar, @guidou, thoughts?

jianjunz commented 1 week ago

I'm wondering when this action should be triggered if voiceActivity is moved from MediaStreamTrack to MediaSession:

  1. voice activity detection on default audio input device
  2. voice activity detection on any audio input devices
  3. voice activity detection on any audio input devices with MediaStreamTrack created

Options 1 and 2 may have privacy issues because users may not want applications to know about their behavior before they grant the "microphone" permission.

With the current AudioWorklet approach, applications can tell which track has voice activity. I personally believe applications only want to detect voice activity for a microphone whose MediaStreamTrack has been created and is muted, but I'm not sure whether any application applies VAD to other audio tracks.
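For reference, a minimal sketch of the AudioWorklet approach mentioned above. Real VAD is far more sophisticated; this just thresholds RMS energy per render block, and all names and the threshold value are illustrative:

```javascript
// Pure helper: root-mean-square of one block of samples.
function rms(samples) {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Worklet module source, registered in a browser via
// audioContext.audioWorklet.addModule(URL.createObjectURL(new Blob([...]))).
// RMS above a fixed threshold is reported to the main thread as
// "voice activity" for the track feeding this node.
const workletSource = `
  class VadProcessor extends AudioWorkletProcessor {
    process(inputs) {
      const channel = inputs[0][0];
      if (channel) {
        let sum = 0;
        for (const s of channel) sum += s * s;
        if (Math.sqrt(sum / channel.length) > 0.02) {
          this.port.postMessage("voiceactivity");
        }
      }
      return true; // keep the processor alive
    }
  }
  registerProcessor("vad-processor", VadProcessor);
`;
```

Because the worklet sits in the audio rendering path, its reports stay in sync with the audio data, at the cost of running script for every block.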

youennf commented 1 week ago

The privacy story should be the same whatever the API shape. I agree with having a voiceActivity MediaSession action only for contexts that have live (and muted) microphone MediaStreamTracks.

If we want to support multimicrophone cases, a deviceId could be exposed within MediaSessionActionDetails.
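If such a (hypothetical) `deviceId` field were added to MediaSessionActionDetails, the page could map it back to the affected capture track via MediaTrackSettings.deviceId. A sketch, with all fake track objects in the test standing in for real MediaStreamTracks:

```javascript
// Given the page's live audio tracks and a deviceId reported in a
// hypothetical MediaSessionActionDetails, find the matching track.
// MediaStreamTrack.getSettings().deviceId is real, shipped API.
function findTrackByDeviceId(tracks, deviceId) {
  return tracks.find((t) => t.getSettings().deviceId === deviceId) ?? null;
}
```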

I personally believe applications only want to detect voice activity for a microphone whose MediaStreamTrack has been created and is muted

Agreed for the scope of this specific API.

jan-ivar commented 1 week ago

I agree with having a voiceActivity MediaSession action only for contexts that have live (and muted) microphone MediaStreamTracks.

Moving this to media session makes sense to me as well.

youennf commented 1 week ago

@steimelchrome FYI

youennf commented 1 week ago

@jianjunz, would you be OK drafting a PR in the MediaSession WG? I can take over if you prefer.

guidou commented 1 week ago

Since this is intended to help the user unmute via the unmute button in the app, which would be done via MediaSession, it makes sense for this notification to come via MediaSession. Given that this is largely a MediaSession thing, I don't think we should require that a MediaStreamTrack be muted (although it most likely will be).

bradisbell commented 1 week ago

I do not think there is any sense in moving this to MediaSession. There are far more use cases for voice activity detection beyond letting the user know that they may be muted. A couple of use cases I would implement immediately if this API were available:

These use cases and others like them rely on the voice activity detection firing on the track.

Besides, even if it were moved to MediaSession, choosing the right capture track to trigger on is not possible at the user agent level. It's not uncommon to have several capture tracks, and the relevant captured track might even be "remote". (Think of cases where a second local device/screen/camera/mic is set up, connected via WebRTC but right there in the room.) Only the application truly knows what is what.

youennf commented 1 week ago

There are far more use cases for voice activity detection beyond letting the user know that they may be muted

This was discussed during the WebRTC WG meeting, and we think there are two use cases that deserve two different solutions.

The first use case is allowing the user to unmute when they are talking while muted. This PR is about that specific issue, and moving it to MediaSession seems good.

The second use case, in which you seem more interested, is exposing whether a live unmuted track contains voice. This needs more thought, as firing an event will always be more or less out of sync with the audio data. And it can already be implemented with an audio worklet (though less efficiently), where the extracted data is in sync with the audio. This use case seems more tied to MediaStreamTrack than MediaSession.

jianjunz commented 1 week ago

@jianjunz, would you be OK drafting a PR in the MediaSession WG? I can take over if you prefer.

Sure, I'll create a PR in the MediaSession WG. Thanks.

jianjunz commented 1 week ago

Closing this one as it has moved to MediaSession spec PR #333.