w3c / mediacapture-extensions

Extensions to Media Capture and Streams by the WebRTC Working Group
https://w3c.github.io/mediacapture-extensions/

Solve user agent camera/microphone double-mute #39

Closed jan-ivar closed 1 month ago

jan-ivar commented 3 years ago

User agent mute-toggles for camera & mic can be useful, yielding enhanced privacy (no need to trust site), and quick access (a sneeze coming on, or a family member walking into frame?)

It's behind a pref in Firefox because:

  1. The double-mute problem: site-mute + ua-mute = 4 states, where 3 produce no sound ("Can you hear me now?")
  2. UA-mute of microphone interferes with "Are you talking?" features
  3. Some sites (Meet) stop camera to work around crbug 642785 in Chrome, so there's no video track to UA-mute

This image is titled: "Am I muted?"

This issue is only about (1) the double-mute problem.

We determined we can only solve the double-mute problem by involving the site, which requires standardization.

The idea is:

  1. If the UA mutes or unmutes, the site should update its button to match.
  2. If the user unmutes using the site's button, the UA should unmute(!)

The first point requires no spec change: sites can listen to the mute and unmute events on the track (but they don't).
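The first point can be sketched with the existing track events. A minimal sketch, using a stand-in `EventTarget` in place of a real getUserMedia() track and a plain object in place of a DOM button, so the wiring is self-contained:

```javascript
// Sketch of point 1: keep the site's button in sync with UA mute state.
// FakeTrack and `button` are stand-ins; a real app would use a
// MediaStreamTrack from getUserMedia() and a DOM element.
class FakeTrack extends EventTarget {
  constructor() { super(); this.muted = false; }
}
const track = new FakeTrack();
const button = { label: 'Mute' }; // stand-in for the site's mute button

track.addEventListener('mute', () => { button.label = 'Unmute'; });
track.addEventListener('unmute', () => { button.label = 'Mute'; });

// Simulate the UA muting the track:
track.muted = true;
track.dispatchEvent(new Event('mute'));
console.log(button.label); // → "Unmute"
```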

The second point is key: if the user sees the site's button turn to "muted", they'll expect to be able to click it to unmute.

This is where it gets tricky, because we don't want to allow sites to unmute themselves at will, as this defeats any privacy benefits.

The proposal here is:

partial interface MediaStreamTrack {
  undefined unmute();
};

It would throw InvalidStateError unless the document has transient activation, is fully active, and has focus. User agents may also throw NotAllowedError for any reason, but if they don't, they must unmute the track (which fires the unmute event).

This should let user agents that wish to do so develop UX without the double-mute problem.
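A hedged sketch of how a site's unmute button might call the proposed unmute() from a click handler (the click supplies transient activation). The API is a proposal, not a shipped surface; StubTrack below models the proposed error contract purely for illustration:

```javascript
// Illustrative stub of the proposed track.unmute() error contract.
class StubTrack extends EventTarget {
  constructor() { super(); this.muted = true; this.hasActivation = false; }
  unmute() {
    // Proposal: throw InvalidStateError without transient activation.
    if (!this.hasActivation) {
      throw new DOMException('No transient activation', 'InvalidStateError');
    }
    this.muted = false;
    this.dispatchEvent(new Event('unmute')); // unmuting fires `unmute`
  }
}

const track = new StubTrack();
let errName = null;
try {
  track.unmute(); // outside a user gesture → rejected
} catch (e) {
  errName = e.name;
}
track.hasActivation = true; // as if inside a click handler
track.unmute();
console.log(errName, track.muted); // → "InvalidStateError" false
```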

dontcallmedom-bot commented 11 months ago

This issue was discussed in WebRTC November 2023 meeting – 21 November 2023 (Multi-mute (Three Thumbs Up - Setup) 🎞︎)

guidou commented 11 months ago

@guidou, you mentioned muted can be fired in Chrome for various reasons. Can you list some of them?

For camera tracks, Chrome just checks if frames have not been received for some time (25 expected frame intervals in most cases), regardless of the underlying reason. This maps well to the spec text that states "If live samples are not made available to the MediaStreamTrack it is muted", but does not map well to user-actionable mute reasons, since cameras muted by a user often continue sending frames. Moreover, there are other non-user-mute reasons that can cause a track to stop receiving frames and thus make the muted attribute true. For example, sometimes capture is suspended temporarily for efficiency reasons if no sink is connected to a source, and that causes tracks to not receive frames.

For audio, it turns out that microphones always keep sending samples. In this case Chrome tries to use a system-specific definition to set the muted attribute. For example, on macOS it uses AudioObjectGetPropertyData with kAudioDevicePropertyMute; on Windows it uses ISimpleAudioVolume::GetMute.

The reason I think we need an extra attribute is that the spec definition (and at least Chrome's implementation for camera tracks) does not map well to mute caused by user action (and therefore unmutable by user action as well).

youennf commented 11 months ago

Thanks @guidou, this is really helpful info.

This maps well to the spec text

The spec allows it. I wonder though whether this model is actually helping web developers. For instance, is it better to have a black video or a frozen video when the track is sent via WebRTC? The stats API that @henbos is working on could be more appropriate for web developers.

FWIW, in Safari, if we do not receive video frames/audio frames after a given amount of time, we fail the capture. We assume that something is wrong and that the web application is better off restarting capture, maybe with a different device. Some web applications do this; sadly, not all of them.

These browser differences are making developers' lives difficult. I wonder whether this space is now mature enough that we could get browsers to share a more consistent model around muted and capture suspension/failure. @jan-ivar, how is Firefox using muted these days for capture tracks? Is Firefox sometimes failing capture?

If we cannot converge on the model, I think it is ok to disambiguate things for the web app. We could go with a boolean value, I do not think differentiating between OS and UA as mute source is providing much benefit.

there are other non-user-mute reasons

I like this wording and definition better than the previous actionable or amenable proposed definitions. Basically a capture track was muted following a user decision. unspecified and user-choice make some sense to me.

@jan-ivar also mentioned MediaSession togglecamera and togglemicrophone, which could be done either in lieu of or in addition to the boolean. This might be a good signal since it is closer to what the user does and what the OS/UA shows to the user. Similarly, requestToggleCamera/requestToggleMicrophone could have the benefit of making it easier to have consistent UXs between web apps and UA/OS.

guidou commented 11 months ago

Thanks @guidou, this is really helpful info.

This maps well to the spec text

The spec allows it. I wonder though whether this model is actually helping web developers.

It's certainly not useful to express muting caused by user actions such as toggling a mute control somewhere (HW/OS). I don't know if anyone uses it for other purposes, but I wouldn't be surprised if someone does.

For instance, is it better to have a black video or a frozen video when the track is sent via WebRTC? The stats API that @henbos is working on could be more appropriate for web developers.

FWIW, in Safari, if we do not receive video frames/audio frames after a given amount of time, we fail the capture. We assume that something is wrong and that the web application is better off restarting capture, maybe with a different device. Some web applications do this; sadly, not all of them.

These browser differences are making developers' lives difficult. I wonder whether this space is now mature enough that we could get browsers to share a more consistent model around muted and capture suspension/failure. @jan-ivar, how is Firefox using muted these days for capture tracks? Is Firefox sometimes failing capture?

If we cannot converge on the model, I think it is ok to disambiguate things for the web app. We could go with a boolean value,

Having a boolean that can be used by the application to know that there is a mute due to user action (or that can be undone via a user action) would be a significant improvement over the current definition.

I do not think differentiating between OS and UA as mute source is providing much benefit.

Knowing what specific action the user can take to unmute would allow the application to provide guidance to the user and would be very beneficial for users. It's common for users to forget that they muted and/or how, and often they do it accidentally by unintentionally clicking some toggle somewhere (HW or OS). It's very common to see reports of cameras or microphones "not working" when in fact they had been muted by the user (possibly unintentionally).

there are other non-user-mute reasons

I like this wording and definition better than the previous actionable or amenable proposed definitions. Basically a capture track was muted following a user decision. unspecified and user-choice make some sense to me.

I think we can start with that. It would be a significant improvement over the current situation. We can discuss extending reasons as a follow-up, carefully weighing actual privacy risks against the benefit to the user, as part of the requestToggleX() or requestUnmute() discussion.

@jan-ivar also mentioned MediaSession togglecamera and togglemicrophone, which could be done either in lieu of or in addition to the boolean. This might be a good signal since it is closer to what the user does and what the OS/UA shows to the user. Similarly, requestToggleCamera/requestToggleMicrophone could have the benefit of making it easier to have consistent UXs between web apps and UA/OS.

Can you elaborate on how this would work? Would we need to use the media session API to find out if a track or device was user-muted? Or would we just reuse those events and fire them on a track? Something else?

youennf commented 11 months ago

Knowing what specific action the user can take to unmute would allow the application to provide guidance to the user.

As said before, I am not against providing this information once user takes action to actually unmute and UA has issues unmuting.

Can you elaborate on how this would work?

Basically, when user mutes capture with OS/UA UX, UA would fire the mute event on capture tracks and would execute the togglemicrophone action callback. The web page would update its UI within this action callback, either based on capture tracks muted state or on media session additional info. Ditto when user unmutes via OS/UA UX.

guidou commented 11 months ago

As said before, I am not against providing this information once user takes action to actually unmute and UA has issues unmuting.

I don't necessarily agree that some level of detail before user action would be detrimental to privacy, but if we expose the user-choice / unspecified boolean in some form prior to user action and leave extra details after user action, that would be a significant improvement that we can use to move the discussion forward.

Can you elaborate on how this would work?

Basically, when user mutes capture with OS/UA UX, UA would fire the mute event on capture tracks and would execute the togglemicrophone action callback. The web page would update its UI within this action callback, either based on capture tracks muted state or on media session additional info. Ditto when user unmutes via OS/UA UX.

How would this be used in systems with multiple microphones or cameras, with some of them muted and some of them unmuted? Where would the user-choice/unspecified value be exposed?

eladalon1983 commented 11 months ago

As soon as the mute event occurs, a web app should be able to read all of the relevant state. The proposal using togglemicrophone does not appear to me to satisfy this requirement. But if we extend MediaStreamTrack to expose getMuteReasons(), the new state becomes exposed immediately at the time mute is fired.

youennf commented 11 months ago

I agree that the media session action handler should not need to go to MediaStreamTrack to do its processing. It seems it is missing something that would make it functional, something like:

partial dictionary MediaSessionActionDetails {
  boolean muting;
};

In that case, it seems better to actually design the action handler to execute first, and the mute events to fire second. This seems consistent with how the spec is designed in general.

Also, maybe we should deprecate setMicrophoneState and setCameraState.

How would this be used in systems with multiple microphones or cameras.

I am not sure this is needed, but the toggle action scope could be placed in MediaSessionActionDetails should the need arise. The scope would be the device, which is the lowest level we should probably go to.

eladalon1983 commented 11 months ago

I agree that the media session action handler should not need to go to MediaStreamTrack to do its processing. It seems it is missing something that would make it functional, something like:

partial dictionary MediaSessionActionDetails {
  boolean muting;
};

In that case, it seems better to actually design the action handler to execute first, and the mute events to fire second. This seems consistent with how the spec is designed in general.


I think the following proposal is better:

interface MuteReason {
  readonly attribute boolean upstream;
};

partial interface MediaStreamTrack {
  sequence<MuteReason> getMuteReasons();
};

This is simple, it solves the problem, it is immediately available when the mute event fires, and it's extensible. For example, we could later extend MuteReason as:

enum MuteSource {"unspecified", "user-agent", "operating-system", "hardware"};

interface MuteReason {
  readonly attribute boolean upstream;
  readonly attribute MuteSource source;
};
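A sketch of how an app might consume the proposed getMuteReasons() when the mute event fires. The interface is only a proposal from this thread; the stub track and the hard-coded reason below exist solely to make the flow concrete:

```javascript
// Illustrative stub of the proposed getMuteReasons() surface.
class StubTrack extends EventTarget {
  getMuteReasons() {
    // e.g. the OS muted the device upstream of the UA (hypothetical data)
    return [{ upstream: true, source: 'operating-system' }];
  }
}

const track = new StubTrack();
const seen = [];
track.addEventListener('mute', () => {
  // The new state is readable synchronously when `mute` fires.
  for (const { upstream, source } of track.getMuteReasons()) {
    seen.push(upstream ? `${source} (upstream)` : source);
  }
});
track.dispatchEvent(new Event('mute'));
console.log(seen); // → ["operating-system (upstream)"]
```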
guidou commented 11 months ago

I agree that the media session action handler should not need to go to MediaStreamTrack to do its processing. It seems it is missing something that would make it functional, something like:

partial dictionary MediaSessionActionDetails {
  boolean muting;
};

How would this be used in systems with multiple microphones or cameras.

I am not sure this is needed, but the toggle action scope could be placed in MediaSessionActionDetails should the need arise. The scope would be the device, which is the lowest level we should probably go to.

It is essential to know which devices are muted and which ones aren't. Multiple cameras and/or microphones is a very common case. The user-choice/unspecified values (or whatever name/form we choose), exposed as mute reasons on MediaStreamTrack, look a lot simpler to me and are a straightforward complement to the MST muted attribute.

Media session looks like it is currently a poor fit that needs a lot of changes to a different spec to support our use case. The only argument for it is that it has events called togglemicrophone and togglecamera, which do not add any significant value over what we have in MST, since it already has mute and unmute events we can use. The real issue is the new state.

With media session:

eladalon1983 commented 11 months ago

During the editors' meeting, Youenn suggested extending togglemicrophone to receive the mute-state, and possibly making other extensions to address other issues. In that case, the answers to Guido's questions would be "yes":

Is there a way to know the initial state if there aren't any toggle events? Is there a way to know the state per device?

I think it would still be unhelpful to go down that rabbit hole. The show-stoppers are:

  1. Microphone and camera are not the only things that can be muted; screen-share is also a concern. Adding screensharetoggle and other xtoggle/ytoggle/ztoggle would not scale. We don't need a separate API surface for each thing that can be muted.
  2. Reasonable Web applications should be able to listen to the mute event, read the new state and take action then. This requires an API surface that's updated in conjunction with that event - the Media Session handlers don't fulfil that requirement.
eladalon1983 commented 11 months ago

I've published a skeleton of a PR in https://github.com/w3c/mediacapture-main/pull/979 - PTAL. If you think togglemicrophone is a preferable approach, it would be helpful to see a PR to that effect so we could contrast the two.

alvestrand commented 11 months ago

When discussing muting, we should also reflect on the (long) discussion on VideoTrackGenerator.mute - https://github.com/w3c/mediacapture-transform/issues/81

youennf commented 11 months ago

I see benefits in the MediaSession approach. It is an API used for updating UI/visible application states, which is exactly what we are talking about here. It also seems easier to do these updates in a single callback, compared to requiring the web app to potentially coalesce multiple mute events itself.

There are things that should be improved in MediaSession, independently of whether we expose a muted reason or not. I filed some MediaSession issues for that purpose. It makes sense to coordinate with the Media WG on this particular issue.

With regards to the definition of mute reasons, it makes sense to me to piggy-back on MediaSession. In that sense, it seems ok to me to expose to JS that MediaStreamTrack mute/unmute events are triggered by a MediaSession toggle action.

eladalon1983 commented 11 months ago

To help the discussion culminate in a decision, comparing PRs would be helpful. I have produced a PR for the approach I suggested. @youennf, could you produce a PR for yours?

youennf commented 11 months ago

Here is a media session based proposal:

These seem like valid improvements to the existing MediaSession API, independently of whether we expose a boolean on MediaStreamTrack to help disambiguate muted. Or maybe we should think of removing togglemicrophone/togglecamera, if we think onmute/onunmute is superior.

It would help to get the sense of MediaSession people, @steimelchrome, @jan-ivar, thoughts?

I think it is worth preparing slides for both versions; doing PRs now seems premature. The main thing is to look at it from a web-dev convenience point of view. In particular, since the point is to update UI, is it more convenient to use tracks or the media session API?

jan-ivar commented 11 months ago

Here is a media session based proposal:

For the simple use case (one camera, one microphone), nothing is needed, just use the existing mediaSession API

I like this proposal. I don't see a need to add more information since this seems to be exactly what the mediaSession API was built for (whether the toggles are in a desktop browser UX or on a phone lock screen seems irrelevant).

Initial state seems solved by firing the mediaSession events early, e.g. on pageload.

This issue is "Solve user agent camera/microphone double-mute", putting other sources out of scope.

Multiple devices also seem out of scope, since none of the global UA toggles so far (Safari or Firefox) work per-device AFAIK. They're page- or browser-global toggles, extending controls present in video conference pages today into the browser, imbuing them with ease of access and some privacy assurance that the webpage cannot hear the user. They solve the simple use cases of users not being heard, or worrying they can be heard (by participants or the webpage). I.e. they affect all devices that the page has.

I think Chrome's mute behavior is a bug. I've filed https://github.com/w3c/mediacapture-main/issues/982 to clarify the spec, so let's discuss that there.

I think we should standardize requesting unmute. I don't think we should standardize requesting mute. PRs ahead of decisions should not be required.

Too much in this thread.

guidou commented 11 months ago

Here is a media session based proposal: For the simple use case (one camera, one microphone), nothing is needed, just use the existing mediaSession API

We need to solve all use cases that arise in practice, not just the simplest one.

I like this proposal. I don't see a need to add more information since this seems to be exactly what the mediaSession API was built for (whether the toggles are in a desktop browser UX or on a phone lock screen seems irrelevant).

Initial state seems solved by firing the mediaSession events early, e.g. on pageload.

This issue is "Solve user agent camera/microphone double-mute", putting other sources out of scope.

We need to solve all use cases that arise in practice, not just the ones indicated in the first message of this thread.

Multiple devices also seem out of scope, since none of the global UA toggles so far (Safari or Firefox) work per-device AFAIK. They're page- or browser-global toggles, extending controls present in video conference pages today into the browser, imbuing them with ease of access and some privacy assurance that the webpage cannot hear the user. They solve the simple use cases of users not being heard, or worrying they can be heard (by participants or the webpage). I.e. they affect all devices that the page has.

Browser toggles are just one use case that needs to be handled. OS toggles (which can be per device, as in ChromeOS and maybe other OSes) need to be handled too. Hardware toggles need to be considered as well. Just because these were not mentioned in the original message doesn't really mean they're out of scope.

I think Chrome's mute behavior is a bug. I've filed w3c/mediacapture-main#982 to clarify the spec, so let's discuss that there.

It's not a bug, based on the current language of the spec. If the problem is that the mute attribute was defined wrongly, a better way to proceed would be to eliminate mute and its associated events from the spec and replace them with new ones with a new definition that matches the behavior we want today. This would allow us to introduce the new behavior without breaking existing applications and, once applications migrate, we can deprecate and remove the old attribute from implementations. We have done this successfully several times. The experience in Chromium with changing behavior to match spec redefinitions is much worse.

I think we should standardize requesting unmute. I don't think we should standardize requesting mute.

I agree. Apps already implement a way to mute at the app level.

PRs ahead of decisions should not be required.

Slides that show how the proposal solves the problems should be enough. We have a slot in the December 12 meeting to continue discussing this. If you have some slides available, maybe we can look at them then.

eladalon1983 commented 10 months ago

I don't think we should standardize requesting mute.

Was this suggested at some point?

PRs ahead of decisions should not be required.

PRs reveal the complexity that otherwise hides behind such phrases as "we could just..."

jan-ivar commented 10 months ago

I don't think we should standardize requesting mute.

Was this suggested at some point?

Yes in https://github.com/w3c/mediacapture-extensions/issues/39#issuecomment-1244530905.

We need to solve all use cases that arise in practice, not just the ones indicated in the first message of this thread.

This issue has 70 comments. Triaging discussion out to other (new or existing) issues such as https://github.com/w3c/mediacapture-main/issues/982 or https://github.com/w3c/mediasession/issues/279 seems worthwhile to me, or I don't see how we're going to reach any kind of consensus on all these feature requests. "Mute reason" probably deserves its own issue as well (there were 14 comments here when it was introduced to this conversation in https://github.com/w3c/mediacapture-extensions/issues/39#issuecomment-1805921604). It seems largely orthogonal to the OP proposal of letting apps unmute.

Browser toggles are just one use case that needs to be handled. OS toggles (which can be per device, as in ChromeOS and maybe other OSes) need to be handled too. Hardware toggles need to be considered as well.

These are all User Agent toggles IMHO, the details of which W3C specs tend to leave to the User Agent, focusing instead on the surface between web app and UA. I think that's the level of abstraction we need to be at.

eladalon1983 commented 10 months ago

I don't think we should standardize requesting mute.

Was this suggested at some point?

Yes in #39 (comment).

Thanks for clarifying. I share your opinion (@jan-ivar) about this proposal.

Mute reason" probably deserves its own issue [...] It seems largely orthogonal to the OP proposal of letting apps unmute.

Not completely orthogonal, because requestUnmute() requires some knowledge of the mute-reason, or else an app would be soliciting a useless user gesture from the user, to their disappointment and frustration.

These are all User Agent toggles IMHO, the details of which W3C specs tend to leave to the User Agent, focusing instead on the surface between web app and UA. I think that's the level of abstraction we need to be at.

As a representative of one open source browser who has filed bugs and looked into the code of another open source browser, I hope you'll find this comment compelling. It discusses the value transparency brings to the entire ecosystem.

jan-ivar commented 10 months ago

Instead of the OP proposal of a await track.unmute(), we might already have an API in https://github.com/w3c/mediasession/issues/279#issuecomment-1846023701:

navigator.mediaSession.setMicrophoneActive(false); 

E.g. an app calling this with user attention and transient activation may be enough of a signal to the UA to unmute tracks it has muted in this document, either raising a toast message after the fact or showing a prompt ahead of it.

The remaining problem is how the app would learn whether unmuting was successful or not. E.g. might this suffice?

navigator.mediaSession.setMicrophoneActive(false);
const unmuted = await Promise.race([
  new Promise(r => track.onunmute = () => r(true)),
  new Promise(r => setTimeout(() => r(false), 0))
]);
youennf commented 10 months ago

setMicrophoneActive looks good to me if we can validate its actual meaning with the Media WG. This API can be extended (return a promise, additional parameters) to progressively cover more of what has been discussed in this thread.

jan-ivar commented 10 months ago

Not completely orthogonal, because requestUnmute() requires some knowledge of the mute-reason, or else an app would be soliciting a useless user gesture from the user, to their disappointment and frustration.

Hiding an unmute control seems a small dent in the disappointment and frustration of being unable to unmute. IOW a secondary problem to the first.

eladalon1983 commented 10 months ago

either raising a toast message or a prompt ahead of it.

As I have mentioned multiple times before - the user agent has no idea what "shiny button text" means to the user, or what the user believed they were approving when they conferred transient activation on the page. Only the prompt-based approach is viable.

Hiding an unmute control seems a small dent in the disappointment and frustration of being unable to unmute.

It does not look at all "small" to me. In fact, I am shocked that after months of debating whether an API should be sync or async, which would have no user-visible effect, you label this major user-visible issue as "small." What is the methodology you employ to classify the gravity of issues?

eladalon1983 commented 10 months ago

Hiding an unmute control seems a small dent in the disappointment and frustration of being unable to unmute.

I repeat - there is nothing "small" about a user clicking a button and it disappearing without having an effect. It looks like a bug and it would nudge users towards abandoning the Web app in favor of a native-app competitor. Web developers care much more about their users' perception of the app's reliability, than they do about the inconvenience of adding "await" to a method invocation. Let's focus our attention where it matters!

eladalon1983 commented 10 months ago

[image: a thumbs-down reaction]

Thank you for this engagement, Jan-Ivar. I am looking forward to hear why you disagree.

Orthogonally, I'll be proposing that the rules of conduct in the WG be amended to discourage the use of the thumbs-down emoji without elaboration. Noting disagreement without elaborating on the reasons serves no productive purpose.

dontcallmedom-bot commented 10 months ago

This issue was discussed in WebRTC December 12 2023 meeting – 12 December 2023 (Solve user agent camera/microphone double-mute (mediacapture-extensions))

jan-ivar commented 1 month ago

Closing this as the double-mute problem was instead solved in https://github.com/w3c/mediasession/pull/312.

Here's an example of how a website can synchronize application mute state with that of the browser.
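A hedged sketch of that pattern: the app reports its own mute state via setMicrophoneActive() and reacts to UA/OS toggles through the togglemicrophone action. A tiny mock stands in for navigator.mediaSession so the flow is self-contained; in a real page you would use navigator.mediaSession directly.

```javascript
// Mock of the mediaSession surface used below, so the example runs
// anywhere; a real page would use navigator.mediaSession instead.
const handlers = {};
const mediaSession = {
  microphoneActive: true,
  setActionHandler(action, cb) { handlers[action] = cb; },
  setMicrophoneActive(active) { this.microphoneActive = active; },
};

let appMuted = false; // state behind the site's own mute button

// UA/OS toggled the mic: flip the app state and report it back, keeping
// the site's button and the browser's indicator in sync (no double-mute).
mediaSession.setActionHandler('togglemicrophone', () => {
  appMuted = !appMuted;
  mediaSession.setMicrophoneActive(!appMuted);
});

// Simulate the user muting from browser/OS UX:
handlers['togglemicrophone']();
console.log(appMuted, mediaSession.microphoneActive); // → true false
```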