Closed: jan-ivar closed this issue 1 month ago.
This issue was discussed in WebRTC November 2023 meeting – 21 November 2023 (Multi-mute (Three Thumbs Up - Setup) 🎞︎)
@guidou, you mentioned that `muted` can be set in Chrome for various reasons. Can you list some of them?
For camera tracks, Chrome just checks whether frames have not been received for some time (25 expected frame intervals in most cases), regardless of the underlying reason. This maps well to the spec text that states "If live samples are not made available to the MediaStreamTrack it is muted", but does not map well to user-actionable mute reasons, since cameras muted by a user often continue sending frames. Moreover, there are other non-user-mute reasons that can cause a track to stop receiving frames and thus make the `muted` attribute true. For example, capture is sometimes suspended temporarily for efficiency reasons if no sink is connected to a source, and that causes tracks to not receive frames.
For audio, it turns out that microphones always keep sending samples. In this case Chrome tries to use a system-specific definition to set the `muted` attribute. For example, on macOS it uses `AudioObjectGetPropertyData` with `kAudioDevicePropertyMute`; on Windows it uses `ISimpleAudioVolume::GetMute`.
The reason I think we need an extra attribute is that the spec definition (and at least Chrome's implementation for camera tracks) does not map well to mute caused by user action (and therefore unmutable by user action as well).
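(For context, a minimal sketch of the surface apps have today, using only the standard `muted` attribute and the `mute`/`unmute` events; `updateMuteIndicator` is a hypothetical app-side UI helper:)

```js
// Standard API only: MediaStreamTrack.muted plus the mute/unmute events.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [track] = stream.getVideoTracks();

updateMuteIndicator(track.muted);                  // initial state
track.onmute = () => updateMuteIndicator(true);    // frames stopped arriving
track.onunmute = () => updateMuteIndicator(false); // frames resumed
// Note: no reason is exposed, so the app cannot distinguish a user-initiated
// mute from, e.g., a temporary capture suspension.
```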
Thanks @guidou, this is really helpful info.
> This maps well to the spec text
The spec allows it. I wonder though whether this model is actually helping web developers. For instance, is it better to have a black video or a frozen video when the track is sent via WebRTC? The stats API that @henbos is working on could be more appropriate for web developers.
FWIW, in Safari, if we do not receive video/audio frames after a given amount of time, we fail the capture. We assume that something is wrong and that the web application would be better off restarting capture, maybe with a different device. Some web applications do this; sadly, not all of them.
These browser differences are making developers' lives difficult. I wonder whether this space is now mature enough that we could get browsers to share a more consistent model around `muted` and capture suspension/failure. @jan-ivar, how is Firefox using muted these days for capture tracks? Is Firefox sometimes failing capture?
If we cannot converge on the model, I think it is ok to disambiguate things for the web app. We could go with a boolean value, I do not think differentiating between OS and UA as mute source is providing much benefit.
> there are other non-user-mute reasons
I like this wording and definition better than the previously proposed `actionable` or `amenable` definitions. Basically, a capture track was muted following a user decision. `unspecified` and `user-choice` make some sense to me.
@jan-ivar also mentioned MediaSession `togglecamera` and `togglemicrophone`, which could be done either in lieu of or in addition to the boolean. This might be a good signal since it is closer to what the user does and what the OS/UA shows to the user.
Similarly, `requestToggleCamera`/`requestToggleMicrophone` could have the benefit of making it easier to have consistent UXs between web apps and the UA/OS.
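(For reference, both action names already exist in the Media Session spec; a registration sketch, with `syncMicUi`/`syncCamUi` as hypothetical app helpers:)

```js
// React to the user toggling the mic/camera via UA or OS UI.
navigator.mediaSession.setActionHandler("togglemicrophone", () => syncMicUi());
navigator.mediaSession.setActionHandler("togglecamera", () => syncCamUi());
```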
> Thanks @guidou, this is really helpful info.
>> This maps well to the spec text
> The spec allows it. I wonder though whether this model is actually helping web developers.
It's certainly not useful to express muting caused by user actions such as toggling a mute control somewhere (HW/OS). I don't know if anyone uses it for other purposes, but I wouldn't be surprised if someone does.
> For instance, is it better to have a black video or a frozen video when the track is sent via WebRTC? The stats API that @henbos is working on could be more appropriate for web developers.
> FWIW, in Safari, if we do not receive video/audio frames after a given amount of time, we fail the capture. We assume that something is wrong and that the web application would be better off restarting capture, maybe with a different device. Some web applications do this; sadly, not all of them.
> These browser differences are making developers' lives difficult. I wonder whether this space is now mature enough that we could get browsers to share a more consistent model around `muted` and capture suspension/failure. @jan-ivar, how is Firefox using muted these days for capture tracks? Is Firefox sometimes failing capture?
> If we cannot converge on the model, I think it is ok to disambiguate things for the web app. We could go with a boolean value,
Having a boolean that can be used by the application to know that there is a mute due to user action (or that can be undone via a user action) would be a significant improvement over the current definition.
> I do not think differentiating between OS and UA as mute source is providing much benefit.
Knowing what specific action the user can take to unmute would allow the application to provide guidance to the user and would be very beneficial for users. It's common for users to forget that they muted and/or how, and often they do it accidentally by unintentionally clicking some toggle somewhere (HW or OS). It's very common to see reports of cameras or microphones "not working" when in fact they had been muted by the user (possibly unintentionally).
>> there are other non-user-mute reasons
> I like this wording and definition better than the previously proposed `actionable` or `amenable` definitions. Basically, a capture track was muted following a user decision. `unspecified` and `user-choice` make some sense to me.
I think we can start with that. It would be a significant improvement over the current situation. We can discuss extending reasons as a follow-up, carefully weighing actual privacy risks against the benefit to the user, as part of the `requestToggleX()` or `requestUnmute()` discussion.
> @jan-ivar also mentioned MediaSession `togglecamera` and `togglemicrophone`, which could be done either in lieu of or in addition to the boolean. This might be a good signal since it is closer to what the user does and what the OS/UA shows to the user. Similarly, `requestToggleCamera`/`requestToggleMicrophone` could have the benefit of making it easier to have consistent UXs between web apps and the UA/OS.
Can you elaborate on how this would work? Will we need to use the media session API to find out if a track or device was user-muted? Or would we just reuse those events and fire them on a track? Something else?
> Knowing what specific action the user can take to unmute would allow the application to provide guidance to the user.
As said before, I am not against providing this information once the user takes action to actually unmute and the UA has issues unmuting.
> Can you elaborate on how this would work?
Basically, when the user mutes capture with OS/UA UX, the UA would fire the mute event on capture tracks and would execute the `togglemicrophone` action callback. The web page would update its UI within this action callback, based either on the capture tracks' muted state or on additional media session info. Ditto when the user unmutes via OS/UA UX.
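(A sketch of that flow; whether the event or the action callback runs first is exactly what is under discussion, and `setMicUi` and `micTrack` are hypothetical app-side names:)

```js
// The UA mutes the track (firing "mute") and invokes the action callback;
// the page mirrors the track state in its own UI from the callback.
navigator.mediaSession.setActionHandler("togglemicrophone", () => {
  setMicUi(micTrack.muted ? "muted" : "live");
});
```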
> As said before, I am not against providing this information once the user takes action to actually unmute and the UA has issues unmuting.
I don't necessarily agree that some level of detail before user action would be detrimental to privacy, but if we expose the `user-choice`/`unspecified` boolean in some form prior to user action and leave extra details until after user action, that would be a significant improvement that we can use to move the discussion forward.
>> Can you elaborate on how this would work?
> Basically, when the user mutes capture with OS/UA UX, the UA would fire the mute event on capture tracks and would execute the `togglemicrophone` action callback. The web page would update its UI within this action callback, based either on the capture tracks' muted state or on additional media session info. Ditto when the user unmutes via OS/UA UX.
How would this be used in systems with multiple microphones or cameras, with some of them muted and some of them unmuted? Where would the `user-choice`/`unspecified` value be exposed?
As soon as the `mute` event occurs, a Web app should be able to read all of the relevant state. The proposal using `togglemicrophone` does not appear to me to satisfy this requirement. But if we extend `MediaStreamTrack` to expose `getMuteReasons()`, the new state will be exposed immediately when the `mute` event fires.
I agree that the media session action handler should not need to go to MediaStreamTrack to do its processing. It seems it is missing something that would make it functional, something like:
```webidl
partial dictionary MediaSessionActionDetails {
  boolean muting;
};
```
In that case, it seems better to actually design the action handler to execute first, and the mute events to fire second. This seems consistent with how the spec is designed in general.
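(A hypothetical handler using that proposed `muting` member; nothing here is shipping API, and `setMicUi` is a made-up helper:)

```js
navigator.mediaSession.setActionHandler("togglemicrophone", (details) => {
  // details.muting is the proposed extension above, not an existing field.
  setMicUi(details.muting ? "muted" : "live");
});
```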
Also, maybe we should deprecate `setMicrophoneState` and `setCameraState`.
> How would this be used in systems with multiple microphones or cameras?
I am not sure this is needed, but the `toggle` action scope could be placed in `MediaSessionActionDetails` should the need arise. The scope would be the device, which is the lowest level to which we should probably go.
> I agree that the media session action handler should not need to go to MediaStreamTrack to do its processing. It seems it is missing something that would make it functional, something like:
> `partial dictionary MediaSessionActionDetails { boolean muting; };`
> In that case, it seems better to actually design the action handler to execute first, and the mute events to fire second. This seems consistent with how the spec is designed in general.
Writing a `mute` handler becomes complex and error-prone when multiple mute/unmute actions happen in short succession. A reasonable event listener for `mute` should expect to just be able to read the most recent state without much worry. I think the following proposal is better:
```webidl
interface MuteReason {
  readonly attribute boolean upstream;
};

partial interface MediaStreamTrack {
  sequence<MuteReason> getMuteReasons();
};
```
This is simple, it solves the problem, it is immediately available when the `mute` event fires, and it's fully extensible. For example, we could in the future extend `MuteReason` as:
```webidl
enum MuteSource { "unspecified", "user-agent", "operating-system", "hardware" };

interface MuteReason {
  readonly attribute boolean upstream;
  readonly attribute MuteSource source;
};
```
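(A hypothetical consumer of this proposal; `getMuteReasons()` is the proposed API, and the reading of `upstream` as "muted somewhere upstream of the page" is a guess based on this thread. `setMicUi` is a made-up helper:)

```js
track.addEventListener("mute", () => {
  // Proposed API: read the reasons synchronously when the event fires.
  const reasons = track.getMuteReasons();
  const mutedUpstream = reasons.some((r) => r.upstream);
  setMicUi(mutedUpstream ? "muted-upstream" : "muted-by-page");
});
```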
> I agree that the media session action handler should not need to go to MediaStreamTrack to do its processing. It seems it is missing something that would make it functional, something like:
> `partial dictionary MediaSessionActionDetails { boolean muting; };`
>> How would this be used in systems with multiple microphones or cameras?
> I am not sure this is needed, but the `toggle` action scope could be placed in `MediaSessionActionDetails` should the need arise. The scope would be the device, which is the lowest level to which we should probably go.
It is essential to know which devices are muted and which ones aren't. Multiple cameras and/or microphones is a very common case. The `user-choice`/`unspecified` values (or whatever name/form we choose) exposed as mute reasons on `MediaStreamTrack` look a lot simpler to me and are a straightforward complement to the MST `muted` attribute.
Media session looks like it is currently a poor fit that needs a lot of changes to a different spec to support our use case.
The only argument for it is that it has events called `togglemicrophone` and `togglecamera`, which do not add any significant value over what we have in MST, since it already has `mute` and `unmute` events we can use. The real issue is the new state.
With media session:
During the editors' meeting, Youenn suggested extending `togglemicrophone` to receive the mute-state, and possibly making other extensions to address other issues. In that case, the answers to Guido's questions would be "yes":
> Is there a way to know the initial state if there aren't any toggle events? Is there a way to know the state per device?
I think it would still be unhelpful to go down that rabbit hole. The show-stoppers are:
- `screensharetoggle` and other `xtoggle`/`ytoggle`/`ztoggle` actions would not scale. We don't need a separate API surface for each thing that can be muted.
- Apps should be able to listen to the `mute` event, read the new state and take action then. This requires an API surface that's updated in conjunction with that event; the Media Session handlers don't fulfil that requirement.

I've published a skeleton of a PR in https://github.com/w3c/mediacapture-main/pull/979 - PTAL. If you think `togglemicrophone` is a preferable approach, it would be helpful to see a PR to that effect so we could contrast the two.
When discussing muting, we should also reflect on the (long) discussion on VideoTrackGenerator.mute - https://github.com/w3c/mediacapture-transform/issues/81
I see benefits in the MediaSession approach. It is an API used for updating UI/visible application states, which is exactly what we are talking about here. It also seems easier to do these updates in a single callback, compared to requiring the web app to potentially coalesce multiple mute events itself.
There are things that should be improved in MediaSession, independently of whether we expose a muted reason or not. I filed some MediaSession issues for that purpose. It makes sense to coordinate with the Media WG on this particular issue.
With regards to the definition of mute reasons, it makes sense to me to piggy-back on MediaSession. In that sense, it seems ok to me to expose to JS that `MediaStreamTrack` mute/unmute events are triggered by a MediaSession toggle action.
To help the discussion culminate in a decision, comparing PRs would be helpful. I have produced a PR for the approach I suggested. @youennf, could you produce a PR for yours?
Here is a media session based proposal:
For the simple use case (one camera, one microphone), nothing is needed; just use the existing mediaSession API. Beyond that:

```webidl
partial dictionary MediaSessionActionDetails {
  boolean isMuting;
};
```

```webidl
partial dictionary MediaSessionActionDetails {
  sequence<DOMString> deviceIds;
};
```

Add a `togglescreenshare` media session action.

These seem like valid improvements to the existing MediaSession API, independently of whether we expose a boolean on MediaStreamTrack to help disambiguate `muted`. Or maybe we should think of removing `togglemicrophone`/`togglecamera`, if we think `onmute`/`onunmute` is superior.
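(A hypothetical handler under this proposal; `isMuting` and `deviceIds` are the proposed additions above, not shipping API, and `setDeviceUi` is a made-up helper:)

```js
navigator.mediaSession.setActionHandler("togglemicrophone", (details) => {
  // Both fields below are the proposed (hypothetical) extensions.
  for (const deviceId of details.deviceIds ?? []) {
    setDeviceUi(deviceId, details.isMuting ? "muted" : "live");
  }
});
```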
It would help to get the sense of the MediaSession folks. @steimelchrome, @jan-ivar, thoughts?
I think it is worth preparing slides for both versions; doing PRs now seems premature. The main thing is to look at it from a web-dev convenience point of view. In particular, since the point is to update UI, is it more convenient to use tracks or to use the media session API?
> Here is a media session based proposal:
> For the simple use case (one camera, one microphone), nothing is needed, just use the existing mediaSession API
I like this proposal. I don't see a need to add more information since this seems to be exactly what the mediaSession API was built for (whether the toggles are in a desktop browser UX or on a phone lock screen seems irrelevant).
Initial state seems solved by firing the mediaSession events early, e.g. on pageload.
This issue is "Solve user agent camera/microphone double-mute", putting other sources out of scope.
Multiple devices also seems out of scope since none of the global UA toggles so far (Safari or Firefox) work per-device AFAIK. They're page or browser global toggles, extending controls present in video conference pages today into the browser, imbuing them with ease of access and some privacy assurance that the webpage cannot hear them, solving the simple use cases of users not being heard, or worrying they can be heard (by participants or webpage). I.e. they affect all devices that page has.
I think Chrome's mute behavior is a bug. I've filed https://github.com/w3c/mediacapture-main/issues/982 to clarify the spec, so let's discuss that there.
I think we should standardize requesting unmute. I don't think we should standardize requesting mute. PRs ahead of decisions should not be required.
Too much in this thread.
> Here is a media session based proposal: For the simple use case (one camera, one microphone), nothing is needed, just use the existing mediaSession API
We need to solve all use cases that arise in practice, not just the simplest one.
> I like this proposal. I don't see a need to add more information since this seems to be exactly what the mediaSession API was built for (whether the toggles are in a desktop browser UX or on a phone lock screen seems irrelevant).
> Initial state seems solved by firing the mediaSession events early, e.g. on pageload.
> This issue is "Solve user agent camera/microphone double-mute", putting other sources out of scope.
We need to solve all use cases that arise in practice, not just the ones indicated in the first message of this thread.
> Multiple devices also seems out of scope since none of the global UA toggles so far (Safari or Firefox) work per-device AFAIK. They're page or browser global toggles, extending controls present in video conference pages today into the browser, imbuing them with ease of access and some privacy assurance that the webpage cannot hear them, solving the simple use cases of users not being heard, or worrying they can be heard (by participants or webpage). I.e. they affect all devices that page has.
Browser toggles are just one use case that needs to be handled. OS toggles (which can be per device, as in ChromeOS and maybe other OSes) need to be handled too. Hardware toggles need to be considered as well. Just because these were not mentioned in the original message doesn't really mean they're out of scope.
> I think Chrome's mute behavior is a bug. I've filed w3c/mediacapture-main#982 to clarify the spec, so let's discuss that there.
It's not a bug, based on the current language of the spec. If the problem is that the `muted` attribute was defined wrongly, a better way to proceed would be to eliminate `muted` and its associated events from the spec and replace them with new ones with a new definition that matches the behavior we want today. This would allow us to introduce the new behavior without breaking existing applications and, once applications migrate, we can deprecate and remove the old attribute from implementations. We have done this successfully several times. The experience in Chromium with changing behavior to match spec redefinitions is much worse.
> I think we should standardize requesting unmute. I don't think we should standardize requesting mute.
I agree. Apps already implement a way to mute at the app level.
> PRs ahead of decisions should not be required.
Slides that show how the proposal solves the problems should be enough. We have a slot in the December 12 meeting to continue discussing this. If you have some slides available, maybe we can look at them then.
> I don't think we should standardize requesting mute.
Was this suggested at some point?
> PRs ahead of decisions should not be required.
PRs reveal the complexity that otherwise hides behind such phrases as "we could just..."
>> I don't think we should standardize requesting mute.
> Was this suggested at some point?
Yes in https://github.com/w3c/mediacapture-extensions/issues/39#issuecomment-1244530905.
> We need to solve all use cases that arise in practice, not just the ones indicated in the first message of this thread.
This issue has 70 comments. Triaging discussion out to other (new or existing) issues such as https://github.com/w3c/mediacapture-main/issues/982 or https://github.com/w3c/mediasession/issues/279 seems worthwhile to me, or I don't see how we're going to reach any kind of consensus on all these feature requests. "Mute reason" probably deserves its own issue as well (there were 14 comments here when it was introduced to this conversation in https://github.com/w3c/mediacapture-extensions/issues/39#issuecomment-1805921604). It seems largely orthogonal to the OP proposal of letting apps unmute.
> Browser toggles are just one use case that needs to be handled. OS toggles (which can be per device, as in ChromeOS and maybe other OSes) need to be handled too. Hardware toggles need to be considered as well.
These are all User Agent toggles IMHO, the details of which W3C specs tend to leave to the User Agent, focusing instead on the surface between web app and UA. I think that's the level of abstraction we need to be at.
>>> I don't think we should standardize requesting mute.
>> Was this suggested at some point?
> Yes in #39 (comment).
Thanks for clarifying. I share your opinion (@jan-ivar) about this proposal.
Mute reason" probably deserves its own issue [...] It seems largely orthogonal to the OP proposal of letting apps unmute.
Not completely orthogonal, because `requestUnmute()` requires some knowledge of the mute-reason, or else an app would be soliciting a useless user gesture from the user, to their disappointment and frustration.
> These are all User Agent toggles IMHO, the details of which W3C specs tend to leave to the User Agent, focusing instead on the surface between web app and UA. I think that's the level of abstraction we need to be at.
As a representative of one open source browser who has filed bugs and looked into the code of another open source browser, I hope you'll find this comment compelling. It discusses the value transparency brings to the entire ecosystem.
Instead of the OP proposal of an `await track.unmute()`, we might already have an API in https://github.com/w3c/mediasession/issues/279#issuecomment-1846023701:

```js
navigator.mediaSession.setMicrophoneActive(false);
```
E.g. an app calling this with user attention and transient activation may be enough of a signal to the UA to unmute tracks it has muted in this document, either raising a toast message about it after the fact, or a prompt ahead of it.
The remaining problem is how the app would learn whether unmuting was successful or not. E.g. might this suffice?
```js
navigator.mediaSession.setMicrophoneActive(false);
const unmuted = await Promise.race([
  new Promise(r => track.onunmute = () => r(true)), // the unmute event fired
  new Promise(r => setTimeout(() => r(false), 0))   // or give up after a task
]);
```
`setMicrophoneActive` looks good to me if we can validate its actual meaning with the Media WG. This API can be extended (returning a promise, additional parameters) to progressively cover more of what has been discussed in this thread.
> Not completely orthogonal, because `requestUnmute()` requires some knowledge of the mute-reason, or else an app would be soliciting a useless user gesture from the user, to their disappointment and frustration.
Hiding an unmute control seems a small dent in the disappointment and frustration of being unable to unmute. IOW a secondary problem to the first.
> either raising a toast message or a prompt ahead of it.
As I have mentioned multiple times before - the user agent has no idea what "shiny button text" means to the user, or what the user believed they were approving when they conferred transient activation on the page. Only the prompt-based approach is viable.
> Hiding an unmute control seems a small dent in the disappointment and frustration of being unable to unmute.
It does not look at all "small" to me. In fact, I am shocked that after months of debating whether an API should be sync or async, which would have no user-visible effect, you label this major user-visible issue as "small." What is the methodology you employ to classify the gravity of issues?
> Hiding an unmute control seems a small dent in the disappointment and frustration of being unable to unmute.
I repeat - there is nothing "small" about a user clicking a button and it disappearing without having an effect. It looks like a bug and it would nudge users towards abandoning the Web app in favor of a native-app competitor. Web developers care much more about their users' perception of the app's reliability, than they do about the inconvenience of adding "await" to a method invocation. Let's focus our attention where it matters!
Thank you for this engagement, Jan-Ivar. I am looking forward to hearing why you disagree.
Orthogonally, I'll be proposing that the rules of conduct in the WG be amended to discourage the use of the thumbs-down emoji without elaboration. Noting disagreement without elaborating on the reasons serves no productive purpose.
Closing this as the double-mute problem was instead solved in https://github.com/w3c/mediasession/pull/312.
Here's an example of how a website can synchronize application mute state with that of the browser.
User agent mute-toggles for camera & mic can be useful, yielding enhanced privacy (no need to trust the site) and quick access (a sneeze coming on, or a family member walking into frame?). Firefox has these toggles (behind `privacy.webrtc.globalMuteToggles` in about:config). It's behind a pref in Firefox because:
[Image: "Am I muted?"]
This issue is only about (1) the double-mute problem.
We determined we can only solve the double-mute problem by involving the site, which requires standardization.
The idea is: (1) the site mirrors the user agent's mute state in its own mute button, and (2) the user can then unmute through the site's button.
The first point requires no spec change: sites can listen to the mute and unmute events on the track (but they don't).
The second point is key: if the user sees the site's button turn to "muted", they'll expect to be able to click it to unmute.
This is where it gets tricky, because we don't want to allow sites to unmute themselves at will, as this defeats any privacy benefits.
The proposal here is:
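(The exact proposal text is not preserved here; based on the `await track.unmute()` shape referenced earlier in the thread, it was along these lines:)

```js
// Hypothetical reconstruction: a promise-returning unmute() on MediaStreamTrack.
try {
  await track.unmute(); // needs transient activation, focus, fully active
  // Success: the UA unmuted the track and the "unmute" event has fired.
} catch (e) {
  // InvalidStateError or NotAllowedError, per the conditions described below.
}
```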
It would throw `InvalidStateError` unless the document has transient activation, is fully active, and has focus. User agents may also throw `NotAllowedError` for any reason, but if they don't, then they must unmute the track (which will fire the unmute event). This should let user agents that wish to do so develop UX without the double-mute problem.