w3c / mediacapture-extensions

Extensions to Media Capture and Streams by the WebRTC Working Group
https://w3c.github.io/mediacapture-extensions/

Solve user agent camera/microphone double-mute #39

Open jan-ivar opened 2 years ago

jan-ivar commented 2 years ago

User agent mute-toggles for camera & mic can be useful, yielding enhanced privacy (no need to trust the site) and quick access (a sneeze coming on, or a family member walking into frame?).

It's behind a pref in Firefox because:

  1. The double-mute problem: site-mute + ua-mute = 4 states, where 3 produce no sound ("Can you hear me now?")
  2. UA-mute of microphone interferes with "Are you talking?" features
  3. Some sites (Meet) stop camera to work around crbug 642785 in Chrome, so there's no video track to UA-mute

[Image: "Am I muted?"]

This issue is only about (1) the double-mute problem.

We determined we can only solve the double-mute problem by involving the site, which requires standardization.

The idea is:

  1. If the UA mutes or unmutes, the site should update its button to match.
  2. If the user unmutes using the site's button, the UA should unmute(!)

The first point requires no spec change: sites can listen to the mute and unmute events on the track (but they don't).
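The listening side of point 1 could be sketched as follows (a minimal illustration; the function and button names are assumptions, not part of any proposal):

```javascript
// Minimal sketch of point 1: keep the site's mute button in sync with
// UA-level mute by listening on the track. `track` would be a
// MediaStreamTrack from getUserMedia; `button` is the site's toggle.
function syncButtonWithTrack(track, button) {
  const render = () => {
    button.textContent = track.muted ? "Unmute" : "Mute";
  };
  track.addEventListener("mute", render);
  track.addEventListener("unmute", render);
  render(); // reflect the initial state
}
```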

The second point is key: if the user sees the site's button turn to "muted", they'll expect to be able to click it to unmute.

This is where it gets tricky, because we don't want to allow sites to unmute themselves at will, as this defeats any privacy benefits.

The proposal here is:

partial interface MediaStreamTrack {
  undefined unmute();
};

It would throw InvalidStateError unless the document has transient activation, is fully active, and has focus. User agents may also throw NotAllowedError for any reason, but if they don't, then they must unmute the track (which will fire the unmute event).

This should let user agents that wish to do so develop UX without the double-mute problem.
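A site's unmute click handler under this proposal might look like the following sketch (the helper name and error handling are illustrative; the proposed method is synchronous here, per the IDL above):

```javascript
// Sketch: handle a click on the site's unmute button using the
// proposed synchronous unmute(). Returns true if the track ends up
// unmuted, false if the UA declined.
function handleUnmuteClick(track) {
  if (!track.muted) return true; // nothing to do
  try {
    track.unmute(); // proposed method; may throw if the UA declines
    return true;
  } catch (e) {
    // InvalidStateError / NotAllowedError: leave the UI in "muted"
    return false;
  }
}
```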

eladalon1983 commented 2 years ago

Repeating (and rephrasing) myself from the Chromium bug, since it'd be unreasonable to expect the audiences to be identical:

Media controls exposed by the browser, which allow an ongoing mic/camera/screen-capture to be muted by the user, communicate an implicit promise from the browser to the user. If the application is allowed to override that promise, it's allowed to break that promise.

I understand the double-mute problem and hope that there could be other ways to resolve it. For example, maybe if the application tries to unmute the track, the browser could show the user a prompt to approve that. This would mean that a user can still click the red mic button to unmute, but because additional approval is required, the application cannot unmute unilaterally in response to an arbitrary user gesture.

jan-ivar commented 2 years ago

That's a good idea. Having the method be asynchronous would allow this.

partial interface MediaStreamTrack {
  Promise<undefined> unmute();
};

The goal of a spec here is to allow innovation in this space without prescribing specific UX. It could be a prompt, or maybe a toast message is enough.

I think a lot of users would be surprised to learn that when they mute microphone or camera on a web site today, they have zero assurances that it actually happens. Well-behaved websites have lulled most users into feeling in control when they're not. Most probably don't consider that the website may turn camera or microphone back on at any time as long as the page is open.

The page doesn't even need to be open anymore: a different (origin) accomplice page can reopen/navigate to it without user interaction at a later point, since gUM doesn't require transient activation.

We dropped the ball on transient activation in gUM. Having better UA-integrated muting with transient activation might give us a second chance to rectify some privacy concerns.

For instance, a UA could choose to mute for the user if it detects a page accessing camera or microphone on pageload or without interaction.

youennf commented 1 year ago
Promise<undefined> unmute();

unmute() method seems fine. I wonder whether we should not try to introduce a mute() method as well. This would allow the website to mute itself and have its UI synchronised with OS/UA UI.

Also, in Safari, the mute/unmute is page wide, which means that all microphone tracks are either muted or unmuted. This does not align particularly well with unmute being at the track level. Maybe introducing unmute/mute at navigator.mediaDevices level would be good enough?

jan-ivar commented 1 year ago

I wonder whether we should not try to introduce a mute() method as well.

I fear this would create confusion and a false symmetry suggesting muting is under application control when it is not.

This would allow the website to mute itself and have its UI synchronised with OS/UA UI.

Applications have track.enabled for this. UAs "MAY" turn off indicators when this "brings all tracks connected to the device to be ... disabled," which Firefox has done since 2018. This is crbug 642785, I dunno if webkit has one.

Here's a fiddle for toggling multiple tracks demonstrating Firefox updating its camera URL bar indicator and OS/hardware light (modulo bug 1694304 on Windows for the hardware light), whenever the number of enabled tracks goes to zero.
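The pattern in that fiddle boils down to something like this sketch (an illustrative helper, not Firefox's or the fiddle's actual code): disable or re-enable every track the app holds for a device, so a UA can turn indicators off when the enabled count reaches zero.

```javascript
// Sketch: toggle `enabled` on all of the app's tracks for one device.
// When every track is disabled, a UA MAY turn off its indicators and
// hardware light. Returns the number of still-enabled tracks.
function setTracksEnabled(tracks, enabled) {
  for (const track of tracks) {
    track.enabled = enabled;
  }
  return tracks.filter((t) => t.enabled).length;
}
```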

I suggest leaving mute to UAs and concentrating on how apps can signal interest to unmute, to solve the issue at hand.

Also, in Safari, the mute/unmute is page wide,

Do you mean page wide or document wide? What about iframes?

... which means that all microphone tracks are either muted or unmuted. This does not align particularly well with unmute being at the track level.

Maybe we could use constraints? 🙂

Maybe introducing unmute/mute at navigator.mediaDevices level would be good enough?

Maybe, except muted is a property of the track, not mediaDevices.

jan-ivar commented 1 year ago

There's also the issue of multiple cameras. If we end up with navigator.mediaDevices.unmute(track) that would really suck.

jan-ivar commented 1 year ago

A simplifying factor is that UA muting is 100% about privacy, and as soon as one track is unmuted on a page, then there's no more privacy. So per-track mute would serve no purpose, and make for a terrible API:

await Promise.all(applicationKeepsTrackOfAllTracksItIsUsing.map(track => track.unmute())); // ugh

But with that understanding (of UA mute as a privacy feature), it seems POLA for track.unmute() to unmute all tracks of the same source per document or per top-level document.

So I think I agree mute is a property of the source by that definition.

But there can be multiple sources in navigator.mediaDevices.[[mediaStreamTrackSources]], and cameras can be pointed different ways, so it's not inconceivable that a UA may wish to control privacy per camera.

Even if we don't care about that, we'd need navigator.mediaDevices.unmute(kind) which seems unappealing. I'd rather go with track.unmute().

dontcallmedom-bot commented 1 year ago

This issue had an associated resolution in WebRTC WG 2023-04-18 – (Issue 39: Solve user agent camera/microphone double-mute):

RESOLUTION: No objections.

youennf commented 1 year ago

track.unmute() makes sense if we think it is useful for all source types. For WebRTC or canvas tracks, it does not make sense. For screen sources, I am not yet clear whether we will want to mute all tracks (video and audio) together or independently; CaptureController is the object representing the source, so the mute/unmute functionality could be placed there.

For capture tracks, InputDeviceInfo is what is closest to the source, hence why I was mentioning InputDeviceInfo.unmute as a possibility. Another difference to look at: MediaStreamTrack is transferable, while InputDeviceInfo is not; we should consider this (though InputDeviceInfo could of course be made transferable in the future).

Page/document mute scope probably covers at least 90% of the cases and is simpler to implement. But I feel muting at the source level is better in general, and a UA can always mute all sources if one gets muted.

I fear this would create confusion and a false symmetry suggesting muting is under application control when it is not.

unmute already introduces this potential confusion about who has control, hence the requestMute/requestUnmute name.

Applications have track.enabled for this. UAs "MAY" turn off indicators when this "brings all tracks connected to the device to be ... disabled," which Firefox has done since 2018. This is crbug 642785, I dunno if webkit has one.

This is a MAY though. Setting all tracks of the same source to enabled = false does not mean track.muted will switch to true; this is left to the UA, which does not seem great for interop. In Firefox's model, I would guess that muted would be set to true later on, when the application sets enabled = true on one of the tracks, in which case the application will then have to call unmute. This is not simple.

Looking at Safari: let's say Safari would update its muted icon when all tracks are enabled = false. It would then need to immediately set muted = true on these tracks. Let's say the user then clicks on Safari's UI; all tracks will have muted = false, but it is then up to the application to register for the muted event and act as if the user had clicked one of its own icons. Not simple again.

Looking at Web applications, they tend to clone tracks in the window environment (local rendering and PC, potentially different sizes as well), in the future in workers as well (for encoding/networking) or in other windows (via transfer). Having to set enabled to false on each of these objects, including transferred tracks, is cumbersome and potentially error-prone.

Looking at OS support, kAUVoiceIOProperty_MutedSpeechActivityEventListener and kAUVoiceIOProperty_MuteOutput are potentially useful to implement the "Are you talking UI" in the screenshot you added above. It seems a worthwhile API addition we could consider in the future: if mic is muted, we will allow you to be notified that user might be speaking.

Overall, it seems cleaner to me to have two separate APIs:

youennf commented 1 year ago

Alternative to InputDeviceInfo is navigator.mediaDevices.requestUnmute(deviceId)

jan-ivar commented 1 year ago

track.unmute() makes sense if we think this is useful for all source types.

It was a WG design choice to reuse MST for other sources to avoid inheritance. The cost of that is living with the fact that not all sources have all track abilities, NOT that tracks only have abilities shared by all sources.

getUserMedia returns camera and microphone tracks, so adding attributes, methods and constraints specific to camera and microphone should be fine. If it's not, then time to split inheritance.

E.g. track.muted only makes sense for camera and microphone, and track.unmute() fits with that.

Other sources do not have the double-mute problem, so to not complicate discussion, let's not discuss them here.

This is a MAY though.

Here's a SHOULD: "When a "live", unmuted, and enabled track sourced by a device exposed by getUserMedia() becomes either muted or disabled, and this brings all tracks connected to the device (across all navigables the user agent operates) to be either muted, disabled, or stopped, then the UA SHOULD relinquish the device within 3 seconds..."

Setting all tracks of the same source with enabled = false does not mean track.muted will switch to true, this is left to UA which does not seem great for interop.

Why would a UA mute an in-view application just because it disabled its tracks? That would be terrible for web compat.

To clarify, Mozilla needs no spec changes to solve turning off privacy indicators or camera light. Our view is the path to interop there is changing the MAY and SHOULD to MUST. But please let's discuss that in a separate issue.

In Firefox's model, I would guess that muted would be set to true later on, when the application sets enabled = true on one of the tracks, in which case the application will then have to call unmute. This is not simple.

No, that is not our plan. As explained in the OP, we have a privacy.webrtc.globalMuteToggles pref in about:config which turns on the global user-facing mute controls shown in the image above, and we want to arm sites with tools to unmute themselves, to better prepare for UA features like this.

Sorry for any misunderstanding, but it's not my intent to standardize UA muting here, only application-induced unmuting. Muting remains up to user agents, and I think it is important for privacy that they be allowed to continue to own that problem.

The scope of the proposal in the OP (and this issue) was to arm applications with tools to unmute themselves IF the user agent mutes them, not define when user agents mute them.

jan-ivar commented 1 year ago

unmute already introduces this potential confusion about who has control, hence the requestMute/requestUnmute name.

We have getUserMedia, not requestUserMedia, and that doesn't seem to confuse anybody.

NotAllowedError seems clear about who has control.

jan-ivar commented 1 year ago

Here's a SHOULD: "When a "live", unmuted, and enabled track sourced by a device exposed by getUserMedia() becomes either muted or disabled, and this brings all tracks connected to the device (across all navigables the user agent operates) to be either muted, disabled, or stopped, then the UA SHOULD relinquish the device within 3 seconds..."

FYI this was recently fixed upstream in https://webrtc-review.googlesource.com/c/src/+/302200

guidou commented 10 months ago

I'd like to revive this discussion, since these types of system-level controls (either at the UA or the OS) are becoming more common and we have observed that they create a lot of confusion for users. Like @jan-ivar says, there are two issues here:

  1. If the UA mutes or unmutes, the site should update its button to match.
  2. If the user unmutes using the site's button, the UA should unmute(!)

However, I disagree with this statement:

The first point requires no spec change: sites can listen to the mute and unmute events on the track (but they don't)

The reason sites don't listen to the mute and unmute events is that mute and unmute can be triggered by other causes, and if the application cannot tell that those events (and the muted attribute) are caused by the double-mute problem, it cannot react appropriately. The spec says muted means live samples are not made available to the MediaStreamTrack, which is not specific to UA/OS-level mute controls. In Chrome specifically, muted means a track is not getting frames for any reason (and system-level mute has never been one of those reasons in practice). IIRC, Safari has similarities with Chrome in this regard.

This has become a major problem for VC applications, and I think we need to solve it properly. I think we can iterate on several of the proposals made in this thread, which look very promising IMO.

cc @eladalon1983

eladalon1983 commented 10 months ago

The problem as I see it is that users can mute through multiple sources - app, UA, OS and hardware. The propagation of state through these layers is presently incomplete - an opportunity for us to earn our keep.

In the high-level, I think we have to provide two mechanisms:

  1. Sites listen for mute-status changes from upstream sources (UA, OS and hardware, in that order).
  2. Sites control mute-status in upstream sources.

1. Listen

The principle here should be fairly uncontroversial. For the concrete mechanism, I agree with Guido that mute events are not currently well-suited. Either of the following would work for me:

  1. Bespoke events.
  2. Revive the idea of a MuteCause/MuteReason, so that the same mechanism would service both this issue as well as similar ones (see link).

I prefer option 2. To start the ball rolling on a concrete proposal:

enum MuteCause {
  "unspecified",  // Catch-all default.
  "operating-system-choice",
  "user-agent-choice",
  // Extensible to hardware, and to -issue if it's not a -choice.
};

interface MuteEvent : Event {
  /* Exercise for the reader */
};

partial interface MediaStreamTrack {
  // Note that multiple causes might apply concurrently.
  readonly attribute FrozenArray<MuteCause> causes;
};
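Assuming the hypothetical causes attribute above, an app could decide whether its mute button should offer an unmute action; a sketch (the helper name and decision rule are illustrative, not part of the proposal):

```javascript
// Sketch against the proposed (hypothetical) MuteCause values: treat a
// mute as user-actionable if the UA or OS chose it; the catch-all
// "unspecified" cause is not actionable.
function isActionableMute(causes) {
  return causes.includes("user-agent-choice") ||
         causes.includes("operating-system-choice");
}
```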

2. Control

There's some potential for controversy here, but I think we can resolve it.

Jan-Ivar proposed:

If the user unmutes using the site's button, the UA should unmute(!)

While I'm sure VC apps would be delighted to have such control, I am afraid that no security/privacy department in any user agent would ever approve it (unless we add some UA prompt; foreshadowing). Jan-Ivar suggested transient activation and focus as necessary gating mechanisms. These are fine requirements, but they are not sufficient, as any transient activation would look identical to the user agent here, possibly subverting the user's actual intentions if they clicked a mislabelled button. I'd suggest also requiring a PEPC-like prompt. Reasonable app code would then look something like this:

unmuteButton.addEventListener('click', unmuteClicked);

async function unmuteClicked() {
  // If necessary, prompt the user to unmute at UA-level etc.
  if (upstreamUnmute) {
    try {
      await track.unmute();
    } catch (error) {
      return;
    }
  }

  // Proceed with the "normal" unmuting in the app.
  // * Resume remote transmission of media.
  // * Change UX to reflect that clicking the button now means "mute".
  // * Update internal state.
}
youennf commented 10 months ago

I am not sure how much we need a mute reason. Distinct requestUnmute failures might be sufficient.

eladalon1983 commented 10 months ago
  1. You need a MuteReason because mute can happen for inactionable reasons too, like the source not having any new frames to deliver.
  2. Applications might not wish requestUnmute() themselves, but rather provide some reminder/hint to the user about where they muted from (either browser or operating system), and leave it to the user to unmute if they wish.
youennf commented 10 months ago

The mute reasons may vary within the muted period, and firing mute events when only the reason changes is not really appealing. The flow of the user trying to unmute and the app providing a hint based on the failure seems to cover most cases (and it does not preclude adding reasons in the future if we discover this is useful).

guidou commented 10 months ago

The mute reasons may vary within the muted period and firing mute events when only the reason is changing is not really appealing.

I'm OK with the mute event only firing when the muted attribute changes (not the mute-reason attribute). WDYT? The main point is that the muted attribute in its current form is not enough to solve this problem. Having a reason looks like a good way to make the muted attribute useful for solving it. Otherwise, we need a new attribute or a different API surface.

The flow of user trying to unmute and app providing the hint based on the failure seems to cover most cases (and it does not preclude adding reasons in the future if we discover this is useful).

Does this mean you support the approach of having an attribute for the mute cause?

eladalon1983 commented 9 months ago

The mute reasons may vary within the muted period and firing mute events when only the reason is changing is not really appealing.

Why is it not appealing to fire a mute event when the set of reasons changes? (Note that we have a separate unmute event already, btw.)

youennf commented 9 months ago

Does this mean you support the approach of having an attribute for the mute cause?

I see this as a potential improvement while I see an API to request unmute as a blocker. I would focus on the unblocking API.

I also think that an API to request capture to be muted would be useful. The current approach (use enabled=false on all tracks of a device), is a bit tedious and might face backward compatibility issues.

guidou commented 9 months ago

I see this as a potential improvement while I see an API to request unmute as a blocker. I would focus on the unblocking API.

I agree. Let's focus on that first.

I also think that an API to request capture to be muted would be useful. The current approach (use enabled=false on all tracks of a device), is a bit tedious and might face backward compatibility issues.

Also agree. This requires some more thinking because system-level muting is not necessarily equivalent to muting a source or a set of tracks.

eladalon1983 commented 9 months ago

I'm having some trouble parsing the last few messages on this thread. If we're all in agreement that we want to add an API exposing the OS-mute-state, then I'll gladly present something in the next available interim opportunity. Specifically, I'd like to present my proposal here. Before I do - @youennf, you have said that firing an event whenever the reason changes is unappealing. I'd still like to understand why; understanding would help drive a better presentation in said interim.

youennf commented 9 months ago

@guidou and I seem to agree on focusing on the following items (sorted by priority):

Getting back to requestUnmute, here are some possible API shapes (all promise based):

I would tend to unmute at the device level.

jan-ivar commented 9 months ago

The OP assumes users can always unmute. If there's a second actor controlling mute that the user agent cannot affect, then double-mute likely remains the best way to handle that piece. Otherwise we get:

[Image: three states of a mic button]
A. Unmuted  B. Muted (actionable)  C. Muted (unactionable)

Today's apps show C as A.¹ To maintain this, they'd need to distinguish "actionable" from "unactionable" mute.

I'd support those two MuteReasons, but would avoid exposing more cross-origin correlatable information than that. I don't mind re-firing mute when reason changes.

Regarding method shape, I think track.unmute() is all it takes, because

  1. every method that can fail with NotAllowedError is a request, and
  2. muted is already a non-locally-configurable property of the track, so unmute() would be a non-locally-contained action, a signal to the UA, which in turn ultimately controls the scope of that action (though it might be useful to define a minimally affected scope). How many documents a UA ultimately enforces mute or unmute upon seems implementation-defined.

¹ A case could be made for showing C as B, but then nothing happens when the user clicks on it, which seems undesirable. This is an app decision of course.

eladalon1983 commented 9 months ago

@guidou and I seem to agree on focusing on the following items (sorted by priority):

We would have to ask Guido to see if you two agree, but I personally disagree with your prioritization, @youennf.

It's necessary for Web applications like Meet to know when the input device is muted upstream (browser or OS), or else the app can't update its UX to show the input device is muted, which means the user won't understand what's wrong and won't press any app-based unmute button, which then means the app won't even call requestUnmute(), whatever its shape.

The very first step is for the application to know the upstream is muted. That's the top priority.

eladalon1983 commented 9 months ago
[Image: illustration of a VC app's mic button]

Above: an illustration. Without a MuteCause or a similar API, how should the Web app even know that the mic button should be changed to the muted state, and that the onclick handler should request an unmute?

Top priority, imho.

guidou commented 9 months ago

@guidou and I seem to agree on focusing on the following items (sorted by priority):

We agree on the items, but IMO the highest priority is to add a mute reason in some form. Without a mute reason, requestUnmute is largely useless as the application won't be able to know when to call it.

guidou commented 9 months ago

I'd support those two MuteReasons, but would avoid exposing more cross-origin correlatable information than that. I don't mind re-firing mute when reason changes.

Only two MuteReasons is not enough to solve the problem, since there are more than two reasons for which the track could be muted, several of which are actionable in different ways. For example, a UA-level mute is actionable via the new requestUnmute() API, while an OS-level mute might be actionable via the same API on some platforms, but only via an informational message on others. Also, there may be multiple concurrent reasons, with different combinations being actionable (or not) in different ways.
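To illustrate the point above, a sketch (all names hypothetical) mapping cause combinations to different UI strategies:

```javascript
// Hypothetical mapping from mute causes to a UI strategy: a UA-level
// mute gets an unmute button (requestUnmute would likely work), an
// OS-level mute may only get instructions, anything else no action.
function muteUiStrategy(causes) {
  if (causes.includes("user-agent-choice")) return "offer-unmute-button";
  if (causes.includes("operating-system-choice")) return "show-os-instructions";
  return "no-action";
}
```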

The argument about correlatable information is valid only if we actually solve the problem, so we should solve it in the most privacy-preserving way possible. But we must provide a solution that actually works. If we don't solve the problem (and two mute reasons for sure don't), we might as well remove the muted attribute, which would be more private and just as useful as the attribute in its current form.

youennf commented 9 months ago

Exposing mute reasons may expose private information that the user may not want to expose, for instance the fact that the user will take a phone call in the middle of a video conference meeting. I think it is ok to expose that information when the user is trying to unmute. I do not think it is ok to expose that information to any web page that has capture permission/did capture in the past.

I understand the desire to provide detailed mute reasons. I see it more as a UI optimisation than a blocker for mute management by web pages, given this information would be provided, just later in the game when user actually asks to unmute. At this point, I think it is fine to give precise information.

Hence why I would tend to not start with this but start with the unmute API. The errors we would expose in the unmute API would help us with unmute reasons, should we want to do this.

A. Unmuted B. Muted (actionable) C. Muted (unactionable)

I understand this characterisation and agree this is an improvement from a privacy perspective. We could probably implement further mitigations (say expose updated actionable state for visible web pages only for instance). But again, the web application could go to state B when the mute event is fired and go to state C when the web application asks to unmute and unmuting fails, with a more precise hint of what the actual issue is.

2. How many documents a UA ultimately enforces mute or unmute upon seems implementation-defined.

It would be good if we could agree on a wider scope than track (at least source, maybe device). If we cannot, so be it. Maybe we could agree on guidelines.

Without a mute reason, requestUnmute is largely useless as the application won't be able to know when to call it.

@guidou, I do not really understand this, can you clarify why a web application won't be able to know when to call? I would think the web application would call this API whenever the user clicks on the resume capture button. Do you agree on the assumption that unmuting would be tied to a user gesture?

eladalon1983 commented 9 months ago

I understand the desire to provide detailed mute reasons. I see it more as a UI optimisation than a blocker for mute management by web pages, given this information would be provided, just later in the game when user actually asks to unmute.

I still don't understand how the app and user can work to requestUnmute() if the app doesn't even know the upstream is muted. Could you please explain?

Do you agree on the assumption that unmuting would be tied to a user gesture?

Yes, which is why the app MUST know when the upstream is muted, so that it could show the user an unmute-button.

guidou commented 9 months ago

Exposing mute reasons may expose private information that the user may not want to expose, for instance the fact that the user will take a phone call in the middle of a video conference meeting. I think it is ok to expose that information when the user is trying to unmute. I do not think it is ok to expose that information to any web page that has capture permission/did capture in the past.

I agree with that. Note that we're talking about pages with a live (but muted) track (not pages that captured in the past but are not currently capturing). I'm fine with restricting the API to that case.

I understand the desire to provide detailed mute reasons. I see it more as a UI optimisation than a blocker for mute management by web pages, given this information would be provided, just later in the game when user actually asks to unmute. At this point, I think it is fine to give precise information.

I'm fine with restricting more precise information to requestUnmute(), after a user gesture.

Hence why I would tend to not start with this but start with the unmute API. The errors we would expose in the unmute API would help us with unmute reasons, should we want to do this.

The problem is that the current muted attribute is not good enough for the application to decide when to call requestUnmute(). In that sense, having the two mute reasons @jan-ivar proposed will actually work, since more detailed information will be provided at requestUnmute time.

A. Unmuted B. Muted (actionable) C. Muted (unactionable)

I understand this characterisation and agree this is an improvement from a privacy perspective. We could probably implement further mitigations (say expose updated actionable state for visible web pages only for instance). But again, the web application could go to state B when the mute event is fired and go to state C when the web application asks to unmute and unmuting fails, with a more precise hint of what the actual issue is.

My initial thought is that if the application updates its UI as if requestUnmute() was going to be useful, and then it turns out it's not useful at all, it will be a really bad user experience. But I'm fine with the "actionable"/"not actionable" proposal (or a similar one, such as a new boolean attribute) + more detailed info on requestUnmute().

Without a mute reason, requestUnmute is largely useless as the application won't be able to know when to call it.

@guidou, I do not really understand this, can you clarify why a web application won't be able to know when to call? I would think the web application would call this API whenever the user clicks on the resume capture button. Do you agree on the assumption that unmuting would be tied to a user gesture?

I agree on the assumption that unmuting would be tied to a user gesture. I strongly disagree with the assumption that the current muted attribute provides the application with enough information of when/how to offer a UI for that user gesture. We can fix it with a new boolean attribute (I might even call it muted if the name wasn't already taken), or with the "actionable"/"not actionable" preliminary mute reason, with extra details provided upon user-initiated unmute.
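The boolean-attribute idea could be sketched like this (the `actionable` flag and state names are assumptions, following the A/B/C states discussed earlier in the thread):

```javascript
// Sketch: pick the site's button state from `muted` plus a
// hypothetical boolean `actionable` flag.
function buttonState(muted, actionable) {
  if (!muted) return "unmuted";          // state A
  return actionable ? "muted-clickable"  // state B: offer unmute
                    : "muted-disabled";  // state C: informational only
}
```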

eladalon1983 commented 9 months ago

My initial thought is that if the application updates its UI as if requestUnmute() was going to be useful, and then it turns out it's not useful at all, it will be a really bad user experience. But I'm fine with the "actionable"/"not actionable" proposal (or a similar one, such as a new boolean attribute) + more detailed info on requestUnmute().

We should consider that some apps might never wish to call requestUnmute(), which might involve a user prompt by the UA or OS. They might instead prefer to offer a hint: "fix this in the OS" or "fix this in the browser". Having information ahead of calling requestUnmute() would help here. The concern around user privacy, which would motivate us not to expose information ahead of calling requestUnmute(), does not strike me as credible; the user trusts the application to listen to their microphone(!), and knowing whether the user muted in the OS or the UA is next to no information in comparison. (The app can surmise some mute anyway, given the silence.)

jan-ivar commented 9 months ago

@guidou, I do not really understand this, can you clarify why a web application won't be able to know when to call?

@youennf because apps today don't flip to a placeholder image in response to mute. Users would be unable to flip it back, which is bad.

Apps likely want an assurance this was an (actionable) "user" mute before flipping to a placeholder image, so that clicking on it is likely to work. Rejection from unmute() is too late to determine this.

Apps likely don't want unactionable (non-user) mute to affect their mute toggles at all (like today). The goal is to align browser toggles with in-content toggles.
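A minimal sketch of that guarded handling, assuming a hypothetical reason field on the mute event ("user" vs. "temporal", per the proposal in this thread; not a shipped API):

```javascript
// Mock UI and handler; a real app would attach this via
// track.addEventListener("mute", ...). The event's "reason" field is
// hypothetical, taken from the proposal in this thread.
const ui = { showsMutedIcon: false };

function onMute(event) {
  // Only an actionable (user) mute flips the in-content toggle, so
  // clicking it back is likely to work.
  if (event.reason === "user") {
    ui.showsMutedIcon = true;
  }
}

onMute({ reason: "temporal" });  // non-actionable: toggle untouched
const afterTemporal = ui.showsMutedIcon;
onMute({ reason: "user" });      // actionable: toggle flips
```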

It's really a separate userMuted state in my mind. But Safari already ships a pause UX (which I'd expect to fire actionable mute), so it seems more web-compatible to add a reason to mute than to add new usermute events at this point.

jan-ivar commented 9 months ago

If we solve this, then another source of actionable mute/unmute could be the lock screen on phones, if we fix https://github.com/w3c/mediasession/issues/279#issuecomment-1138926629.

eladalon1983 commented 9 months ago

@jan-ivar, thoughts about the solution here, under 1. Listen, which I intend to present at the upcoming interim? Note that "-choice" vs. "-issue" should be a good proxy for what's actionable vs. unactionable.

youennf commented 9 months ago

Rejection from unmute() is too late to determine this.

Apps will need to deal with rejection anyway: there might be races, or the user might change their decision. I don't really see much benefit in exposing a boolean about the likelihood of unmute success.

Apps likely don't want unactionable (non-user) mute to affect their mute toggles at all (like today). The goal is to align browser toggles with in-content toggles.

Safari UX is currently showing track muted/unmuted state, whether muting is caused by the user or the OS. If the web application wants to align with Safari UX, it does not need anything more than the muted boolean value.

From a user perspective, the most important information to convey is whether capture is on or off (off meaning that there is a guarantee it stays off without a user action). I am not sure I understand why the web application would want to present a capture-on icon while the UA UX and OS UX are telling the user that capture is off.

I'd like to understand what UA UX are envisioned for Chrome and Firefox in that space.

eladalon1983 commented 9 months ago

If the web application wants to align with Safari UX, it does not need anything more than the muted boolean value.

The mute value can reflect a temporary absence of media for any reason, including mere silence.

Quoting the spec: "Muted refers to the input to the MediaStreamTrack. If live samples are not made available to the MediaStreamTrack it is muted."

Therefore, the mute value is not a good indication for the Web application that the user muted an upstream entity like the browser or operating system. A Web application cannot spec-compliantly assume the mic/camera is muted and show relevant app-level UX based on mute as it is specified today.

I am not sure I understand why the web application would want to present a capture-on icon while the UA UX and OS UX are telling the user that capture is off.

That's exactly the misalignment we seek to remedy with a MuteReason or similar API.

youennf commented 9 months ago

A Web application cannot spec-compliantly assume the mic/camera is muted and show relevant app-level UX based on mute as it is specified today.

I might buy this definition. Note that this would be a mere boolean in that case (defined as something like UA thinks showing a muted state for the web page makes sense and probably does it on its own UI), not an enumeration.

Therefore, the mute value is not a good indication for the Web application that the user muted an upstream entity like the browser or operating system.

Whether the OS muted or the user decided to mute does not change the fact that the UA should present UI to the user that capture on the web page is muted.

I'd like to understand the cases you have in mind where capture for a web page would be muted but UA would show a live capture icon for this web page. AFAIK, there are no such cases in Safari right now, muted is all that is needed currently for web pages to update their own UI.

One case I know of is image capture but this seems very specific and we could instead decide to change the image capture spec. Could you list additional cases?

eladalon1983 commented 9 months ago

A Web application cannot spec-compliantly assume the mic/camera is muted and show relevant app-level UX based on mute as it is specified today.

I might buy this definition.

Great to hear. So let's discuss how to extend the spec. I proposed an enum. I still think it's a workable solution - I can discuss separately why I think the privacy properties of this proposal are good. But assuming this is not an ideal proposal, what is? Could you please clarify your suggestion of a boolean?

I'd like to understand the cases you have in mind where capture for a web page would be muted but UA would show a live capture icon for this web page.

Let's examine things not through the lens of macOS and Safari, but through the lens of a hypothetical spec-compliant UA running on some unknown OS. Both the UA and OS offer some form of mic/camera-muting capabilities. Distinctly.

Imagine the user is on Meet/Teams/Zoom and has given mic+camera access. The user has NOT muted anything in the UA. Now let's consider what happens when the user mutes the mic through the OS.

Should the UA change anything it shows the user about mic-access for the app? I believe not, or else the user unmuting the OS-level control would yield a result that could confuse the user - immediately giving the app mic-access again. But the video-conferencing app would want to show the mute-icon as changed. This is due to the following distinction:

  • The app's UX speaks to the user of what it does with the audio, and whether it has access to audio.
  • The UA's UX communicates to the user what permissions the app currently has.

The app also wants to allow the user to press this muted-state icon, at which point the app may call requestUnmute(). This call to requestUnmute() would either go directly to the UA, or be relayed from the UA to a similar API in the OS, depending on which mute state(s) apply at that time (the UA knows, even if it does not reveal this info to the Web app).
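In code, that flow might look like the following. requestUnmute() is the proposal under discussion, not a shipped API; the track and UI objects here are mocks:

```javascript
// Called from the app's click handler, so the transient-activation
// requirement discussed in this thread is satisfied.
async function onUnmuteButtonClick(track, ui) {
  try {
    await track.requestUnmute();   // the UA may relay this to the OS
    ui.showsMutedIcon = false;
  } catch (e) {
    // e.g. NotAllowedError: fall back to guiding the user elsewhere.
    ui.hint = "Unmute in your browser or OS settings.";
  }
}

// Mock track where the UA grants the request.
const grantingTrack = { requestUnmute: async () => {} };
const ui = { showsMutedIcon: true, hint: "" };
const done = onUnmuteButtonClick(grantingTrack, ui);
```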

One case I know of is image capture but this seems very specific and we could instead decide to change the image capture spec. Could you list additional cases?

I don't understand this question.

bradisbell commented 9 months ago

The mute value can reflect a temporary absence of media for any reason, including mere silence.

For what it's worth, as a developer this is unexpected. Mute, to me, implies that there is something muting the track, and not simply that silent samples were detected.

I think that is the key to this whole discussion... For something to be muted, something must have muted it. If something is doing the muting, then that something must be controllable. The end users expect to find these controls in several locations, including the web application, OS controls, media controls on keyboards, and buttons on headsets. In the ideal case, the decision to mute or unmute can be made from any of those control surfaces. Someone should be able to mute on their Bluetooth microphone, and unmute that microphone in the application.

Is there agreement on this principle?

guidou commented 9 months ago

A Web application cannot spec-compliantly assume the mic/camera is muted and show relevant app-level UX based on mute as it is specified today.

I might buy this definition. Note that this would be a mere boolean in that case (defined as something like UA thinks showing a muted state for the web page makes sense and probably does it on its own UI), not an enumeration.

Therefore, the mute value is not a good indication for the Web application that the user muted an upstream entity like the browser or operating system.

Whether the OS muted or the user decided to mute does not change the fact that the UA should present UI to the user that capture on the web page is muted.

I'd like to understand the cases you have in mind where capture for a web page would be muted but UA would show a live capture icon for this web page. AFAIK, there are no such cases in Safari right now, muted is all that is needed currently for web pages to update their own UI.

The confusion here is about the current muted attribute. If I understand you correctly, you are willing to buy the definition that the muted attribute as currently specified does not represent a value that the application can use to update the UI to indicate that the capture is muted. If so, then we are all in agreement here that we need a new attribute specifically for microphones and cameras.

The new attribute should be defined such that the application can use it to update its UI as you suggest. IIUC, you might accept a new boolean attribute for this. @jan-ivar proposed a similar one (a 2-value enum with 'actionable' and 'not-actionable' values) and @eladalon1983 is proposing an enum with more detailed values (or a set of such enum values).

Also, IIUC, you are OK with providing more detailed values upon calling requestUnmute(), which should be gated by a user gesture indicating intent to unmute.

Perhaps we should start discussing the potential values for the more detailed information and then decide how to make them available. I for sure would like to make a distinction between muting via UA controls, OS settings, and HW (e.g., a dedicated mute button). requestUnmute won't be able to unmute all of them, but being able to tell the difference allows the app to provide the user with guidance. I'm OK with the concept of gating values that are more privacy-sensitive behind a user gesture, but I'd like more details about those values and why they require more privacy restrictions than the ones already required for a live track.

jan-ivar commented 9 months ago

I like a boolean, e.g. an abstract "actionable" / "unactionable" enum. But nothing more.

But I just realized I think we need it on the unmute event as well. So maybe "user" / "temporal" are better names?

Let me explain:

Examples of temporal mute/unmute might be bug 1598374 or where we dropped the ball on transient activation in gUM:

UA could choose to mute for the user if it detects a page accessing camera or microphone on pageload or without interaction.

It's an application's decision how to react to mute and unmute and how to combine that with the app's own enabled state and cam/mic toggles, so knowing whether the "user" caused the mute (or unmute!) or whether it's "temporal" seems useful.

E.g. many video conferencing lobbies default to the user's mute preferences from last time they joined (so they're not dropped into that corporate-wide meeting with camera and mic unmuted for the world). This is a case where a "user" unmute maybe should override that preference (and flip the enabled toggle), whereas a "temporal" unmute (concluding an earlier "temporal" mute) should probably not.
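A sketch of that lobby logic, with "user" and "temporal" as hypothetical unmute reasons (not a shipped API):

```javascript
// The saved lobby preference wins for a "temporal" unmute (an
// interruption ending), but an explicit "user" unmute overrides it.
const app = { micEnabled: false };   // lobby default: join muted

function onUnmute(reason) {
  if (reason === "user") {
    app.micEnabled = true;           // explicit user intent: flip the toggle
  }
  // "temporal": capture resumed, but keep the app's own toggle as-is.
}

onUnmute("temporal");
const afterTemporal = app.micEnabled;  // preference preserved
onUnmute("user");
```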

jan-ivar commented 9 months ago

The more I think about it, the more I wonder if a separate track.userMuted state with usermute and userunmute events might be better for web compat, to not break apps oblivious to muteReason.

Or, maybe we could repurpose the togglecamera and togglemicrophone media session API for this? Isn't their purpose already to toggle the toggles?

E.g. (mic only for brevity)
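One possible shape of that repurposing: `togglemicrophone` is a real Media Session action and `setMicrophoneActive()` a real method, but the overall wiring is only an idea from this thread, and the mediaSession object below is a stand-in for navigator.mediaSession so the sketch is self-contained:

```javascript
// Stand-in for navigator.mediaSession.
const mediaSession = {
  handlers: {},
  micActive: true,
  setActionHandler(name, handler) { this.handlers[name] = handler; },
  setMicrophoneActive(active) { this.micActive = active; },
};

let appMicEnabled = true;            // the app's own toggle state

mediaSession.setActionHandler("togglemicrophone", () => {
  // Keep the UA's toggle and the app's toggle aligned; a real app
  // would also update track.enabled here.
  appMicEnabled = !appMicEnabled;
  mediaSession.setMicrophoneActive(appMicEnabled);
});

mediaSession.handlers["togglemicrophone"]();  // simulate the UA firing it
```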

youennf commented 9 months ago

I agree with @bradisbell that this is the common case (the user muted capture somewhere, wants a coherent view, and wants to be able to unmute easily). We can leave fixing the other mute reasons to a later time.

Perhaps we should start discussing the potential values for the more detailed information and then decide how to make them available.

I agree we should dig into this. I'd like to start working on requestMute and requestUnmute model, list the error cases, and learn from those. That should give us a good foundation to see what additional information we can and should provide at mute time.

But I just realized I think we need it on the unmute event as well. So maybe "user" / "temporal" are better names?

In terms of API shape, it would probably be better on MediaStreamTrack than on the event. Could you explain how a web application would react to user vs. temporal?

Apps likely want an assurance this was an (actionable) "user" mute before flipping to a placeholder image, so that clicking on it is likely to work.

This assumes that the UA knows whether this is actionable or not at the time of mute. When mute is triggered by the OS, the UA might only know whether this is actionable at the time requestMute is called, not at the time it fires the mute event. I would prefer web applications to cope with this uncertainty by default.

  • The app's UX speaks to the user of what it does with the audio, and whether it has access to audio.
  • The UA's UX communicates to the user what permissions the app currently has.

I don't know any UA (or OS) that shows such UX in that way. If these two UX do not represent the same thing but the user is confusing them, the UA should fix its UX. In general, I think we should base API design on what UAs and OSes do, not on what they might do in the future.

alvestrand commented 9 months ago

Is there an app use case for requestMute() that is not covered by track.enabled = false? I've seen that it's been mentioned on the thread (first time April 20), but searching the thread for "requestMute" unearthed nothing about its imagined use case.

youennf commented 9 months ago

Is there an app use case for requestMute() that is not covered by track.enabled = false?

The main concern is how web-compatible it would be for track.enabled = false to trigger muting. Web apps currently do not expect that setting track.enabled=false will trigger being muted. Some web apps will set track.enabled=true and will not know that they need to call requestUnmute.

There might be web applications that want to set enabled=false but do not want to be muted, for instance to keep the temporary capture permission that active capture provides.

Third, it is a convenience method that allows capture to be muted synchronously with a single call (instead of reaching every track, which may live in different contexts such as workers).
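The difference from track.enabled can be sketched with a mock (requestMute() is the proposed method, not a shipped API; per the current spec, enabled=false silences output but does not fire mute):

```javascript
// Mock track: enabled=false fires nothing, while the proposed
// requestMute() sets muted and fires the mute event.
function makeMockTrack() {
  const track = { enabled: true, muted: false, events: [] };
  track.requestMute = () => {
    track.muted = true;
    track.events.push("mute");
  };
  return track;
}

const track = makeMockTrack();
track.enabled = false;                        // no mute event fires
const eventsAfterDisable = track.events.length;
track.requestMute();                          // mute event fires
```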

eladalon1983 commented 9 months ago

@eladalon1983 is proposing an enum with more detailed values (or a set of such enum values).

Not necessarily just an enum. Another possible approach is to have a few fields for different purposes, such as:

A MuteDetails class could collect these. Some of them might only be exposed if some conditions hold, like a successful call to requestUnmute(), or any future thing we introduce.

Concretely:

enum MuteSource { "unspecified", "user-agent", "operating-system", "hardware" };

interface MuteReason {
  readonly attribute MuteSource source;
  readonly attribute boolean potentiallyActionable;  // More details below.
};

partial interface MediaStreamTrack {
  sequence<MuteReason> getMuteReasons();
};

And the spec would say that mute is re-fired whenever the sequence<MuteReason> changes and is non-empty, and unmute is fired whenever it changes from non-empty to empty.
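The firing model just described could be sketched like this (plain-object mock; getMuteReasons() is the proposal above, not a shipped API):

```javascript
// mute re-fires whenever the reasons change and remain non-empty;
// unmute fires when they go from non-empty to empty.
function makeMockTrack(log) {
  let reasons = [];
  return {
    getMuteReasons: () => reasons.slice(),
    setMuteReasons(next) {
      const wasMuted = reasons.length > 0;
      reasons = next;
      if (next.length > 0) log.push("mute");
      else if (wasMuted) log.push("unmute");
    },
  };
}

const log = [];
const track = makeMockTrack(log);
track.setMuteReasons([{ source: "user-agent", potentiallyActionable: true }]);
track.setMuteReasons([{ source: "user-agent" }, { source: "hardware" }]);
track.setMuteReasons([]);
```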

[...] to not break apps oblivious to muteReason.

Apps that assume all mute events are fired in response to an explicit user action are non-compliant (see earlier quote of spec). Both @jan-ivar and @youennf have in the past argued for the principle that we don't need to worry about breaking non-compliant apps. So we can safely reuse the generic mute event and expose MuteReasons directly on the track.

This assumes that UA knows whether this is actionable or not at the time of mute.

Which it often does. Examples:

  1. OS-level muting in ChromeOS.
  2. UA-level muting in any browser that provides them, such as Safari.

When mute is triggered by the OS, UA might only know whether this is actionable at the time of calling requestMute, not at the time it fires the mute event.

I propose that one of the MuteReason fields should be potential amenability: anything that requestUnmute() might help with, given the currently available information. (Reasons why it might still not work could include heuristics employed by the UA to avoid abuse.)

I don't know any UA (or OS) that shows such UX in that way. If these two UX do not represent the same thing but user is confusing them, UA should fix its UX.

I believe all UAs work like that. When Chrome shows that Meet is listening to your mic, Chrome cannot tell the user if the application is transmitting the audio remotely, processing it locally, or doing neither, allowing it to go straight to /dev/null. Chrome only tells the user - this app has access to your mic right now.

In general, I think we should base API design on what UA and OS do, not on what UA and OS might do in the future.

Could you please clarify what this statement refers to? Note that ChromeOS already shows mic/camera controls, and extending such controls to Chromium on arbitrary platforms is something that is frequently discussed.

alvestrand commented 9 months ago

Is there an app use case for requestMute() that is not covered by track.enabled = false?

The main concern is how web-compatible it would be for track.enabled = false to trigger muting. Web apps currently do not expect that setting track.enabled=false will trigger being muted. Some web apps will set track.enabled=true and will not know that they need to call requestUnmute.

I was trying to read the spec to figure out if a track will fire onmute or set muted=true when enabled=false. I couldn't find any language that linked the two; I may have missed it.

youennf commented 9 months ago

@guidou, you mentioned that mute can be fired in Chrome for various reasons. Can you list some of them?

youennf commented 9 months ago

I propose that one of the MuteReason fields should be potential amenability.

So far, for Safari, all mute reasons would be considered actionable (provided I understand what actionable actually means). The only case I know of is https://www.w3.org/TR/image-capture/#dom-imagecapture-takephoto and I question the design here.

I'd like to understand in which existing cases Chrome or Firefox fire mute events that would not be classified as actionable. That will help in understanding/defining this notion and in validating whether it is useful to define/expose. "Actionability" or "potential amenability" is a fuzzy notion; I would prefer we avoid it if we can and instead converge on what mute means for capture tracks.

I was trying to read the spec to figure out if a track will fire onmute or set muted=true when enabled=false. I couldn't find any language that linked the two; I may have missed it.

This was mentioned in https://github.com/w3c/mediacapture-main/issues/642 to handle the case of setting track.enabled = true to restart capture (potentially executed by a page in the background, potentially after a long time without capture).