richtr / html-media-focus

[DEPRECATED] Connecting remote control interfaces and events with web content

Provide implicit remote control access and media interaction by media type #6

Open richtr opened 9 years ago

richtr commented 9 years ago

HTMLMediaElement can be used to play out various types of media such as music, notification sounds, WebRTC streams and alarms. If we could differentiate between these different types of media content then user agents would be able to (a) provide contextual remote control access depending on the type of media playing out, and (b) enforce interactions between different kinds of media content (e.g. ducking music when a notification sound plays, or pausing all other media when a WebRTC voice call begins).

HTML media elements should be able to describe the 'intent' of their media content with the following API addition:

// Values taken from https://developer.android.com/reference/android/media/AudioManager.html
enum HTMLMediaKind { "" /* empty string */, "alarm", "dtmf", "music", "notification", "ring", "voice" };

partial interface HTMLMediaElement {
  attribute HTMLMediaKind kind;
};

The default kind value is an empty string. The kind attribute is limited to only known values of HTMLMediaKind.

A media element's kind content attribute can be declared in HTML:

<video src="short_ping.webm" kind="notification"></video>

or a media element's kind attribute can be set in JavaScript:

<script>
  var myAudio = document.createElement('audio');
  console.log(myAudio.kind); // --> ""
  myAudio.src = "audio.mp3";
  myAudio.kind = "music";
  // myAudio.outerHTML === '<audio src="audio.mp3" kind="music"></audio>' // --> true
</script>
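Because the kind attribute is limited to only known values, assigning an unrecognized value leaves it at its default, the empty string. A minimal sketch of that reflection behavior (the value list comes from the enum above; the helper function name is hypothetical, for illustration only):

```javascript
// Known values of the proposed HTMLMediaKind enum.
const KNOWN_KINDS = ["", "alarm", "dtmf", "music", "notification", "ring", "voice"];

// Sketch of "limited to only known values" reflection: unknown
// values fall back to the default value (the empty string).
function reflectKind(value) {
  return KNOWN_KINDS.includes(value) ? value : "";
}

console.log(reflectKind("music"));   // "music"
console.log(reflectKind("podcast")); // "" (unknown value, falls back to default)
```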

Any interaction between different kinds of media content can then be handled by each user agent on a platform-by-platform basis. For example, a desktop browser may define interactions between the different types of media elements according to the following table:

(This table must always be read from the left column first. For example: 'notification →' 'ducks' 'music ↑' should be read as: "when a new notification media element reaches the playing state, the user agent should duck all music media elements currently in a playing state.")

| new ↓ / playing → | default ↑ | alarm ↑ | dtmf ↑ | music ↑ | notification ↑ | ring ↑ | voice ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| default → | - | - | - | - | - | - | - |
| alarm → | pauses | pauses | pauses | ducks | ducks | ducks | - |
| dtmf → | - | - | pauses | - | - | - | - |
| music → | - | - | - | pauses | ducks | - | - |
| notification → | - | ducks | - | ducks | - | ducks | ducks |
| ring → | ducks | ducks | pauses | pauses | ducks | - | - |
| voice → | pauses | pauses | - | pauses | pauses | - | - |
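A user agent could model this matrix as a simple lookup. The sketch below is purely illustrative: the matrix data is taken from the table in this issue, while the function and action names are hypothetical.

```javascript
// interactions[newKind][playingKind] gives the action the user agent applies
// to already-playing media of playingKind when a new element of newKind
// reaches the playing state ("pause", "duck", or null for no action).
const interactions = {
  default:      { default: null,    alarm: null,    dtmf: null,    music: null,    notification: null,    ring: null,   voice: null },
  alarm:        { default: "pause", alarm: "pause", dtmf: "pause", music: "duck",  notification: "duck",  ring: "duck", voice: null },
  dtmf:         { default: null,    alarm: null,    dtmf: "pause", music: null,    notification: null,    ring: null,   voice: null },
  music:        { default: null,    alarm: null,    dtmf: null,    music: "pause", notification: "duck",  ring: null,   voice: null },
  notification: { default: null,    alarm: "duck",  dtmf: null,    music: "duck",  notification: null,    ring: "duck", voice: "duck" },
  ring:         { default: "duck",  alarm: "duck",  dtmf: "pause", music: "pause", notification: "duck",  ring: null,   voice: null },
  voice:        { default: "pause", alarm: "pause", dtmf: null,    music: "pause", notification: "pause", ring: null,   voice: null },
};

// Hypothetical helper: what should happen to media of playingKind
// when a new element of newKind starts playing?
function actionFor(newKind, playingKind) {
  return interactions[newKind][playingKind];
}

console.log(actionFor("notification", "music")); // "duck"
console.log(actionFor("voice", "music"));        // "pause"
```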

By introducing .kind, we get the capabilities described above: contextual remote control access and well-defined interactions between different kinds of media content.

Media categorization could also be applied to AudioContext and MediaController objects, as well as MediaSession.

Pinging @foolip, @jernoble, @marcoscaceres for their comments.

marcoscaceres commented 9 years ago

This is an interesting proposal, but I'm worried about having different privileges for different types - I fear everyone would just lie or screw it up (and browsers would be forced to treat everything as "default"). Also, these seem to be outside the use cases we've been discussing - and seem to leapfrog native, which we've stated a few times should not be a goal (only parity). My opinion is that we should stick to just controlling simple audio/video for now.

I'd like to see us more fully explore MediaSession before we talk about anything else.

richtr commented 9 years ago

I fear everyone would just lie or screw it up (and browsers would be forced to treat everything as "default").

If each media type has specific behavior then web developers can choose the most appropriate one based on the behavior they want. The trick is to find a suitable set of categories that match the experience users would expect, which differs for e.g. alarms vs. music vs. notifications vs. WebRTC-based audio/video.

Also, these seem to be outside the use cases we've been discussing - and seem to leap native, which we've stated a few times should not be a goal (only parity).

This seems to be consistent with native capabilities on iOS (see: audio session modes) and Android (see: [streamType](https://developer.android.com/reference/android/media/AudioManager.html#requestAudioFocus%28android.media.AudioManager.OnAudioFocusChangeListener,%20int,%20int%29)). Each of these APIs allows media to be identified as belonging to a particular class of usage.

I also wonder if we are missing a couple of use cases. With ServiceWorkers I could set a wake-up alarm. How would that be displayed in the notification tray? Another use case would be that the user may typically expect all currently playing out media to pause once they join a WebRTC call. Is it worth documenting and addressing these use cases too?

My opinion is that we should stick to just controlling simple audio/video for now.

Happy to do that. Though we lose some subtlety in the interaction between web apps and different types of media usage.

I'd like to see us more fully explore MediaSession before we talk about anything else.

Sounds good. Given current platform limitations having something tied to observable platform media feels like the baseline for any solution right now.

foolip commented 9 years ago

I think we probably ought to be able to distinguish between different kinds of audio, to get parity with native platforms. In particular ducking for notifications seems impossible to achieve otherwise, without heuristics based on media duration or something. In the MediaSession proposal, the obvious place to put this is on MediaSession itself.

marcoscaceres commented 9 years ago

@foolip, @richt, can you articulate the use case for the distinction a bit more abstractly (without the API proposal) and send a PR to the whatwg repo's README.md describing why it's needed? Giving concrete examples of how iOS and Android use this would be extremely helpful (as well as how web pages would make use of this in practice).

marcoscaceres commented 9 years ago

err, @foolip I mean (fixed typo above, sorry)

foolip commented 9 years ago

https://github.com/whatwg/media-keys/pull/4 has some use cases which could be solved by distinguishing between at least "normal", "notification" and "voice" kinds.

richtr commented 9 years ago

I've updated this issue's original description with more details of how it works.

There is also some precedent in the platform for this approach. The HTMLTrackElement interface takes a .kind attribute, limited to only known values, in the same way this issue proposes for HTMLMediaElement.

I haven't seen any proposal on how MediaSession will handle this yet. Is there any further input on this?