video-dev / media-ui-extensions

Extending the HTMLVideoElement API to support advanced player user-interface features
MIT License

Live Edge Window #6

Closed by cjpillsbury 1 year ago

cjpillsbury commented 1 year ago

Overview & Purpose

The Problem

For live/"DVR" content, it's common for a player UI to indicate whether or not playback is currently "at the live edge". However, due to the nature of HTTP Adaptive Streaming (HAS), the live edge cannot be represented as a simple point/moment in the media's timeline. This is for a few reasons:

  1. At the manifest/playlist level, the available live segments are typically discovered by the client by periodically re-fetching the manifest/playlist files using in-specification polling rules. These files may also be updated by the server in a discontiguous manner as segments become ready for streaming. Since the client does not know when segments will be added by the server, the advertised "true live edge" will "jump" discontiguously through this process, so what counts as the live edge needs to be modeled as a plausible "range" or "window" rather than a point.
  2. HAS provides segmented media via a client-side pull based model (most typically, e.g., a GET request), where each segment has a duration. This means that a client must first "see" the segment (via the process described above), then GET and buffer the segment, and then (eventually) play the segment, starting at its start time. Here again, this entails a discontiguous, per-segment update of the timeline, which again needs to be accounted for via a "range" or "window", rather than a discrete point.
  3. In order to avoid live edge stalls, both MPEG-DASH and HLS have a concept of a "holdback" or "offset," which inform client players that they should not attempt to fetch/play some set of segments from the end of a playlist/manifest. Luckily, this can be treated as an independent offset calculation applied to e.g. the seekable.end(0) of a media element, which can then be used as a reference for any other live edge window computation.

(Visual representation may help here)

A concrete sub-optimal (not worst case) but in-spec example, using HLS:

Let's say a client player fetches a live HLS media playlist just before the server is about to update it with the following values:

```
# ...
# Unfortunately, EXT-X-TARGETDURATION is only an upper limit (>= any EXTINF duration) after rounding to the nearest integer
#EXT-X-TARGETDURATION: 5
# Client side "LIVE EDGE" will be 5.46 seconds into the segment below, aka 3 * 5 (target duration) = 15 seconds from the playlist's end
# NOTE: Assume playback begins at the beginning of the segment below, since some client players choose to do this to avoid stalling/rebuffering, meaning playback starts -5.46 seconds from the "LIVE EDGE"
#EXTINF:5.49
#EXTINF:4.99
#EXTINF:4.99
#EXTINF:4.99
```

The server then updates the playlist with two larger-duration segments (in spec, and possible under sub-optimal but not unheard-of conditions) before the client re-requests the playlist after 4.99 seconds (the minimum amount of time the player must wait before re-fetching) and continues re-fetching the available segments. The updated playlist:

```
# ...
#EXT-X-TARGETDURATION: 5
# NOTE: Current playhead will be 4.99 seconds into the segment below, assuming optimal buffering and playback conditions at 1x playback speed
#EXTINF:5.49
#EXTINF:4.99
#EXTINF:4.99
# New Client side "LIVE EDGE" will be 0.97 seconds into the segment below, aka 3 * 5 (target duration) = 15 seconds from the playlist's end
#EXTINF:4.99
#EXTINF:5.49
#EXTINF:5.49
```

In this example, playback started 5.46 seconds behind the computed "LIVE EDGE" and, after a single reload of the playlist, ended up 11.45 seconds behind the next computed "LIVE EDGE" without any stalls/rebuffering. Note that, even in this example, we do not account for round trip times (RTT) for fetches, time to parse playlists, times to buffer segments, initial seeking of the player's playhead/currentTime, and the like. Note also that, even without those considerations, the playhead still ends up > 2 * TARGETDURATION behind the "LIVE EDGE".
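The arithmetic above can be checked with a short script. All values come directly from the example playlists; the 3 × target-duration holdback is the HLS default when no explicit HOLD-BACK is provided:

```javascript
// Compute how far behind the "LIVE EDGE" the playhead ends up, using the
// example playlists above. All durations are in seconds.
const TARGET_DURATION = 5;
const HOLD_BACK = 3 * TARGET_DURATION; // implied default: 3x target duration

// "LIVE EDGE" = total advertised playlist duration minus the holdback.
const liveEdge = (segmentDurations) =>
  segmentDurations.reduce((sum, d) => sum + d, 0) - HOLD_BACK;

// First fetch: playback starts at the beginning of the first segment (t = 0).
const firstPlaylist = [5.49, 4.99, 4.99, 4.99];
const startBehind = liveEdge(firstPlaylist) - 0; // ~5.46 seconds behind

// After one reload: the playhead has advanced 4.99 seconds at 1x speed.
const secondPlaylist = [5.49, 4.99, 4.99, 4.99, 5.49, 5.49];
const laterBehind = liveEdge(secondPlaylist) - 4.99; // ~11.45 seconds behind
```

This reproduces the drift described in the example without modeling RTT, parse time, or buffering delays, all of which would only make the numbers worse.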

The solution

Since this information can be derived from a media element's "playback engine" (i.e., by parsing the relevant playlists or manifests), the extended media element should have an API to advertise what the live edge window is for a given live HAS media source. Call this the "live window offset".

Additionally, due to consideration (3), above, we should treat the seekable.end(0) as the end time of a live stream accounting for the per-specification "holdback" or "delay".

Proposed API

Constrained meaning of seekable.end(0) as "live edge" (with HOLD-BACK/etc) for HAS

To account for the distinction between the live edge of the media stream as advertised by the playlist or manifest vs. the latest time a client player should actually try to play (based on per-specification rules and additional information also provided in the playlist or manifest), extended media elements SHOULD set the seekable.end(0) value to account for this offset. This is assumed for all computations of the "live edge window," where seekable.end(0) will be the presumed "end" of the window/range, already taking the aforementioned offset into account. With these offsets presumed, seekable.end(0) may be treated as synonymous with a client player's "live edge," and these terms are treated as interchangeable in this initial proposal.

For RFC8216bis12 (aka HLS)

  1. "Standard Latency" Live

seekable.end(0) should be based on the inferred or explicit HOLD-BACK attribute value, where:

HOLD-BACK

The value is a decimal-floating-point number of seconds that indicates the server-recommended minimum distance from the end of the Playlist at which clients should begin to play or to which they should seek, unless PART-HOLD-BACK applies. Its value MUST be at least three times the Target Duration.

This attribute is OPTIONAL. Its absence implies a value of three times the Target Duration. It MAY appear in any Media Playlist.

  2. Low Latency Live

seekable.end(0) should be based on the explicit PART-HOLD-BACK (REQUIRED) attribute value, where:

PART-HOLD-BACK

The value is a decimal-floating-point number of seconds that indicates the server-recommended minimum distance from the end of the Playlist at which clients should begin to play or to which they should seek when playing in Low-Latency Mode. Its value MUST be at least twice the Part Target Duration. Its value SHOULD be at least three times the Part Target Duration. If different Renditions have different Part Target Durations then PART-HOLD-BACK SHOULD be at least three times the maximum Part Target Duration.
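Taken together, the two HLS cases above can be sketched as a single computation. The function and parameter names here are illustrative only, not part of the proposed API:

```javascript
// Sketch of the constrained seekable end for HLS. Durations are in seconds.
// Assumes the playlist tags have already been parsed elsewhere.
function hlsSeekableEnd({ playlistEndTime, targetDuration, holdBack, partHoldBack, lowLatency }) {
  if (lowLatency) {
    // PART-HOLD-BACK is REQUIRED when playing in Low-Latency Mode.
    return playlistEndTime - partHoldBack;
  }
  // HOLD-BACK is OPTIONAL; its absence implies 3x the Target Duration.
  const effectiveHoldBack = holdBack ?? 3 * targetDuration;
  return playlistEndTime - effectiveHoldBack;
}
```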

For ISO/IEC 23009-1 (aka "MPEG-DASH")

  1. "Standard Latency" Live

seekable.end(0) should be based on the explicit MPD@suggestedPresentationDelay (OPTIONAL) attribute, when present, otherwise it may be whatever the client chooses based on its implementation rules. Per the spec:

it specifies a fixed delay offset in time from the presentation time of each access unit that is suggested to be used for presentation of each access unit... When not specified, then no value is provided and the client is expected to choose a suitable value.

  • From §5.3.1.2 Table 3 - Semantics of MPD element

(NOTE: there may be additional suggestions/recommendations available via the DASH IOP)

  2. Low Latency Live

seekable.end(0) should be based on the ServiceDescription -> Latency@target attribute. Note that this value is an offset not of the manifest timeline, but rather of the (presumed NTP or similarly synchronized) wallclock time. Per the spec:

The service provider's preferred presentation latency in milliseconds compared to the producer reference time. Indicates a content provider's desire for the content to be presented as close to the indicated latency as is possible given the player's capabilities and observations.

This attribute may express latency that is only achievable by low-latency players under favourable network conditions.

(NOTE: This implies that the value could change marginally over time based on precision and other wallclock time updates in the runtime environment. However, since these differences should be minor, it's likely fine to treat this value as static for the purposes of this document, and it can likely be implemented as such in an extended media element.)

liveWindowOffset

Definition

An offset or delta from the "live edge"/seekable.end(0). An extended media element is playing "in the live window" iff: mediaEl.currentTime > (mediaEl.seekable.end(0) - mediaEl.liveWindowOffset).
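The definition translates directly into a small helper. `mediaEl` is assumed to be an extended media element exposing the proposed liveWindowOffset property; the helper itself is illustrative, not part of the proposal:

```javascript
// "In the live window" iff currentTime is within liveWindowOffset of the
// live edge. Per the constrained meaning above, seekable.end(0) already
// accounts for the per-specification holdback/delay.
function isInLiveWindow(mediaEl) {
  if (mediaEl.seekable.length === 0) return false; // no seekable range yet
  const liveEdge = mediaEl.seekable.end(0);
  return mediaEl.currentTime > liveEdge - mediaEl.liveWindowOffset;
}
```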

Possible values

Recommended computation for RFC8216bis12 (aka HLS)

  1. "Standard Latency" Live

liveWindowOffset = 3 * EXT-X-TARGETDURATION

Note that this is a cautious computation. In many stream + playback scenarios, 2 * EXT-X-TARGETDURATION will likely be sufficient. However, with that less cautious value, there may be edge cases where standard playback will "hop in and out of the live edge," so the more cautious value is recommended here.

  2. Low Latency Live

liveWindowOffset = 2 * PART-TARGET

Unlike "standard" segments (#EXTINFs), parts' durations must be <= #EXT-X-PART-INF:PART-TARGET (without rounding). Also unlike "standard," HLS servers must add new partial segments to playlists within 1 (instead of 1.5) Part Target Duration after adding the previous Partial Segment. This means that, even under sub-optimal conditions, low latency HLS should end up with a much smaller liveWindowOffset.
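The two recommended computations can be sketched together. Parsing of the relevant playlist tags is assumed to have happened elsewhere, and the function name is illustrative:

```javascript
// Recommended liveWindowOffset computations for HLS, per the sections above.
function hlsLiveWindowOffset({ targetDuration, partTarget, lowLatency }) {
  if (lowLatency) {
    // Parts must be <= PART-TARGET without rounding, and servers must add new
    // parts within 1 Part Target Duration, so 2x provides sufficient headroom.
    return 2 * partTarget;
  }
  // Cautious recommendation: 3x target duration. 2x may often suffice, but
  // can "hop in and out of the live edge" in edge cases.
  return 3 * targetDuration;
}
```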

Recommended computation for ISO/IEC 23009-1 (aka "MPEG-DASH")

TBD

Open Questions

  1. What should we actually call the property?
    • In https://github.com/video-dev/media-ui-extensions/issues/4, we decided to call the numeric value representing a live "DVR" window the targetLiveWindow. Since this value represents a window for the "live edge" and not for "available live content to seek through/play", having both refer to the "live window" will likely be confusing. In the current related preliminary implementation in Media Chrome, we refer to the related attribute as the livethreshold. Should that be the name here as well? Do we want the name to try to capture the fact that this is an "offset" value from the "live edge"/seekable.end(0)?
  2. Distinct event or repurposed event?
    • The above proposal makes no mention of a corresponding livewindowoffsetchange event. While we cannot likely rely on any of the built in HTMLMediaElement events, we should be able to guarantee computation of the relevant values before dispatching the streamtypechange event, as documented in https://github.com/video-dev/media-ui-extensions/issues/3. Is this repurposing of the event acceptable? Should we consider a more generic event name that more clearly relates to states announced for stream type, DVR, live edge window offset, and potentially additional future properties/state?
heff commented 1 year ago

This is a great writeup. 👍 Love the additions to HLS/DASH to support this cleanly. I don't understand all the implications of those specifically, but I'm sure others can weigh in.

Since this is the media-ui-extensions I think it's worth laying out the UI problems that are being solved more. It's gonna be hard for others to really evaluate the API without that clear context. As a summary pass:

seekable.end(0)

seekable.end(seekable.length-1), right? Not sure if it's actually possible for seeking but it definitely matters for buffered.

liveWindowOffset

Definition and name feel good to me. 👍

The most accurate/verbose name might be liveEdgeWindowOffset, right? In order to avoid confusion with anything else that might be considered a "live window". I feel like we should never refer to the DVR window specifically as "live". It's intentionally "(R)ecorded", not live, once you get behind the live edge. i.e. "Live + DVR" feels more accurate than "Live DVR".

Is this repurposing of the event acceptable?

If we can't point to any real reasons why this number might change otherwise, it feels good to try to bundle it as a starting point at least. Either that or we just say that every new state gets its own change event, and be done with it. I could go either way. The latter would remove friction in any specific independent proposal.

cjpillsbury commented 1 year ago

I feel like we should never refer to the DVR window specifically as "live". It's intentionally "(R)ecorded", not live, once you get behind the live edge. i.e. "Live + DVR" feels more accurate than "Live DVR".

Not sure what you mean here. Per your suggested name here https://github.com/video-dev/media-ui-extensions/issues/4#issuecomment-1344924246, our current mostly settled proposal on modeling DVR will rely on a property called targetLiveWindow. Is this you changing your mind? Am I misunderstanding something here?

gkatsev commented 1 year ago

seekable.end(0)

seekable.end(seekable.length-1), right? Not sure if it's actually possible for seeking but it definitely matters for buffered.

Yes, it should be seekable.end(seekable.length-1), though, in practice they're going to be equivalent in most implementations for live streams.

cjpillsbury commented 1 year ago

seekable.end(seekable.length-1)

@heff - Per @gkatsev's callout, I believe this will always be identical to seekable.end(0) in browser implementations, but you're right, we might as well avoid unnecessary presumptions here and always refer to it as seekable.end(seekable.length-1)

heff commented 1 year ago

Is this you changing your mind? Am I misunderstanding something here?

No, you're right to be confused. :) From this context targetLiveWindow now sounds more misinterpretable. We're at least clearly using 'liveWindow' to mean two different things between the proposals now, and that's not great. I don't think we have to go change targetLiveWindow, but if we don't I'd lean towards something like liveEdgeOffset here instead. An alt for targetLiveWindow otherwise might be targetSeekableWindow. Open to either path, we should just avoid the double meaning.

cjpillsbury commented 1 year ago

we should just avoid the double meaning.

Agree 💯. I'm going back and forth on your rename proposals. As each hints at, the problem is both scenarios are about "windows" and both are related to live. One is the "live seekable window;" the other is the "live edge window." They're also both offsets. Since Names Are Hard™️, I'm leaning towards liveEdgeOffset. It unfortunately loses some context by dropping "window," which may introduce some ambiguity/confusion, but I think that'll be true for any renaming.

heff commented 1 year ago

A couple of additional thoughts:

With that I like targetLiveSeekableDuration. Also open to Window.

For the live edge, would it be better to do liveEdgeStart? Feels like the most common operation is going to be:

```
if (currentTime > seekableEnd - liveEdgeWindowOffset) {
  // show red light
}
```

When it could just be:

```
if (currentTime > liveEdgeStart) {
  // show red light
}
```

With that we could lean on progress or durationchange events for updates. Or just timeupdate would even be fine.
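A sketch of how that derived property might look; the names follow this comment and are not settled API, and the class is a stand-in for an extended media element:

```javascript
// Hypothetical derived property per the comment above: liveEdgeStart is
// just the seekable end minus the live edge window offset.
class ExtendedMediaElementSketch {
  constructor(seekableEnd, liveEdgeWindowOffset) {
    this.seekableEnd = seekableEnd;
    this.liveEdgeWindowOffset = liveEdgeWindowOffset;
    this.currentTime = 0;
  }
  get liveEdgeStart() {
    return this.seekableEnd - this.liveEdgeWindowOffset;
  }
  get showRedLight() {
    // The "show red light" check from the comment, now a one-liner.
    return this.currentTime > this.liveEdgeStart;
  }
}
```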

luwes commented 1 year ago

I might be missing something but why not just HTMLMediaElement.getLiveSeekableRange() ?

that covers both #4 and #6 in one familiar API. it's also a bit similar to how HTMLMediaElement and MediaSource both have duration

I def also am on board with that the naming should be close to that LiveSeekable naming

cjpillsbury commented 1 year ago

I might be missing something but why not just HTMLMediaElement.getLiveSeekableRange() ?

@luwes responded in the PR to keep the conversation there, both in comments and by updating the proposal to hopefully add some clarity. The short version:

There are actually two distinct "live windows" we're modeling in #4 vs. here.

  1. The "Seekable Live Window," which is the range of presentation times a player may "seek to", either programmatically or via a UI. (this is primarily captured in #4)
  2. The "Live Edge Window," which is the range of presentation times that should be treated as counting as "the live edge," effectively a "fudge factor" to account for the segmented, pull-based nature of HAS standards. (this is primarily captured in the proposal here and its corresponding proposal PR)
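The distinction between the two windows can be sketched as follows; all names and values here are illustrative, not proposed API:

```javascript
// Two distinct "live windows," per the list above. Both are ranges ending at
// the (holdback-adjusted) seekable end, but they model different things.
function liveWindows({ seekableStart, seekableEnd, liveEdgeOffset }) {
  return {
    // "Seekable Live Window" (#4): the range a player may seek to,
    // programmatically or via a UI.
    seekableLiveWindow: [seekableStart, seekableEnd],
    // "Live Edge Window" (#6): the range that still counts as "the live
    // edge" -- a "fudge factor" for the segmented, pull-based nature of HAS.
    liveEdgeWindow: [seekableEnd - liveEdgeOffset, seekableEnd],
  };
}
```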
luwes commented 1 year ago

okay, I somehow thought that this was true:

start livestream time -> liveSeekableStart -> liveSeekableEnd -> real seekable.end(seekable.length-1)

and the proposed liveEdgeOffset = seekable.end(seekable.length-1) - liveSeekableEnd

is this not the case?

cjpillsbury commented 1 year ago

@luwes No that's not quite right. Check out the diagram I added here https://github.com/video-dev/media-ui-extensions/pull/7/files?short_path=6415912#diff-6415912cbdb551127eb5975514c274cb87904befd9ca77ec25808f682ab492d7 ("Diagram with HLS reference values for context") and let me know if that clears things up. Also, if you could, let's move the conversation to the PR to try to follow Gary's process.

cjpillsbury commented 1 year ago

Closing this Issue per our discussed process to avoid multi-channel conversations. Can re-open if corresponding proposal PR is rejected.