w3c / media-source

Media Source Extensions
https://w3c.github.io/media-source/
Other
268 stars 57 forks source link

Coded frame processing is not clear around what to do for SAP Type 2 overlaps #187

Open wolenetz opened 7 years ago

wolenetz commented 7 years ago

SAP Type 2 (allowed for {audio,video}/mp4) random access points can have a frame later in decode sequence than the random access point, but with a lower presentation time than the random access point.

Suppose there is some buffered video (in presentation timeline) (caps=SAP Type 2 random access point, # = decode sequence) [a2][A0][a1]

And a new decode sequence (e.g., following an abort() or with a decode sequence discontinuous with previous append) is appended: [B0][b1][b2], having presentation time sequence [b2][B0][b1], with [b2]'s presentation interval overlapping the presentation time of [a1], but neither [B0] nor [b1] overlap any of the previously buffered video.

In the current coded frame processing algorithm, [a1] would be retained. Both [b2] and [a1] would have the same presentation time: which should be retained? Should the overlap of [a1] by [b2] cause [a1] to be removed? (And [a1]'s removal would also trigger the removal of [a2] because the it has decode dependency on [a1].) If you seek to the PTS of [b2] / [a1], which should show?

The coded frame processing algorithm already dictates what to do for the first frame following a decode discontinuity: if it overlaps the presentation start time of some previously buffered frame, remove that previously buffered frame and everything previously buffered in decode order from that removed frame until the next previously buffered random access point. It also maintains a track buffer highest end time for use in removing previously buffered media (from a previous coded frame group) overlapped by the new coded frame group as its presentation interval grows into the future.

However, for SAP Type 2 random access points allow a coded frame group's presentation interval to grow into the past, too.

This seems to strongly motivate also tracking a track buffer lowest start time, and adding overlap removal logic to the spec text.

This hasn't been a problem in Chrome yet due to it having incorrectly been using decode time intervals, not presentation time intervals, for managing all buffered range overlaps and reporting buffered range times; now that I'm fixing that implementation to be correct, I see that SAP Type 2 appears to be insufficiently supported by the MSE spec.

@jdsmith3000 @jyavenard @paulbrucecotton FYI - please comment. -edited to correct to @paulbrucecotton

wolenetz commented 7 years ago

@acolwell I did some spelunking through the repo history, and the text around supporting SAP Type 1 and 2 existed in the first commit: e657ebfcaca57afe40b36f3c834c8e359f1e5cca. Needless to say, it appears SAP Type 2 support (where PTS of later frame in decode order can precede PTS of random access point) has been allowed since MSE was proposed. Am I missing something? WDYT about adding lowest start time logic to the coded frame processing algorithm?

wolenetz commented 7 years ago

@plehegar @paulbrucecotton, would adding something like lowest start time be considered an erratum? It clarifies behavior for improved interoperability for a use case that has been required to be supported since MSE was originally proposed. -edited to correct to @paulbrucecotton

jyavenard commented 7 years ago

For H264, Firefox ignores the container's sample table flags, and instead uniquely rely on the frame containing a NAL of type IDR_SLICE

When we did rely on the container's sample table flags, we use the same logic as FFmpeg to determine if the frame depended on any others:

For all other codec (except audio, where we marked all frames as keyframe):

An old comment I wrote on the matter: "the tfhd box set default sample flags. The definitions for those flags are: The sample flags field in sample fragments (default_sample_flags here and in a Track Fragment Header Box, and sample_flags and first_sample_flags in a Track Fragment Run Box) is coded as a 32-bit value. It has the following structure: bit(4) reserved=0; unsigned int(2) is_leading; unsigned int(2) sample_depends_on; unsigned int(2) sample_is_depended_on; unsigned int(2) sample_has_redundancy; bit(3) sample_padding_value; bit(1) sample_is_non_sync_sample; unsigned int(16) sample_degradation_priority;

with an extra note "The flag sample_is_non_sync_sample provides the same information as the sync sample table [8.6.2]. When this value is set 0 for a sample, it is the same as if the sample were not in a movie fragment and marked with an entry in the sync sample table (or, if all samples are sync samples, the sync sample table were absent).”. Here we have no sync table (no stss box).

So we use the mask 0x01010000; if the operation sampleFlag & 0x1010000 gives us 0, we assume it’s a keyframe ."

I have never seen a case where the frames marked as random access point in the container weren't the first frame of the GOP (in both pts and dts order).

So in short, does it matter ? :)

wolenetz commented 7 years ago

I have never seen a case where the frames marked as random access point in the container weren't the first frame of the GOP (in both pts and dts order). So in short, does it matter ? :)

It would be excellent if this were truly the case (as it had been my assumption until early last week, when reparsing the ISOBMFF spec text around SAP Type 2 indicated that we must support this case you haven't ever seen). However, we've seen folks at least having problems with similar streams in non-MSE case, and we have seen some cameras that produce SAP Type 2 streams. I'll add some logging to Chrome to detect this case. Do you have any Firefox telemetry confirming your statement that no-one ever uses H264 with SAP Type 2?

acolwell commented 7 years ago

@wolenetz I believe I only included SAP Type 2 because I was just quickly looking to see which types mapped to "closed GOPs". I likely didn't pay attention to the fact that it allowed the presentation time of a later access unit to come before the first access unit. I suspect SAP Type 2 is pretty rare since it could cause all sorts of complications. My initial instinct is to just drop frames that have presentation times before the PTS of the first access unit, but I suspect if the encoder is sufficiently crazy this could break decode dependencies. I suppose you would also wiggle out of this by dropping the SAP Type 2 requirement since you likely didn't prove interop on it. :)

I suspect trying to support later access units with earlier presentation times will open a can of worms that just isn't worth solving. I can't really see a way to do it without having full dependency knowledge of the full GOP because the earliest frame could be the last one in decode order if you happen to have a particularly cruel encoder (say one that encodes frames in reverse order!).

jyavenard commented 7 years ago

The other difficulty added here is that without deep analysis of the h264 stream, there's no real way to know. All of the MSE specs imply that we work at the container/demuxer level. Which is completely content agnostic.

It would be really unfortunate to change this...

wolenetz commented 7 years ago

OK. I'll add some telemetry logging to Chrome and proceed under the assumption that SAP Type 2 occurrences, if they occur, are rare.

Based on the results of such telemetry and perhaps further discussion, we might want to change this issue to "Deprecate ISOBMFF SAP Type 2 support".

paulbrucecotton commented 7 years ago

would adding something like lowest start time be considered an erratum?

The W3C Process document clearly defines classes of changes. It would appear that this may be a Type 3 change ie one that might impact conformance.

From the current HME WG charter:

Note: As of May 2017, the HTML Media Extensions remains to make corrections that do not add new features to its Recommendation(s) accomplished within the scope of the charter.

HME WG is currently chartered to do only Type 1, Type 2 and Type 3 changes. If I am correct that your proposed change is a Type 3 change then it is scope of the HME WG.

/paulc HME WG Chair

cconcolato commented 7 years ago

FYI, in GPAC, we have never seen SAP Type 2 used. And IIRC, it was 'invented' during the DASH standardization within MPEG for the sake of completeness.

wolenetz commented 7 years ago

@cconcolato (https://github.com/w3c/media-source/issues/187#issuecomment-313625388) that is more excellent news. Thank you for providing that data point.

wolenetz commented 7 years ago

With https://chromium-review.googlesource.com/c/563017/2 applied to Chrome to detect SAP Type 2, spot-checking several sites showed that at least nest.com's desktop site uses SAP Type 2 + MSE; youtube and netflix desktop site spot-checks showed only SAP Type 1. So it appears SAP Type 2 is used at least in some cases. I'm not certain this warrants adding lowest end time case to fully support SAP Type 2 versus a deprecation warning. I've chosen the latter for the moment.

jyavenard commented 7 years ago

Upon further thought, the use of SAP type 2 will have no negative impact with firefox MSE stack. There's no need for the I-Frame to be the earliest presented frame of the GOP.

wolenetz commented 7 years ago

So far, preliminary (Chrome M61 - dev channel) shows a small % of occurrences of SAP-type-2 usage in MSE. However, this data is very preliminary, I'll update with public numbers once the related telemetry logging is in a more significant portion of Chrome usage.

jernoble commented 4 years ago

WebKit has received reports that this situation seems to be occurring regularly on CNN.com videos, e.g. this one. Overlapping appends of SAP type 2 content causes "green flashes" where the decoder tries to apply a "type 2" B-frame to a "type 1" I-frame.

jernoble commented 4 years ago

(Though technically, the content seems to be MPEG-2TS, not strictly ISOBMFF content. But since TS is supported as a MSE stream type, the issue still needs fixing somehow.)

wolenetz commented 2 years ago

Chrome has a user that reported SAP-Type-2 buffering causing gaps, due to buffering segments backwards in (chunk3,2,1 for example). Obtaining clarity in the spec would help implementations both support SAP-Type-2 in MSE and also improve interop for such media. See https://crbug.com/1008756