w3c / media-source

Media Source Extensions
https://w3c.github.io/media-source/
Other
267 stars 57 forks source link

Sample Flags - Correctness vs Reality #221

Closed itsjamie closed 5 years ago

itsjamie commented 5 years ago

In debugging a few issues, I've started to look at how encoders are defining sample_flags at the ISOBMFF level, and noticed through working with content and talking with others in debugging this content that often times the data was incorrect. So I looked at what would go into defining the information as correct as possible.

However, what I found was that many encoders were taking advantage of default_sample_flags in the tfhd box, and first_sample_flags in the trun box.

This works for achieving playback, but through talking with @wolenetz and debugging a rise of failures related to our MP4 metadata in the Chrome implementation, we were notified that some frames were incorrectly marked as not keyframes when they in fact were.

In our case, verbosely defining the sample_flags for every sample entry lead to a box size increase of ~10% (600 bytes) per segment and was also of an unknown value as some demuxers only look for the first keyframe currently.

I believe that currently multiple trun boxes could be defined for each GOP, but in my investigation, I didn't see any encoder that output fragmented MP4 in this way. Curious if anyone knew why?


Alternatively, I wanted to run by here, and if it makes sense raise for a revision to 14496-12 for including into a future revision of the spec.

The proposal is to add a new flag to the trun box, indexed-sample-flags-present. It would be treated as mutually exclusive to sample-flags-present, and first-sample-flags-present.

It would be an optional field + variable field array combination of the type:

unsigned int(32) indexed_sample_count
{
   unsigned int(32) sample_index;
   unsigned int(32) sample_flags;
}[indexed_sample_count]

I believe it would require a new minor_compliance brand in the same way as default-base-is-moof. This would allow a single trun box to define specific samples that had non-default flags easily. I believe striking a balance.

Thank you for any consideration of this.

jyavenard commented 5 years ago

Attempting to change the mp4 container to get around bad content is I believe a moot exercise as browsers are moving away from using any container informations to determine a frame type, but instead parsing the raw bit stream. Firefox is already doing so, and I believe Chrome is in the process of doing so too.

itsjamie commented 5 years ago

@jyavenard to be clear I'm not trying to "get around" bad content.

I'm trying to strike a balance of producing content which is spec compliant to a fault (many extraneous bytes that no one uses) but includes all the information necessary for playback with a decode error rate hopefully approaching zero percent.

For example, we had a test mux done that verbosely defined every sample-flag to correct values. However, the first-sample-flag and its value were left in accidentally, and this confused the box parsing code in Edge (which is spec compliant). The content wouldn't play and failed silently. This content however played in every other browser implementation.

If all MSE implementations are already moving away from reading MP4 metadata for keyframe-ness, would that be something that could be included as part of the specification for ISOBMFF parsing for MSE?

We were looking for examples of others who were producing good content at the MP4 level and could not find an open or closed source encoder which encoded ISOBMFF content that either defined verbose sample flags or, through another mechanism, correctly defined every sync-sample as such. It appears there is a defacto industry standard to not define the MP4 metadata completely, to ensure that the first sample flag is correct and that the rest are set to the unknown values.

michaeljweaver commented 5 years ago

@itsjamie Did you find any examples of others producing good content at the MP4 level? I've found in the last stable Chrome release the MSE streams I have play the first video chunk and then the video hangs. There's debug saying some frames are incorrectly marked as keyframes, but apart from that no errors. So I'm assuming this is what has caused the MSE player to hang.

wolenetz commented 5 years ago

@michaeljweaver it's likely there is buffered range gap produced as a result. Chrome stable recently has ramped up to 99% of sessions using spec-compliant buffering-by-PTS (not DTS) ranges. That buffering relies on correct keyframeness metadata to manage GOP presentation intervals, and since so much of content previously had mp4 metadata for avc keyframeness that mismatched the reality embedded in the actual AVC coded frames, at the same time as switching to buffering-by-PTS, Chrome also now trusts the AVC keyframe metadata over the MP4. tl;dr: Please inspect your media element's buffered range and currentTime when playback stalls. It's likely there is a gap. If you believe the behavior is incorrect, please file a https://crbug.com.

michaeljweaver commented 5 years ago

Thanks @wolenetz, yes you are correct there is a buffered range gap produced as a result. There are two frames that are not buffered correctly. I'm also getting the debug message "Warning: presentation time of most recently processed random access point (0.08 s) is later than the presentation time of a non-keyframe (0 s) that depends on it. This type of random access point is not well supported by MSE; buffered range reporting may be less precise." Which lead me to this thread https://github.com/w3c/media-source/issues/187

On further investigation, I think the MSE stream is SAP Type 2, where the decode order is IBBPBB, but the presentation order is BBIBBP. So the PTS of the first two B frames precedes that of the random access point. As you say in this thread this hasn't been a problem in Chrome yet due to it incorrectly using decode time intervals for managing all buffered range overlaps. Do you think the recent changes in Chrome stable could have caused SAP Type 2 streams to no longer play? It's not clear on if this is supported or not anymore, although the message suggests it isn't.

wolenetz commented 5 years ago

@michaeljweaver yes, as noted by #187, the MSE spec doesn't guide implementations around handling the buffering/overlap removal/etc of the portion of a SAP-Type-2 GOP's presentation interval prior to that GOP's keyframe. In Chrome, in a continuous (in DTS) series of coded frames that comprise one or more SAP-Type-2 GOPs that are themselves directly adjacent (that is, the GOPs are adjacent and sequential and don't overlap) in PTS intervals, then they will be buffered continuously, though the very first GOP's start time will be reported as the that GOP's keyframe PTS. This is a compromise made in the implementation to support known SAP-Type-2 streams while the spec remains ambiguous. Hopefully, this is enough information to unblock you. If not, please file a https://crbug.com (and if you can populate the component as Internals>Media>Source, that will reach me more directly).

michaeljweaver commented 5 years ago

@wolenetz thanks, I have raised https://bugs.chromium.org/p/chromium/issues/detail?id=954115 The component was a dropdown which didn't have Internals>Media>Source as an option so I set to Blink>MediaStream. I mention in the bug report, today I tried hacking the tfdt values so they were 2 frames less than what they should be (chunk 1 0.92, 2 is 1.84, 3 is 2.76 etc) This means I can get continuous playback from the start of the stream but can see frames being dropped during playback.

wolenetz commented 5 years ago

@michaeljweaver Thank you for reporting. Let's continue the discussion of this specific case on that crbug.

wolenetz commented 5 years ago

Closing the loop, @michaeljweaver, it was indeed SAP-Type-2 GOPs causing the problem. Especially when the GOPs are appended out-of-order (e.g. GOP3, then GOP1, then GOP2, etc.), in the worst case, each SAP-Type-2 GOP's preceding nonkeyframes may be of sufficient duration to prevent gap coalescence in some MSE implementations like Chrome as a result of #187 not being fixed. Workaround in Chrome, at least, would be to append them in order GOP1,2,3, etc.: so long as the DTS sequence is continuous across those GOPs (and the GOPs presentation intervals don't overlap), then they should be buffered continuously without gaps. A longer term spec fix as an alternative or even complementary to #187 is #160. I expect to put forth a proposal to solve at least the standardization of the gap tolerance/coalescence portion of #160 soon.