ISO BMFF Byte Stream format should support layered (scalable) encodings

mwatson2 commented 9 years ago

ISO/IEC 14496-15 describes the carriage of layered (scalable) encodings in ISO Base Media File Format. Examples include SVC and MVC.

Such layered encodings can be carried within a single track or across multiple tracks, for example one track per layer. In the multiple-track case, when Movie Fragments are used, there are two ways the data can be organized into movie fragments:

(1) A single moof / mdat(s) pair contains the data for all of the tracks in each media segment.
(2) The tracks are split across several consecutive moof / mdat(s) pairs.

Option (1) is supported by our existing MSE byte stream format, but option (2) is not, because we require that each "media segment" consist of a single moof and its mdat(s). Option (2) has an advantage: typically, the sequence of moof / mdat(s) pairs containing the "base layer" can be processed by a device that does not understand the scalable encoding.
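To make the difference concrete, here is a minimal sketch that models a media segment as a list of top-level box types and tests it against the current rule (a single moof followed by one or more mdats). The layouts and the function name `matches_current_definition` are illustrative assumptions, not taken from the spec text, and any optional leading boxes are ignored.

```python
import re

# Hypothetical box-type layouts for one media segment that carries a base
# layer and an enhancement layer in separate tracks.
option_1 = ["moof", "mdat"]                  # one moof describes both tracks
option_2 = ["moof", "mdat", "moof", "mdat"]  # base-layer pair, then enhancement pair


def matches_current_definition(box_types):
    """Return True if the sequence is exactly one moof followed by 1+ mdats."""
    return re.fullmatch(r"moof( mdat)+", " ".join(box_types)) is not None


print(matches_current_definition(option_1))
print(matches_current_definition(option_2))
```

Under this simplified model, the option (1) layout passes the check and the option (2) layout does not, which is the gap the issue describes.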

So, I propose we modify our definition of Media Segment for the ISO BMFF byte stream format to consist of a sequence of one or more ( moof, mdat (, mdat)* ) structures where:

all data referred to in each moof appears in the immediately following sequence of mdats
all presentation timestamps fall within the range specified in the first moof of the sequence
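
As a rough illustration of the proposed structure, the sketch below walks the top-level boxes of a byte buffer and checks that they form one or more ( moof, mdat (, mdat)* ) groups. It checks box ordering only; verifying the two conditions above (data references and timestamp ranges) would require parsing the contents of each moof, which is not attempted here. The function names `iter_boxes`, `is_proposed_media_segment`, and `make_box` are hypothetical helpers for this example.

```python
import struct


def iter_boxes(data):
    """Yield (box_type, end_offset) for each top-level ISO BMFF box in data."""
    offset = 0
    while offset + 8 <= len(data):
        size, raw_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit largesize follows the type field.
            if offset + 16 > len(data):
                raise ValueError("truncated largesize header")
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # A size of 0 means the box extends to the end of the data.
            size = len(data) - offset
        if size < header or offset + size > len(data):
            raise ValueError("truncated or malformed box")
        offset += size
        yield raw_type.decode("latin-1"), offset
    if offset != len(data):
        raise ValueError("trailing bytes after last box")


def is_proposed_media_segment(data):
    """Check the ordering rule: one or more ( moof, mdat (, mdat)* ) groups."""
    types = [box_type for box_type, _ in iter_boxes(data)]
    i = 0
    groups = 0
    while i < len(types):
        if types[i] != "moof":
            return False
        i += 1
        if i >= len(types) or types[i] != "mdat":
            return False
        while i < len(types) and types[i] == "mdat":
            i += 1
        groups += 1
    return groups >= 1


def make_box(box_type, payload=b""):
    """Build a box with a 32-bit size header (helper for the example only)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Two consecutive moof / mdat pairs, as in option (2).
segment = (make_box(b"moof") + make_box(b"mdat", b"\x00" * 4)
           + make_box(b"moof") + make_box(b"mdat", b"\x00" * 4))
print(is_proposed_media_segment(segment))
```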

If this is agreeable, I'll prepare the PR.

mwatson2 commented 9 years ago

I have created a proposal for this issue: https://github.com/w3c/media-source/pull/8

wolenetz commented 9 years ago

Thanks for filing this, Mark. I am getting a response ready and should be able to share details (concerns and questions) by Sept. 18.

wolenetz commented 9 years ago

tl;dr: @mwatson2 and I discussed the origin of this feature request, and I don't think the current proposal is something we could use in MSE v.current, though we're working to find product-specific ways, perhaps outside of full MSE spec compliance, to move forward.

Regarding this issue (and the associated pull request): This appears to be more than just a registry edit (details below). In fact, this is a feature request that requires multiple significant spec changes (MSE ISO BMFF, MSE, and HTML5). These changes will introduce significant delay in getting to Proposed Recommendation (PR) and risk moving MSE backwards in the W3C process.

Hence, I recommend that we track this as a new feature request as part of a later version of MSE (as briefly discussed at the April 2015 f2f) and/or explore it in an incubator (or, of course, find an alternative standardizable and practical solution).

If the proposed multi-track approach is the only mechanism for multi-layer, this bug impacts and depends on changes to more than the MSE ISO BMFF byte stream spec:

The mechanisms for managing coherency across tracks, in at least the coded frame processing [1] and coded frame eviction [2] algorithms in the MSE spec, make no mention of any need to correlate coded video frames across multiple tracks.
The initialization segment received algorithm [3] is based on the underlying HTML5 spec: there can be at most one selected video track in the element's VideoTrackList [4].

The current proposal would need changes to [1-4], too.

[1] http://w3c.github.io/media-source/#sourcebuffer-coded-frame-processing
[2] http://w3c.github.io/media-source/#sourcebuffer-coded-frame-eviction
[3] http://w3c.github.io/media-source/#sourcebuffer-init-segment-received
[4] http://www.w3.org/TR/html5/embedded-content-0.html#dom-videotrack-selected

ISO BMFF Byte Stream format should support layered (scalable) encodings #7