w3c / media-source

Media Source Extensions
https://w3c.github.io/media-source/

Discourage support of inband codec configurations #218

Open chcunningham opened 6 years ago

chcunningham commented 6 years ago

The ISOBMFF registry has forever included this paragraph under "Initialization Segments"

The user agent MUST support codec configurations stored out-of-band in the sample entry, and for codecs which allow codec configurations stored inband in the samples themselves, the user agent SHOULD support codec configurations stored inband.

Followed by the note:

For example, for codecs which include SPS and PPS parameter sets, for maximum content interoperability, user agents are strongly advised to support both inband (e.g., as defined for avc3/avc4) and out-of-band (e.g., as defined for avc1/2) storage of the SPS and PPS.

While this definitely supports a wider breadth of content, it also creates a problem of how to describe such streams prior to attempting to play them. If an inband configuration differs from the initialization segment, the UA may determine mid-playback that it can no longer support the stream.

If config changes are always done out of band, we can fail at append-time if a configuration is not supported. And, to the extent the config is described by the source buffer's content type string, we may even fail upon calling isTypeSupported, before downloading any part of the file.
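
For illustration, a minimal sketch of that probe-time check (the codec strings are just examples, H.264 Main 3.1):

```ts
// With out-of-band configs, support can be checked before any bytes are
// fetched. The codec strings here are illustrative examples.
const outOfBand = 'video/mp4; codecs="avc1.4d401f"'; // SPS/PPS in sample entry
const inband = 'video/mp4; codecs="avc3.4d401f"';    // SPS/PPS in the samples

if (!MediaSource.isTypeSupported(outOfBand)) {
  // Fail here, before downloading anything, and pick another rendition.
}
```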

FWIW, Chrome has historically not supported in band config changes. We trust the init segment. We do not attempt to detect/signal in-band changes from the decoder. Media using in band signalling may still work, but YMMV.

I'd like to change the quoted text to indicate that, while inband changes MAY be supported, they are discouraged for interop reasons and to avoid playback failures.

chcunningham commented 6 years ago

@wolenetz

KilroyHughes commented 6 years ago

There is ambiguity in the term "configuration change" that is obvious in adaptive streaming, where each video track is usually encoded with a different "configuration" but is expected to seamlessly parse, decrypt, decode, and display in the same media pipeline "configuration".

In practice, an initialization segment is selected that will initialize decryption, decoding, display buffers, refresh rate, etc. at a high enough level to process all media segments in the Switching Set. If a decoder is initialized at e.g. level 5 and subsequent segments only require level 4, that doesn't require a configuration change. However, adaptive scaling requires changing the parameter set indexed by the video so the decoder will correctly crop the active samples from the encoded blocks so they can be seamlessly scaled to the selected display aperture. Temporal scaling (i.e. downshifting from 60fps to 30fps) requires a different "configuration" change to double sample durations while maintaining the same refresh rate. Chrome routinely supports these "configuration" changes, which are common in adaptive streaming.

There are many parameters in an ISOBMFF Header that could change at a switch or a splice, but no standard for which changes should be seamlessly rendered. A configuration change like instantiating a new MSE buffer or a different or higher level codec is usually not fast enough to be seamless. Given a sufficiently robust player design, anything could be seamlessly switched and spliced (just use multiple decoders or fast decoders with long output buffers and live latency, and splice the decoded samples). But, for a few billion deployed devices and browsers that use MSE/EME to splice segments of compressed bitstreams, lots of constraints are necessary for seamless playback while avoiding configuration changes that would interrupt decoding, decryption, or rendering.

The problem for MSE media pipelines is not knowing whether a configuration change is major or minor when all they see is an appended Header, which could be the result of a seamless switch or splice, or of a splice with parameter changes that require a significant reconfiguration. If the MSE buffer is destroyed and a new one instantiated, the media pipeline knows exactly what to do, but that takes too much time on too many devices to be an acceptable user experience.

From what I've heard, most MSE media pipelines today assume Headers in MSE only provide the small parameter changes used for adaptive streaming, so don't do any media pipeline reconfiguration (other than spatial and temporal scaling I mentioned). When there are additional parameter changes at a splice, pipelines sometimes crash because that was the wrong assumption. There's no agreement between encoders or decoders on what changes can be handled seamlessly within an initialized configuration, or a way for a scripted player that knows a splice from a switch to tell the media pipeline via an MSE API (although I've seen a proposal). Most live streaming with ad insertion is accomplished today by re-encoding the splices or the whole stream so that the video bitstream is continuous and only relies on correct handling of the parameter changes required by adaptive switching. "Real" splicing of ads and programs is avoided because it crashes or isn't seamless.

One available solution is to encode inband parameters so that the parameters that change during adaptive switching are contained in each segment, and there is no need to append a header, except when additional parameters need to be changed at a splice (using the same decoder configuration to maintain seamless presentation). Type 1 players have more options because the code doing the decoding is also doing the manifest processing, switches and splices, so it can make better decisions on what segments to request so it can splice seamlessly and what parameters to change without rebuilding the media pipeline, resetting HDMI and HDCP, etc.

jyavenard commented 6 years ago

One available solution is to encode inband parameters so that the parameters that change during adaptive switching are contained in each segment, and there is no need to append a header, except when additional parameters need to be changed at a splice (using the same decoder configuration to maintain seamless presentation).

Most of the steps defined in the MSE spec specifically require an init segment to be appended. Firefox/Gecko, like Chrome, expects an init segment to be sent to signal a change of configuration. For H.264, that init segment doesn't have to contain an SPS/PPS; those can be inband. But the presence of a new init segment is required; otherwise we would consider the content to be invalid.
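
A minimal sketch of that append order, assuming hypothetical segment URLs and a generic fetch helper:

```ts
// Sketch only: fetchSegment() and the URLs are hypothetical.
async function fetchSegment(url: string): Promise<ArrayBuffer> {
  const resp = await fetch(url);
  return resp.arrayBuffer();
}

function appendAndWait(sb: SourceBuffer, data: ArrayBuffer): Promise<void> {
  return new Promise<void>(resolve => {
    sb.addEventListener('updateend', () => resolve(), { once: true });
    sb.appendBuffer(data);
  });
}

async function switchRendition(sb: SourceBuffer): Promise<void> {
  // A new init segment must be appended first, even when the SPS/PPS travel
  // inband (avc3); without it the byte stream is considered invalid.
  await appendAndWait(sb, await fetchSegment('video-720p-init.mp4'));
  await appendAndWait(sb, await fetchSegment('video-720p-seg1.m4s'));
}
```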

Some frameworks (like Windows Media Foundation, Android, or FFmpeg) can deal with inband changes just fine. Apple's VideoToolbox decoder can't: it only understands 'avc1' content, where the SPS/PPS must be provided out of band during initialisation of the decoder. As such, you need to know beforehand that the stream has changed; that's what the init segment helps with.

KilroyHughes commented 6 years ago

My main point was that a change in AVC/HEVC cropping parameters shouldn't be considered a "configuration change" that needs to be signaled with an initialization segment. The AVC/HEVC elementary stream parser should always reference the indexed parameter set in the stream, when present. That is how MPEG-4 Part 15 is specified ('avc3' and 'hev1' sample formats).

Above, I mentioned some reasons why inband parameter processing is important at the system level, but to summarize: it is important for low latency live streaming of splice conditioned content (like OTT TV) and for providing equivalent functionality to M2TS, which is currently more functional than ISOBMFF for those applications (because it always uses inband parameters and single initialization).

It is of practical importance to many services that the 'avc3' and 'hev1' elementary streams (modulo startcodes) can be repackaged on-the-fly between M2TS and ISOBMFF and both formats play splice conditioned streams as though they are continuous, i.e. with the same "configuration" instantiated based on the first init segment processed in what may be a long running "channel" of spliced content.

jyavenard commented 6 years ago

My main point was that a change in AVC/HEVC cropping parameters shouldn't be considered a "configuration change" that needs to be signaled with an initialization segment

Maybe not from your point of view, but seeing that some platform frameworks (Apple VT) do not support them, in effect they must be.

chcunningham commented 6 years ago

@KilroyHughes, much of your first post reads like a feature request for sourceBuffer.changeType(). Perhaps you've seen that proposal - it expands MSE's ability to seamlessly transition across codecs/containers.
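
A rough sketch of how that proposal would look in script (the codec string and names are illustrative, not part of the proposal text):

```ts
// Sketch: the app announces the codec/container change explicitly via
// changeType(), instead of the decoder discovering it inband. The HEVC
// codec string and the variable names are examples only.
function spliceToHevc(sb: SourceBuffer, hevcInitSegment: ArrayBuffer): void {
  sb.changeType('video/mp4; codecs="hvc1.1.6.L93.B0"');
  sb.appendBuffer(hevcInitSegment); // init segment for the spliced content
}
```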

+1 to jyavenard's Apple VT comment. Additionally...

In practice, an initialization segment is selected that will initialize decryption, decoding, display buffers, refresh rate, etc. at a high enough level to process all media segments in the Switching Set.

It's possible to craft an SPS that makes potentially breaking changes. For instance, the SPS could signal a changed profile_idc or new constraint_set flags. When these changes break decoder support, the pipeline needs to fall back to a different decoder. With an init segment, the pipeline is notified in advance and can perform fallback or notify the app that the new content is not supported. With inband, only the current decoder knows something changed, causing playback failure if it can't support the new config.
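
To illustrate the advance check an out-of-band config makes possible, here is a rough sketch of an app probing a new configuration before appending its init segment (the configuration values are examples):

```ts
// Sketch: probe a new profile/level via the Media Capabilities API before
// appending its init segment, so unsupported configs fail early instead of
// mid-playback. All values are illustrative.
async function canSwitchTo(codec: string): Promise<boolean> {
  const info = await navigator.mediaCapabilities.decodingInfo({
    type: 'media-source',
    video: {
      contentType: `video/mp4; codecs="${codec}"`,
      width: 3840,
      height: 2160,
      bitrate: 12_000_000,
      framerate: 60,
    },
  });
  return info.supported;
}
```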

For me this issue is primarily about avoiding that late failure. As it stands I think the spec suggests that inband SPS/PPS changes "should" be able to reconfigure the decoder, just as if the config had come from a new init segment. In practice this has never been implemented (which hopefully means no one is attempting to use it) and, for reasons above, I can't imagine we would implement it in the future. Low latency OTT is possible using init segments - folks have implemented this already.

KilroyHughes commented 6 years ago

Encoding must be constrained to decode and decrypt without configuration changes whether SPS adaptive encoding changes are signaled in the header or in movie fragments (inband). Changing the codec profile in the superset direction would break playback wherever the SPS/PPS was stored (sample entry, or inband prior to the first IDR NAL).

Content formats like CMAF specify the encoding constraints between Tracks for seamless adaptive switching with spatial and temporal scaling in a single presentation. CTA WAVE, DASH IF, ATSC, DVB, etc. specify encoding constraints for seamless playback of splices between presentations. Splice constraints today are very restrictive because there hasn't been a splice spec, so there's little consistency between devices as to what decoder, decryptor, and display changes in the content they must render seamlessly or what Header parameter changes they'll detect and respond to, and what kind of gap each change will cause.

In an MSE player, if a header is appended on every switch and splice, what happens at a splice isn't predictable. As a result, most OTT live streaming and ad insertion today is accomplished by re-encoding spliced 'avc1' content to approximate a single continuous file, aside from media timestamp discontinuities at most splices, which can be set via the MSE API.
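
For example, a splice-point timestamp discontinuity can be bridged in script roughly like this (names and values are illustrative):

```ts
// Sketch: the ad's media timestamps start at zero, so the app offsets them
// onto the presentation timeline before appending. Illustrative only.
function beginAdSplice(sb: SourceBuffer, spliceTimeSec: number): void {
  sb.timestampOffset = spliceTimeSec; // must be set while !sb.updating
  // ...then append the ad's init segment and media segments as usual...
}
```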

Re-encoding ("splice conditioning") is done to avoid reconfiguration of the decoder, decryptor, display interface, and MSE buffer; any of which takes too much time to be seamless. MSE players don't have huge video frame buffers like Apple Type 1 HLS players (several seconds of decoded pictures) that can maintain video output while the media pipeline is re-configured in response to an HLS discontinuity tag.

It would be better if a splice could be initiated by sourceBuffer.changeType() and regular Header append always interpreted as an adaptive switch with parameter changes restricted to spatial and temporal scaling.

Even without sourceBuffer.changeType() today, inband SPS eliminates the need to append a Header for adaptive switching, so Header appends can be reserved for splices and allowed to change parameters like timescale, trackID, default_KID, edit list, etc. that don't require decoder, decryptor, or display reconfiguration.

Almost all Apple streaming to date and all OTA broadcast streaming has used inband parameters wrapped in M2TS. M2TS with inband parameters is roughly half the adaptive streaming content on Android and Windows, and nearly all on Apple. Those elementary stream parsers are reading the inband video parameters. Inband parameters in M2TS proved their value for live streaming, ad insertion, etc. long before Microsoft introduced ISOBMFF files for adaptive streaming, and before MPEG specified the M2TS elementary stream format in ISOBMFF ('avc3' and 'hev1' sample formats).

If you are talking about the limited support for MSE in Safari for some whitelisted apps, I wouldn't recommend that as the design goal for MSE or ISOBMFF content. Apple is supporting CMAF (ISOBMFF) playback in their embedded HLS player, and supports seamless splicing of CMAF/CMAF and CMAF/M2TS signaled by a discontinuity tag and a new header URL. HLS splicing doesn't have the constraints that MSE does (except that HLS playback of CMAF requires Fragment start and end alignment at splice points and a common timeline origin for all Tracks, rather than allowing per Track or Switching Set presentationTimeOffsets).

Inband parameters work in all streaming use cases (assuming conformant AVC and HEVC stream parsers). There have been discussions in application consortia to require inband parameters to allow flexible workflow and reduce the extra engineering, test cases, etc. required to support parameter storage in file headers. Relying on parameters stored in the file header generally works for playing one file after another where some presentation gap at the splice point is acceptable. Video parameters stored only in the header can be problematic for live streaming and seamless ad insertion, especially programmatic ad insertion, and repackaging M2TS elementary streams, which is required in many ingest, HLS delivery, and cable network scenarios.

mwatson2 commented 6 years ago

@KilroyHughes The distinction you are trying to draw between "splicing" and ordinary stream switching is artificial. One person's "splice" is another person's stream switch and vice versa.

The point is just that the configuration must be delivered to the player ahead of the new stream data. Whether that is in-band or in an initialization segment is a detail where I think the most relevant considerations are the ones raised by @chcunningham.

Either way, the various considerations you mention must be dealt with. Creators must create content that can be played seamlessly by the players they target and player capabilities in this respect need to be known or discoverable. It would certainly be good to have minimum industry standards here and therefore raise the lowest common denominator, but this standard is not "CMAF single initialization", which doesn't even support protected adaptive streaming with both HD and UHD streams available.

mstattma commented 6 years ago

Looking at the original issue, isn't the opportunity here to constrain inband codec configuration to a) follow an initial initialization segment which sets up the pipeline and b) only signal changes to spatial and temporal scaling and/or profile subsets (i.e. prohibit changes in the superset direction), rather than to discourage its use?

That would enable optimizations to avoid codec re-inits for seamless switching and prevent late failure (with properly splice-conditioned content).

It would however be necessary to allow the player to detect such capability.

johnsim commented 6 years ago

I believe this thread would benefit from contributions from live linear streaming service engineers, broadcasters and vMVPDs, who have a variety of reasons for how they want/need to encode content.

I agree it is preferable to know up front what you can/can't play, and that signaling back from the decoder when an in-band change is encountered breaks proper layering, but it is also important to understand what these services actually require, and why.

We should vet any proposed encoding constraints with linear streaming engineers, to hear their real-world response. I will share this issue with the content spec task force in CTA WAVE, where this is a topic of discussion.

I will also add a few MCAPI issues which relate back to this topic.

haudiobe commented 6 years ago

@mstattma In CTA WAVE Device Playback, we have introduced exactly this concept: a CMAF Master Header that initializes the pipeline. During playback you can then either 1) append CMAF Headers or 2) use inband parameters.

MSE is not clear about initializing the pipeline such that changes can be done later. This may need to be addressed either in MSE directly, in the byte stream format, or in the media decoder. Today, it is not done anywhere consistently, which creates the problem. There needs to be a way in MSE to initialize the pipeline with an envelope of all (expected/upcoming) options. We use this assumption in CTA WAVE DCP.
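
As a sketch of what that envelope initialization might look like against today's MSE API (the master-header variable is hypothetical):

```ts
// Sketch only: masterHeader stands for a hypothetical CMAF Master Header
// whose parameters (profile, level, resolution, etc.) are a superset of
// every rendition that may follow.
async function initializePipeline(
  sb: SourceBuffer,
  masterHeader: ArrayBuffer,
): Promise<void> {
  sb.appendBuffer(masterHeader);
  await new Promise<void>(resolve =>
    sb.addEventListener('updateend', () => resolve(), { once: true }),
  );
  // Later appends would stay within the declared envelope: either further
  // CMAF Headers or media segments carrying inband parameters.
}
```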

JohnRiv commented 4 years ago

Noting that this is still of interest to CTA WAVE

wolenetz commented 4 years ago

Scope of the work involved seems unclear to me at the moment. Further discussion is appropriate before assigning milestone.

wolenetz commented 4 years ago

For instance, how might apps discover up front an implementation's ability (or lack thereof) to support a range of streams? Specifically, what parameter(s) to MediaSource.isTypeSupported(..) call(s) would assist apps in proactively understanding whether or not an initialization segment containing a "CMAF Master Header" (or whatever the more current abstraction is) is actually supported by the implementation?
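
Purely as a strawman, such a query might take a shape like the following; the `features` parameter is hypothetical and exists in no spec or implementation:

```ts
// Hypothetical: the "features" MIME parameter below is invented solely to
// illustrate the shape of the up-front capability query being asked about.
const probe =
  'video/mp4; codecs="avc3.640028"; features="cmaf-master-header"';
if (MediaSource.isTypeSupported(probe)) {
  // The implementation would be claiming it can reconfigure within the
  // envelope declared by a master header.
}
```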