tinskip opened this issue 7 years ago
If we do this, we should probably unify this support with the existing support for handling text tracks with kind=metadata, since they are similar mechanisms (text tracks are better for representing states that change; emsg and the like are better for handling discrete events that happen).
Voicing support for this feature request on behalf of the DASH Industry Forum, Akamai and the CTA WAVE project. WAVE specifically is interested in establishing a reliable in-band messaging workflow around EMSG with CMAF.
With MSEv1, JS players must parse incoming segments to look for embedded EMSG boxes. We would like a cleaner implementation in which the SourceBuffer performs all box-parsing operations (since it is already parsing the incoming segments), freeing the JS application to manage only the logic of handling the events.
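To illustrate the burden described above, here is a minimal sketch of what a JS player has to do today under MSEv1: scan each appended ISO BMFF segment for top-level `emsg` boxes before handing the bytes to `SourceBuffer.appendBuffer()`. The function name and structure are illustrative, not from any spec; 64-bit and to-end box sizes are omitted for brevity.

```javascript
// Scan a segment (Uint8Array) for top-level 'emsg' boxes.
// Illustrative only: sizes 0 (to end of file) and 1 (64-bit largesize)
// are treated as malformed here to keep the sketch short.
function findEmsgBoxes(segment) {
  const view = new DataView(segment.buffer, segment.byteOffset, segment.byteLength);
  const boxes = [];
  let offset = 0;
  while (offset + 8 <= view.byteLength) {
    const size = view.getUint32(offset); // 32-bit box size
    const type = String.fromCharCode(
      view.getUint8(offset + 4), view.getUint8(offset + 5),
      view.getUint8(offset + 6), view.getUint8(offset + 7));
    if (size < 8) break; // malformed or unsupported size; stop scanning
    if (type === 'emsg') {
      boxes.push(segment.subarray(offset, offset + size));
    }
    offset += size;
  }
  return boxes;
}
```

Every byte appended has to pass through a loop like this in application code, duplicating parsing work the UA's segment parser is already doing.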
Will Law
This was also discussed at TPAC 2019 when I raised the idea of exposing subtitles/captions through MSE, and as I recall @mwatson2 was also enthusiastic.
The DataCue proposal (see explainer) in WICG intends to support the DASH emsg and SCTE-35 use cases, as an extension of the existing timed text track support in HTML.
Is this feature request addressed by the DataCue proposal?
In part, yes. DASH emsg is part of DataCue, but media-embedded captions/subtitles are not.
Exposing subtitles/captions through MSE is a proposed topic for the upcoming joint meeting on October 15 between Timed Text WG, Media & Entertainment IG, and Media WG (agenda).
We need further information and a concrete proposal for how to expose subtitles/captions through MSE (along with DataCue support for EMSG). Please assist the editors in this regard.
In more detail, relative to these slides expected to be discussed at TPAC tomorrow, what is the precise mapping of the content in emsg to the content in the proposed DataCue?
In particular (and not limited to): what must a UA do to interoperably determine the following when encountering an emsg? Note that emsg is a top-level box, at least as presented in those slides, and in the CMAF-specific version in those slides there can be any number of emsg boxes associated with a CMAF chunk (which is proposed to be a CMAF media segment). If emsg processing is supported only in that scenario, the PTS delta could presumably be determined. However, since none of the preceding top-level CMAF chunk boxes are required at cardinality >= 1, one or more emsg boxes could be partially appended by the JS app before the UA's MSE implementation recognizes that it is "parsing a media segment" and restricts MSE operations like setTimestampOffset. Therefore, it could be nondeterministic what timestamp offset is applied to the emsg PTS delta to determine the start/end times in the generated DataCue.
The emsg has a start time value in its body. So while the parse time of an emsg may vary, the earliest presentation time of the chunk carrying the emsg (for emsg v0) or the earliest time of the media presentation (for emsg v1) is known; that information should be available to the MSE implementation as it parses the chunk that carries the emsg, together with the media presentation start time. The start and end times of the DataCue can therefore be precisely calculated by the MSE implementation.
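The v0/v1 timing mapping described above could be sketched as follows. Field names (`timescale`, `presentation_time_delta`, `presentation_time`, `event_duration`) follow the emsg definition in ISO/IEC 23009-1; `anchorTime` is an assumed parameter standing in for the earliest presentation time of the carrying chunk, which the comment argues the MSE implementation already knows.

```javascript
// Map emsg box fields to DataCue-style start/end times (seconds).
// anchorTime: earliest presentation time of the chunk carrying the emsg.
function emsgToCueTimes(emsg, anchorTime) {
  let start;
  if (emsg.version === 0) {
    // v0: presentation_time_delta is relative to the carrying chunk
    start = anchorTime + emsg.presentation_time_delta / emsg.timescale;
  } else {
    // v1: presentation_time is absolute on the media timeline
    start = emsg.presentation_time / emsg.timescale;
  }
  const end = start + emsg.event_duration / emsg.timescale;
  return { start, end };
}
```

For example, a v0 emsg with a delta of 90000 at timescale 90000, carried in a chunk starting at t=10s, yields a cue starting at t=11s.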
@wolenetz Following up from the TPAC conversation with respect to emsg timing relation and MSE API functionality:

Since emsg can be arbitrarily placed from a generic ISO BMFF profile perspective, this may be a case where stricter UA processing guidelines for CMAF profiles of ISO BMFF could be utilized to guarantee behavior consistency. I.e., if an appended init segment is recognized by the UA as having CMAF-conforming signals, it would then be able to enforce guarantees of delivery based on well-formed structures. Alternatively, you could just call the behavior of non-conforming streams unspecified; not sure how that approach is typically taken.

Regarding emsg timing being related to the timing of the next appended moof: if the region the emsg is mapped to is spliced out, I concur that the emsg would be removed as part of that splice and not surfaced; expecting anything else is too inconsistent.

FYI, MPEG is (still) working on a spec to put events into tracks, with media-related timing, etc.
Discussion at TPAC 2022 (minutes):
This topic has been discussed in the DASH-IF Events Task Force, from the 30 Sep 2022 meeting: more implementation experience using player libraries such as DASH.js is needed before being ready to pursue MSE integration.
There are various types of events / cue points which may be encoded into media containers. Examples of these are MPEG-DASH 'emsg' boxes and SCTE-35 'tones' in MPEG2-TS. Processing of media-embedded captions/subtitles might be another use case. Currently, player applications have to parse media streams in order to be aware of these events and retrieve the data encapsulated within them, which is inefficient at best. With the advent of MSE, the player app should not have to be aware of media container internals.
Adding support for these events in MSE would probably be best in the form of JavaScript events, perhaps with the ability to register handlers for specific event types. Ideally the data sent to the handler would be parsed into some type of event-specific message, or perhaps a dictionary; but even receiving just the raw box data would be an improvement.
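The "parsed into a dictionary" idea above could look like the following sketch, which decodes a version-1 `emsg` box (field layout per ISO/IEC 23009-1) from raw bytes into a plain object. This is a hand-rolled illustration of what a UA or library might hand to an event handler, not an existing API; error handling and v0 support are omitted.

```javascript
// Decode a version-1 'emsg' box into a dictionary of its fields.
// Layout after the 8-byte box header: version(1) flags(3) timescale(4)
// presentation_time(8) event_duration(4) id(4), then two null-terminated
// strings (scheme_id_uri, value), then message_data to the end of the box.
function parseEmsgV1(box) {
  const dv = new DataView(box.buffer, box.byteOffset, box.byteLength);
  if (dv.getUint8(8) !== 1) throw new Error('only emsg v1 handled in this sketch');
  const timescale = dv.getUint32(12);
  // 64-bit presentation_time; Number is adequate for realistic media times
  const presentationTime = Number(dv.getBigUint64(16));
  const eventDuration = dv.getUint32(24);
  const id = dv.getUint32(28);
  let offset = 32;
  const readCString = () => {
    let end = offset;
    while (end < box.byteLength && box[end] !== 0) end++;
    const s = String.fromCharCode(...box.subarray(offset, end));
    offset = end + 1; // skip the null terminator
    return s;
  };
  const schemeIdUri = readCString();
  const value = readCString();
  return { timescale, presentationTime, eventDuration, id,
           schemeIdUri, value, messageData: box.subarray(offset) };
}
```

An event handler receiving such a dictionary could dispatch on `schemeIdUri`/`value` without touching container internals, which is the workflow the comment argues for.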