tinskip opened this issue 7 years ago
If we do this, we should probably unify this support with the existing support for handling text tracks with kind=metadata, since they are similar mechanisms (text tracks are better for representing states that change; emsg and the like are better for handling discrete events that happen).
Voicing support for this feature request on behalf of the DASH Industry Forum, Akamai and the CTA WAVE project. WAVE specifically is interested in establishing a reliable in-band messaging workflow around EMSG with CMAF.
With MSEv1, JS players must parse incoming segments to look for embedded EMSG boxes. We would like a cleaner implementation in which the SourceBuffer performs all box-parsing operations (since it is already parsing the incoming segments), freeing the JS application to manage only the logic of handling the events.
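To illustrate the burden described above, here is a minimal sketch of what a JS player has to do today under MSEv1: scan each appended ISO BMFF segment for top-level `emsg` boxes before handing the bytes to `SourceBuffer.appendBuffer()`. The function name and structure are illustrative, not from any spec; 64-bit and to-end box sizes are omitted for brevity.

```javascript
// Scan a segment (Uint8Array) for top-level 'emsg' boxes.
// Illustrative only: sizes 0 (to end of file) and 1 (64-bit largesize)
// are treated as malformed here to keep the sketch short.
function findEmsgBoxes(segment) {
  const view = new DataView(segment.buffer, segment.byteOffset, segment.byteLength);
  const boxes = [];
  let offset = 0;
  while (offset + 8 <= view.byteLength) {
    const size = view.getUint32(offset); // 32-bit box size
    const type = String.fromCharCode(
      view.getUint8(offset + 4), view.getUint8(offset + 5),
      view.getUint8(offset + 6), view.getUint8(offset + 7));
    if (size < 8) break; // malformed or unsupported size; stop scanning
    if (type === 'emsg') {
      boxes.push(segment.subarray(offset, offset + size));
    }
    offset += size;
  }
  return boxes;
}
```

Every byte appended has to pass through a loop like this in application code, duplicating parsing work the UA's segment parser is already doing.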
Will Law
This was also discussed at TPAC 2019 when I raised the idea of exposing subtitles/captions through MSE, and as I recall @mwatson2 was also enthusiastic.
The DataCue proposal (see explainer) in WICG intends to support the DASH emsg and SCTE-35 use cases, as an extension of the existing timed text track support in HTML.
Is this feature request addressed by the DataCue proposal?
In part, yes. DASH emsg is part of DataCue, but media-embedded captions/subtitles are not.
Exposing subtitles/captions through MSE is a proposed topic for the upcoming joint meeting on October 15 between Timed Text WG, Media & Entertainment IG, and Media WG (agenda).
We need further information and a concrete proposal for how to expose subtitles/captions through MSE (along with DataCue support for EMSG). Please assist the editors in this regard.
In more detail, relative to these slides expected to be discussed at TPAC tomorrow, what is the precise mapping of the content in emsg to the content in the proposed DataCue?
In particular (and not limited to): what must a UA do to interoperably determine the following when encountering an emsg? Note that emsg is a top-level box, at least as presented in those slides, and in the CMAF-specific version in those slides there can be any number of emsg boxes associated with a CMAF chunk (which is proposed to be a CMAF media segment). If emsg processing is supported only in that scenario, the PTS delta could presumably be determined. However, since none of the preceding top-level CMAF chunk boxes are required at cardinality >= 1, one or more emsg boxes could be partially appended by the JS app before the UA's MSE implementation recognizes that it is "parsing a media segment" and restricts MSE operations like setTimestampOffset. Therefore, it could be nondeterministic what timestamp offset is applied to the emsg PTS delta to determine the start/end times in the generated DataCue.
The emsg has a start time value in its body. So while the parse time of an emsg may vary, the earliest presentation time of the chunk carrying the emsg (for emsg v0) or the earliest time of the media presentation (for emsg v1) is known; that information should be available to the MSE implementation as it parses the chunk that carries the emsg, together with the media presentation start time. The start and end times of the DataCue can therefore be precisely calculated by the MSE implementation.
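The v0/v1 timing mapping described above could be sketched as follows. Field names (`timescale`, `presentation_time_delta`, `presentation_time`, `event_duration`) follow the emsg definition in ISO/IEC 23009-1; `anchorTime` is an assumed parameter standing in for the earliest presentation time of the carrying chunk, which the comment argues the MSE implementation already knows.

```javascript
// Map emsg box fields to DataCue-style start/end times (seconds).
// anchorTime: earliest presentation time of the chunk carrying the emsg.
function emsgToCueTimes(emsg, anchorTime) {
  let start;
  if (emsg.version === 0) {
    // v0: presentation_time_delta is relative to the carrying chunk
    start = anchorTime + emsg.presentation_time_delta / emsg.timescale;
  } else {
    // v1: presentation_time is absolute on the media timeline
    start = emsg.presentation_time / emsg.timescale;
  }
  const end = start + emsg.event_duration / emsg.timescale;
  return { start, end };
}
```

For example, a v0 emsg with a delta of 90000 at timescale 90000, carried in a chunk starting at t=10s, yields a cue starting at t=11s.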
@wolenetz Following up from the TPAC conversation with respect to emsg timing relation and MSE API functionality:

Since emsg can be arbitrarily placed from a generic ISO BMFF profile perspective, this may be a case where stricter UA processing guidelines for CMAF profiles of ISO BMFF could be utilized to guarantee behavior consistency. I.e., if an appended init segment is recognized by the UA as having CMAF-conforming signals, it would then be able to enforce guarantees of delivery based on well-formed structures. Alternatively, you could just call the behavior of non-conforming streams unspecified; not sure how that approach is typically taken.

Regarding emsg timing being related to the timing of the next appended moof: if the region the emsg is mapped to is spliced out, I concur that the emsg would be removed as part of that splice and not surfaced; expecting anything else is too inconsistent.

FYI, MPEG is (still) working on a spec to put events into tracks, with media-related timing, etc.
Discussion at TPAC 2022 (minutes):
This topic has been discussed in the DASH-IF Events Task Force, from the 30 Sep 2022 meeting: more implementation experience using player libraries such as DASH.js is needed before being ready to pursue MSE integration.
There are various types of events / cue points which may be encoded into media containers. Examples of these are MPEG-DASH 'emsg' boxes and SCTE-35 'tones' in MPEG2-TS. Processing of media-embedded captions/subtitles might be another use case. Currently, player applications have to parse media streams in order to be aware of these events and retrieve the data encapsulated within them, which is inefficient at best. With the advent of MSE, the player app should not have to be aware of media container internals.
Adding support for these events in MSE would probably be best in the form of JavaScript events, perhaps with the ability to register handlers for specific event types. Ideally the data sent to the handler would be parsed into some type of event-specific message, or perhaps a dictionary; but even receiving just the raw box data would be an improvement.
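The "parsed into a dictionary" idea above could look like the following sketch, which decodes a version-1 `emsg` box (field layout per ISO/IEC 23009-1) from raw bytes into a plain object. This is a hand-rolled illustration of what a UA or library might hand to an event handler, not an existing API; error handling and v0 support are omitted.

```javascript
// Decode a version-1 'emsg' box into a dictionary of its fields.
// Layout after the 8-byte box header: version(1) flags(3) timescale(4)
// presentation_time(8) event_duration(4) id(4), then two null-terminated
// strings (scheme_id_uri, value), then message_data to the end of the box.
function parseEmsgV1(box) {
  const dv = new DataView(box.buffer, box.byteOffset, box.byteLength);
  if (dv.getUint8(8) !== 1) throw new Error('only emsg v1 handled in this sketch');
  const timescale = dv.getUint32(12);
  // 64-bit presentation_time; Number is adequate for realistic media times
  const presentationTime = Number(dv.getBigUint64(16));
  const eventDuration = dv.getUint32(24);
  const id = dv.getUint32(28);
  let offset = 32;
  const readCString = () => {
    let end = offset;
    while (end < box.byteLength && box[end] !== 0) end++;
    const s = String.fromCharCode(...box.subarray(offset, end));
    offset = end + 1; // skip the null terminator
    return s;
  };
  const schemeIdUri = readCString();
  const value = readCString();
  return { timescale, presentationTime, eventDuration, id,
           schemeIdUri, value, messageData: box.subarray(offset) };
}
```

An event handler receiving such a dictionary could dispatch on `schemeIdUri`/`value` without touching container internals, which is the workflow the comment argues for.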