w3c / media-and-entertainment

Repository for the Media and Entertainment Interest Group

Compositable Video Content #103

Open AdamSobieski opened 11 months ago

AdamSobieski commented 11 months ago

In addition to sets of video tracks in containers and streaming solutions representing alternative renditions of content, let us consider describing sets of video tracks where those tracks are intended to be blended together by clients via alpha compositing.

Alpha compositing is, today, most often performed server-side, by video production software, before content is streamed or broadcast.

Uses of client-side video layer compositing would include:

  1. Some video layers could be delivered with a variable bitrate, e.g., news anchors speaking, while other video layers could be delivered with an assured fidelity, e.g., text overlays, lower third content, and news tickers.

  2. Video players could automatically select which video layers to composite together on behalf of end-users. A client-side system could, for instance, automatically select a foreground layer from a set of alternatives based on a user's preferred language.

  3. Web pages could allow end-users to make use of menus to dynamically select which video layers to composite together. End-users could, for instance, select which variety of news that they would prefer for a news ticker composited atop other video content.

  4. Client-side compositable video layers could be a part of solutions for enabling interactive time-based hypermedia, e.g., hypervideo. Foreground layers could be capable of having hypertext selected from them or of otherwise being interacted with, e.g., clicked upon. As considered, this type of video layer might provide bounding boxes or silhouettes, perhaps with identifiers unique to the video container or stream.
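As a minimal, purely illustrative sketch of the client-side blending itself, assuming today's web APIs rather than any container- or protocol-level support, two independently delivered `<video>` layers could be drawn onto a canvas, with the foreground layer assumed to carry an alpha channel (e.g., VP9 in WebM with alpha); the element ids below are hypothetical:

```ts
// Illustrative sketch only: blend a foreground video layer (assumed to carry
// an alpha channel, e.g. VP9 in WebM with alpha) over a background layer.
const background = document.getElementById('bg-layer') as HTMLVideoElement; // hypothetical ids
const foreground = document.getElementById('fg-layer') as HTMLVideoElement;
const canvas = document.getElementById('composite') as HTMLCanvasElement;
const ctx = canvas.getContext('2d')!;

function drawFrame(): void {
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  // The default 'source-over' operation performs standard alpha compositing.
  ctx.drawImage(background, 0, 0, canvas.width, canvas.height);
  ctx.drawImage(foreground, 0, 0, canvas.width, canvas.height);
  requestAnimationFrame(drawFrame);
}

Promise.all([background.play(), foreground.play()]).then(() => {
  requestAnimationFrame(drawFrame);
});
```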

A useful implementation concept could be that of "trees of video tracks". With such trees, we could express sets of video tracks for alpha compositing where some of these layers could have alternative renditions or reflowed renditions. Recursively, some of these layers could be intended to be rendered using client-side video layer compositing.
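As a hypothetical data-model sketch only, not drawn from any existing container format, such a tree might distinguish nodes whose children are alternative renditions from nodes whose children are layers to be composited, with layers themselves possibly being subtrees:

```ts
// Hypothetical sketch of a "tree of video tracks": interior nodes either
// offer alternative renditions (pick one) or compositable layers (blend all,
// bottom to top); leaves reference actual video tracks.
type VideoTrackNode = TrackLeaf | AlternativeSet | CompositeStack;

interface TrackLeaf {
  kind: 'track';
  trackId: string;       // identifier within the container or stream
  language?: string;      // e.g. used to auto-select a foreground layer
}

interface AlternativeSet {
  kind: 'alternatives';   // client selects exactly one child
  children: VideoTrackNode[];
}

interface CompositeStack {
  kind: 'composite';      // client alpha-composites all children, in order
  children: VideoTrackNode[];
}
```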

Perhaps, one day, video containers and streaming technologies will be able to combine client-side compositability with reflowability to provide video producers and consumers with new features and functionalities.

Thank you. I look forward to discussing these ideas with you.

ingararntzen commented 11 months ago

Hi @AdamSobieski

Please be aware that there is another approach, whereby video composition could be achieved without extensive changes to the media player. Similarly, reflowability could be accomplished without changes to the distribution protocol.

The timing object proposal was also motivated by very similar use cases. The proposal is a bit old now, but the key idea remains relevant.

The key change required by the timing object proposal would be for media players to be able to sync to a timing object. The Vimeo player now supports this, I think. When sync is supported, the remaining challenges are less daunting.
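Roughly, the idea of syncing a media element to a timing object can be sketched as follows; this is simplified for illustration, the `query()` vector follows the shape described in the Timing Object draft, and real media-sync implementations do far more careful correction:

```ts
// Minimal shape of a timing object as described in the Timing Object draft;
// real implementations provide this and much more.
interface TimingObjectLike {
  query(): { position: number; velocity: number; acceleration: number };
}

// Simplified, illustrative correction loop: keep a video element aligned with
// a shared timing object by seeking on large drift and nudging playbackRate
// on small drift.
function syncToTimingObject(video: HTMLVideoElement, timing: TimingObjectLike): void {
  setInterval(() => {
    const { position, velocity } = timing.query();
    const drift = video.currentTime - position;
    if (Math.abs(drift) > 1.0) {
      video.currentTime = position;                 // hard correction: seek
    } else {
      video.playbackRate = velocity - drift * 0.5;  // soft correction
    }
  }, 200);
}
```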

AdamSobieski commented 11 months ago

Thank you @ingararntzen for sharing the Timing Object proposal. I particularly like the following use cases and requirements:

  1. Social Viewing and Media Control
  2. Online Education - Timed Multi-device Web Presentations
  3. Multi-screen Data Visualization

nigelmegitt commented 11 months ago

Regarding:

other video layers could be delivered with an assured fidelity, e.g., text overlays, lower third content, and news tickers.

and

Foreground layers could be capable of having hypertext selected from them or of otherwise being interacted with, e.g., clicked upon. As considered, this type of video layer might provide bounding boxes or silhouettes, perhaps with identifiers unique to the video container or stream.

it's worth noting that the HTML model of the <video> element does not support compositing child content elements. As the spec says:

Content may be provided inside the video element. User agents should not show this content to the user; it is intended for older web browsers which do not support video, so that text can be shown to the users of these older browsers informing them of how to access the video contents.

Video players currently have to work around this by providing content in a separate part of the DOM and using styling to position that content over the video. In that sense the <video> element is unlike other content elements, arguably unhelpfully.
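As a rough sketch of that workaround, with element names and content purely illustrative, the overlay is ordinary DOM content positioned over the video rather than a child of it:

```ts
// Sketch of the common workaround: the caption/graphics overlay is ordinary
// DOM content absolutely positioned over the <video>, not a child of it.
const wrapper = document.createElement('div');
wrapper.style.position = 'relative';

const video = document.createElement('video');
video.src = 'programme.webm'; // hypothetical source
wrapper.appendChild(video);

const overlay = document.createElement('div');
overlay.style.position = 'absolute';
overlay.style.inset = '0';
overlay.style.pointerEvents = 'none'; // let clicks fall through to the video
overlay.innerHTML = '<p class="lower-third">Hypothetical lower-third text</p>';
wrapper.appendChild(overlay);

document.body.appendChild(wrapper);
```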

I think the use case is understandable, but I would suggest that delivering text with an "assured fidelity" can be an accessibility anti-pattern. It would be better to take advantage of the web platform to allow client-side rendering, especially of text but also of styled graphics, over the video, so that their contents and semantics are visible to assistive technology, and so that users can modify the presentation to meet their accessibility needs, e.g. by using a stylesheet.

AdamSobieski commented 10 months ago

Thank you @nigelmegitt. Accessibility is important to consider and compositable layers of hypervideo could be one approach to delivering these features.

@ingararntzen, I thought of some ideas combining HTML5, reflowable video, compositable video, and multi-video time synchronization. I recently found resizable panels (demos available here), and we can imagine videos in some or all of the resizable panels of dynamic multi-panel layouts. These videos would each contribute to, and combine to provide, the resultant user experience.
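As a rough sketch only, assuming plain CSS-resizable panels rather than any particular panel library, and with ids and sources hypothetical, each panel could simply be a resizable container holding one video:

```ts
// Hypothetical sketch: a row of user-resizable panels, each holding one video.
// The CSS 'resize' property gives only basic resizing; panel libraries add more.
function addPanel(container: HTMLElement, src: string): HTMLVideoElement {
  const panel = document.createElement('div');
  panel.style.resize = 'horizontal';
  panel.style.overflow = 'hidden';
  panel.style.flex = '1 1 0';

  const video = document.createElement('video');
  video.src = src; // hypothetical sources
  video.style.width = '100%';
  panel.appendChild(video);
  container.appendChild(panel);
  return video;
}

const row = document.createElement('div');
row.style.display = 'flex';
document.body.appendChild(row);

const videos = ['angle1.webm', 'angle2.webm'].map((src) => addPanel(row, src));
// Each of these videos could then be kept on a common timeline, e.g. by syncing
// each one to a shared timing object as sketched earlier in this thread.
```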

ingararntzen commented 10 months ago

@AdamSobieski Thanks. Yes, I agree. This could be quite a powerful way of presenting, and it could also be used to combine videos from different streaming protocols into the same experience. My research focus is on control mechanisms for Web experiences, so I'm thinking that the content provider (not only the viewer) should be able to control how panels are resized during a live broadcast, in order to provide a lean-back experience. For instance, in sports, during a race incident, the provider might like to give priority to certain video angles.

This approach would also be quite webby, as it allows panel resizing and panel-to-content mapping to be addressed as separate issues and then simply interconnected in the interface.
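As a purely hypothetical sketch of that separation, the provider could broadcast small layout messages on some out-of-band channel, and the client could apply them to panels independently of which content is mapped into each panel:

```ts
// Purely hypothetical layout message broadcast by the provider during a live event.
interface LayoutMessage {
  effectiveTime: number;                            // when the layout should take effect
  panels: { panelId: string; flexGrow: number }[];  // relative panel sizes
}

// The client applies the layout to panels, independently of which content
// (video, graphics, data) happens to be mapped into each panel.
function applyLayout(msg: LayoutMessage): void {
  for (const { panelId, flexGrow } of msg.panels) {
    const panel = document.getElementById(panelId);
    if (panel) panel.style.flexGrow = String(flexGrow);
  }
}
```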