pthopesch opened this issue 5 years ago
@skynavga You put a ttml3 label on this issue. Do you see ttml3 as a document that will still be published in 2019?
I'd like to provide some further details on this requirement.
The use cases described here are still valid: https://github.com/immersive-web/proposals/issues/39
The major aspect of this requirement is to standardize how TTML subtitles (defined in a 2D coordinate system) are referenced into a 3D environment. Spatial information could be described with two angles (azimuth, elevation) plus a depth value, or as a 3D vector (x, y, z).
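The two representations are interchangeable. A minimal sketch (function name and axis conventions are my own assumptions, not taken from TTML or ImAc) of converting (azimuth, elevation, depth) into a 3D vector:

```typescript
// Convert spherical subtitle coordinates to a Cartesian 3D vector.
// Assumed conventions: azimuth 0° = straight ahead, positive = left;
// elevation 0° = horizon, positive = up; depth = distance in metres.
// Right-handed axes: +x right, +y up, -z forward (WebXR/WebGL style).

const DEG = Math.PI / 180;

function sphericalToVector(
  azimuthDeg: number,
  elevationDeg: number,
  depth: number
): [number, number, number] {
  const az = azimuthDeg * DEG;
  const el = elevationDeg * DEG;
  const x = -depth * Math.cos(el) * Math.sin(az); // positive azimuth → left → -x
  const y = depth * Math.sin(el);
  const z = -depth * Math.cos(el) * Math.cos(az); // forward is -z
  return [x, y, z];
}
```

For example, `sphericalToVector(0, 0, 2)` places the subtitle 2 m straight ahead of the viewer, i.e. approximately (0, 0, -2) in these axes.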
Possible solutions: The current implementation in the ImAc project (where the use cases were investigated) puts a 2D plane into a 3D scene and uses that plane as the root container region for IMSC subtitles.
```xml
<tt:p xml:id="p11" region="R1" style="S2" begin="00:01:50.080" end="00:01:51.160"
      imac:equirectangularLong="20">
  <tt:span style="S3">(David) Can you hear us?</tt:span>
</tt:p>
```
The value of "imac:equirectangularLong" describes the longitude angle of a speaker, where "0" refers to the center of the corresponding equirectangular video. Value range: [-180, 180]; positive values are left of the center.
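Under that convention, the attribute maps directly onto a horizontal pixel column of the equirectangular frame. A hypothetical helper (the function name is mine; the mapping assumes +180° is the left edge of the frame):

```typescript
// Map an imac:equirectangularLong value to a pixel column in an
// equirectangular video frame (helper name is hypothetical).
// long = 0 → frame centre; long = 180 → left edge; long = -180 → right edge.
function longToPixelX(longDeg: number, frameWidth: number): number {
  if (longDeg < -180 || longDeg > 180) throw new RangeError("longitude out of range");
  return (0.5 - longDeg / 360) * frameWidth;
}
```

For a 3600-pixel-wide frame, the example value of 20 lands at column 1600, i.e. 200 pixels left of centre.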
```xml
<tt:p xml:id="p11" region="R1" style="S2" begin="00:01:50.080" end="00:01:51.160"
      imac:equirectangularLong="20" imac:equirectangularLat="5">
  <tt:span style="S3">(David) Can you hear us?</tt:span>
</tt:p>
```
In this case, the two values provided by "imac:equirectangularLong" and "imac:equirectangularLat" describe the latitude and longitude of the center of the 2D subtitle plane.
Note: An exact definition of such spatial values is needed. For instance, they could refer to the speaker, to a position where the subtitle should be rendered (which would not be the speaker's face), or to a point (e.g. the center) of a 2D rendering plane. In our current implementation, the distinction was irrelevant.
The Timed Text Working Group just discussed Support 3D space (360°/VR/XR) as target presentation environment tt-reqs#8, and agreed to the following:
SUMMARY: Group discussed this, generally supportive, contingent on effort being available to make it happen.
Update: @tairt has volunteered to take the lead on this and offers editor capacity.
Recent discussions with stakeholders and experts have shown that the requirements reflected in this issue are important and need to be addressed, but also that their scope and details need further discussion in a broader context. @TrevorFSmith has scheduled a call of the W3C Immersive Web Community Group on May 21st where requirements for subtitles in immersive environments will be discussed (see https://github.com/immersive-web/proposals/issues/40 for the discussion).
In the light of this development I am confident that some first specification text can still be drafted this year, but it will be too early for a final publication by the end of 2019.
Thank you for the update @tairt - very helpful.
See the discussion on the issue proposal of the WebXR community group (https://github.com/immersive-web/proposals/issues/40#issuecomment-496124420 and downwards) for the discussion after the remote meeting with the XR community group.
The latest research has shown that the most urgent requirement is a display of subtitles that is always in the user's field of view (see https://github.com/immersive-web/proposals/issues/40#issuecomment-511315747). This would require static rendering of subtitles on a 2D plane and is largely already supported by IMSC. What may be missing is to define this as a requirement for XR devices, where the perspective of the content changes (through user movement) but the subtitles need to stay "fixed to the screen".
Another requirement that is not yet met is the addition of metadata locating the position of the audio source on the horizontal radius. This is needed to guide the user in which direction they need to turn to see the audio source/speaker of a displayed subtitle.
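That guidance logic can be sketched as a pure function (names and the field-of-view parameter are my own, not from any proposal): given the viewer's current yaw and the source's longitude, compute the signed angular difference and decide whether to show a left arrow, a right arrow, or none.

```typescript
// Decide which guidance arrow (if any) to show for an off-screen speaker.
// Both angles use the convention of the ImAc example above: degrees in
// [-180, 180], positive = left of centre. halfFovDeg is half the
// horizontal field of view (an assumed parameter).
function guidanceArrow(
  viewerYawDeg: number,
  sourceLongDeg: number,
  halfFovDeg: number
): "left" | "right" | "none" {
  // Signed shortest-path difference, wrapped to (-180, 180].
  let diff = sourceLongDeg - viewerYawDeg;
  while (diff > 180) diff -= 360;
  while (diff <= -180) diff += 360;
  if (Math.abs(diff) <= halfFovDeg) return "none"; // speaker is on screen
  return diff > 0 ? "left" : "right"; // positive = source is to the viewer's left
}
```

The wrap-around step matters: with the viewer at +170° and the source at -170°, the shortest path is 20° further left, not 340° to the right.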
If there is an agreement on these requirements I think that they are now sufficiently scoped and the work on an explainer and the module can start.
@tairt have you considered using `tta:pan` to specify "the position of the audio source on the horizontal radius"?
Thanks @skynavga for the pointer. If I understand `tta:pan` correctly, it is used to position audio from full left to full right pan.
The requirement we have is actually to express the geographical position of an object in a 360° video environment that relates to the subtitle. It could be expressed, for example, as a longitude value (borrowing from the geographical coordinate system).
With this data a presentation processor could render help icons (e.g. arrows) to point the user to the audio source of a subtitle when it is not visible in the picture. See below two examples from the ImAc project of how this could be rendered:
`tta:pan` uses a 2D stereoscopic pan function based on the Web Audio `StereoPannerNode` interface; a more generalised `PannerNode` interface with 3D positional coordinates is also available.
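For reference, the equal-power algorithm behind `StereoPannerNode` (as specified for a mono input in the Web Audio specification) reduces to two gain factors; here it is sketched as plain math rather than an actual audio graph:

```typescript
// Equal-power stereo pan gains for a mono source, following the
// StereoPannerNode algorithm in the Web Audio specification.
// pan ∈ [-1, 1]: -1 = full left, 0 = centre, +1 = full right.
function stereoPanGains(pan: number): { left: number; right: number } {
  const x = (Math.min(1, Math.max(-1, pan)) + 1) / 2; // map pan to [0, 1]
  return {
    left: Math.cos((x * Math.PI) / 2),
    right: Math.sin((x * Math.PI) / 2),
  };
}
```

At `pan = 0` both gains are cos(π/4) ≈ 0.707, so the total power stays constant as the source sweeps across the stereo field; note this is a purely left/right control, which is why it cannot express a full 360° position on its own.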
@nigelmegitt Do you think that this vocabulary would best fit the requirement? Keep in mind that the actual audio would be in a lot of cases not "spatial". The information will be needed to locate the part of the video image a subtitle relates to. This will be in most cases the placement of the speaker in the video picture. So in that sense the term "audio source" may be ambigious as it not really about the position of the audio but the graphical representation of an object that in the context of the "story" is assumed to be the originator of the audio.
@tairt I was including it for completeness, thinking that it may be something we want to add as an additional feature later. My mental model here is a two stage process:
In other words, once you have resolved 2, then using `PannerNode` would be appropriate, possibly in combination with a `GainNode` to simulate the effect of distance. I agree it is unlikely to be adequate by itself if you want to defer the presentation decisions to presentation time (which I think we must do).
Note also that there is a new W3C community group proposal for immersive captions (https://www.w3.org/community/groups/proposed/).
The Timed Text Working Group just discussed 360º Subtitles, and agreed to the following:
SUMMARY: This issue on hold for the time being; pick it up again when there's a concrete proposal e.g. from a CG.
In the past years, more and more applications have appeared that show media content in 3D space, such as 360° videos (stereoscopic or not), VR experiences, etc. Subtitles (if present) are mostly shown at the bottom center of the current field of view. Another option sometimes used is to burn them into the video at three different positions (bottom, evenly spaced) so that one of the three subtitles is always at least partly visible to the viewer.
I think that there is more to subtitle representation in 360°/VR/XR than that. We investigated subtitles for 360° in the ImAc project (imac-project.eu).
As of today, an established way of presenting subtitles in 360°/VR/XR does not exist. This requirement is still a very general one and needs further study.
A standardized solution would be great. There is already some activity in MPEG (MPEG-I, OMAF: https://mpeg.chiariglione.org/standards/mpeg-i/omnidirectional-media-format). The topic was discussed during the last TPAC meeting, and as a follow-up action I created two issues in the W3C XR Community Group: 1) a use case description for subtitles in 360° videos (https://github.com/immersive-web/proposals/issues/39) and 2) an overview of requirements (https://github.com/immersive-web/proposals/issues/40).