immersive-web / webxr

Repository for the WebXR Device API Specification.
https://immersive-web.github.io/webxr/

Spec language precludes non-visual uses #815

Open · ddorwin opened this issue 4 years ago

ddorwin commented 4 years ago

There are XR use cases (e.g., "audio AR") that could build on poses and other capabilities exposed by core WebXR (and future extensions). The current spec language, though, appears to require visual devices. The superficial issues can probably be addressed with a bit of rewording, though there may be some more complex issues as well.

Some of the most obvious examples revolve around the word "imagery":

More complex issues might include the XR Compositor, assumptions about XRWebGLLayer, and the definition of and/or assumptions about XRView.

While AR is out of scope for the first version of the core spec, it would be nice if the definitions weren’t technically incompatible with such use cases and form factors.

frastlin commented 4 years ago

I am building maps utilizing VR audio. At this point it is 3D web audio, but when WebXR is enabled, I will be connecting pose data, localizing the user's position into the VR map, and connecting that to the audio listener. My company is also focusing heavily on AR for visually impaired users: using pose estimation of a stylus to label objects, and using pose info and computer vision localization for way-finding. Projects like SoundScape would also be considered augmented reality for audio sources. One could also connect sound sources to markers; there is no reason a visual marker should be the only kind of marker used.

I must say, as a blind developer myself, the cognitive load of learning much of the AR technology is rather high, as everything is based on visual feedback and I need to translate it to audio feedback without really understanding what I am doing first. Neutralizing this language would really encourage equity within the XR space, as there is nothing inherently visual about XR. Once more advanced digital haptic displays become more commonplace, visual display will be only one third of the XR experience. Neutralizing the language now will mean a more robust specification for the future.

frastlin commented 4 years ago

Also, many of my users will use devices that don't have screens, such as: https://www.hims-inc.com/product/braillesense-polaris/ That device is also missing a touchscreen and a keyboard, and I'm still working to get non-semantic browser-based apps supported, but these are the devices blind users will be using to access WebXR. Most blind users utilize screen curtain on their phone and PC, which switches off the display, and they would prefer their computer not render graphics at all, as the graphics are not useful to them and drain the battery. It may be useful to separate output devices and allow the user to disable particular outputs from being sent to the sound card, GPU, and whatever the tactile processing unit will be called in the near future.

This could be another issue, but have semantic elements been defined for WebXR? Screen readers and other assistive technologies will need to be able to access attributes of menus, alerts, and pointer information, go into an input help mode so they can press buttons and hear what the input buttons are on their probably-not-Braille-labeled device, and be able to access meshes in the modality of their choice. I also don't want another div fest in XR, where everything is custom built because the HTML widgets are "too hard to customize".

frastlin commented 4 years ago

Also, animation in the spec should be clear that it is not "the manipulation of electronic images by means of a computer in order to create moving images.", but instead "the state of being full of life or vigor; liveliness." https://www.lexico.com/en/definition/animation

Same with "View". It should be "regard in a particular light or with a particular attitude." rather than "the ability to see something or to be seen from a particular place." http://english.oxforddictionaries.com/view

Everything should be written as if the user could access the object or scene from any sensory modality.

frastlin commented 4 years ago

Do you think it would be useful to broaden devices from headset devices to immersive devices? I'm pretty sure tactile displays won't be headsets, but rather gloves or nerve-interfacing displays. I have already been asked to build a tactile-only VR display for a map. Book reader devices for braille will likewise have no speech or visual output.

toji commented 4 years ago

Re: "Audio AR", my impression is that it's referring to something similar to the Bose AR glasses. Based on the information that I've been able to find about those kinds of uses I'm not entirely sure how they perform their location-based functions. I would actually be surprised if it was built around any form of real positional tracking, and am guessing it's more along the lines of Google Lens which surfaces information based on a captured image with little understanding of the device's precise location. In any case, I'd love to know more about how existing or upcoming non-visual AR devices work so we can better evaluate what the appropriate interactions are with WebXR.

Now, ignoring the above questions about how current hardware works, if we assume that we discover a device that provides precise positional tracking capabilities but has no visual output component we can brainstorm how that would theoretically work out. While it's not clear how web content would surface itself on such a device, it seems safe to say that traditional immersive-vr style content wouldn't be of much interest, and so we'd likely want to advertise a new session mode explicitly for audio-only sessions. Let's call it immersive-audio. Once that's established, the various text tweaks David mentions would be appropriate, but from a technical point of view the biggest change would be that an immersive-audio session wouldn't require a baseLayer in order to process XRFrames. Instead we would probably just surface poses via requestAnimationFrame() as usual and allow JavaScript to feed those poses into both the WebAudio API for spatial sound and into whatever services are needed to surface the relevant audio data. There are also some interesting possibilities that could come from deeper integration with the audio context, like providing poses directly to an audio worklet.
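For illustration only, here is a rough sketch of what such a hypothetical session might look like; the immersive-audio mode name and the layer-free behavior are speculative and not part of the current spec:

```js
// Speculative sketch: "immersive-audio" is NOT a real XRSessionMode today.
async function startAudioOnlySession() {
  if (!(await navigator.xr.isSessionSupported('immersive-audio'))) return;
  const session = await navigator.xr.requestSession('immersive-audio');

  // In this hypothetical mode no baseLayer would be required; poses
  // would still arrive through the normal frame loop.
  const refSpace = await session.requestReferenceSpace('local');
  const audioCtx = new AudioContext();

  session.requestAnimationFrame(function onFrame(time, frame) {
    const pose = frame.getViewerPose(refSpace);
    if (pose) {
      // Drive the Web Audio listener from the viewer's position.
      const { x, y, z } = pose.transform.position;
      audioCtx.listener.positionX.value = x;
      audioCtx.listener.positionY.value = y;
      audioCtx.listener.positionZ.value = z;
    }
    frame.session.requestAnimationFrame(onFrame);
  });
}
```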

Regardless, given the relative scarcity of this style of hardware today and the large number of unknowns around it I don't see any pressing need to move to support this style of content just yet. It's absolutely a topic that the Working Group should follow with great interest, though! Marking this issue as part of the "Future" milestone so we don't lose track of it.

frastlin commented 4 years ago

> if we assume that we discover a device that provides precise positional tracking capabilities but has no visual output component we can brainstorm how that would theoretically work out.

It's possible to have a camera, GPS, accelerometer, Bluetooth, Wi-Fi, and any number of sensors within a device without a screen, see: https://www.hims-inc.com/product/braillesense-polaris/

It's not explicitly used for VR, but it's an Android-based phone which can access the web and can do location tracking.

> While it's not clear how web content would surface itself on such a device

Web content is accessed through both Braille and text to speech from a screen reader.

> it seems safe to say that traditional immersive-vr style content wouldn't be of much interest, and so we'd likely want to advertise a new session mode explicitly for audio-only sessions. Let's call it immersive-audio.

What's your definition of "traditional immersive-vr style content"? Reality itself is inherently multisensory, and virtual reality has attempted to mimic this multisensory approach since the 1920s: https://www.vrs.org.uk/virtual-reality/history.html

I am actually not aware of any VR devices that have only visual output. All the devices I'm aware of have either audio only, or audio and visual. Nonvisual users do not want a separate audio only mode, as that would lead to greater discrimination and isolation. Instead, nonvisual users would like to access all VR and AR content, and if it is at all usable completely in audio (like most websites and many games unintentionally are), they want to be able to access that content as much as possible on their device.

toji commented 4 years ago

I apologize, because I think two subjects have been confusingly conflated here. My comments were primarily aimed at theorizing about how developers could create content specifically for "audio first" devices if desired. Allowing developers to target content to a specific form factor is something we've heard repeatedly from developers is important, and this scenario would be no different.

(And I will admit that, due to the circumstances in which I wrote my comment, I actually didn't even see your previous comments till just now. Sorry!)

As API developers we see this as distinct from the accessibility considerations you describe for a variety of reasons, primarily to avoid the discriminatory content patterns you mentioned. We definitely want to provide a variety of ways for developers to make content which has a visual component more accessible to nonvisual users. And while we don't prevent developers from adding audio to their VR experiences, it's not as fundamental to the API's use as the rendering mechanics, and it definitely should not be relied on as the sole source of accessibility.

We've had some conversations about this in the past (I'm trying to dig them up to link here), and there's been some recent threads started about visual accessibility as well. It's a tough issue, and one we take seriously, but also isn't the intended topic of this specific issue.

frastlin commented 4 years ago

> while we don't prevent developers from adding audio to their VR experiences, it's not as fundamental to the API's use as the rendering mechanics, and it definitely should not be relied on as the sole source of accessibility.

I think this is the topic we're discussing in this issue. Why do the rendering mechanics seem to require visual feedback? Can't we separate visual, auditory, and tactile rendering into their own render mechanics separate from the main loop? There is nothing about user position, accessing sensors detecting information about the user's environment, tracking objects, connecting between virtual space and physical space, or obtaining user input that is inherently visual. This means that the majority of the XR experience is not visual, and it shouldn't be required to be visual. Nothing in the WebXR scope is visual either.

Currently I see very little about audio or tactile rendering in the documents like: https://github.com/immersive-web/webxr/blob/master/explainer.md The above document has audio mentioned once and is the only document within the spec that has the words "audio", "sonic", "auditory", or "sound". There is absolutely no mention of tactile displays.

I would like to see:

  1. Language in the general spec switched from visual to a-modal.
  2. Examples given of XR experiences in modalities other than just visual.
kearwood commented 4 years ago

Hello from Mozilla!

This is a very interesting concept. I would love to explore the kind of experiences that you may enable with an API like WebXR without visuals. Do you have some particular ideas in mind that could help give context?

One that comes to mind for me is "Papa Sangre": https://en.wikipedia.org/wiki/Papa_Sangre

Perhaps a similar story could be more immersive if expanded into a non-visual, room scale experience.

Thanks for sharing your perspective. I would like to learn more.

Cheers,


frastlin commented 4 years ago

Auditory XR displays

There are 700+ games that can be played completely using audio at audiogames.net.

Papa Sangre is one very good example. I wrote a paper discussing the types of interfaces in audio games; the first-person 3D interfaces are the ones normally used for XR. One use case is the geographical maps I created that allowed users to explore a physical space in audio. There is a yearly week-long conference specifically on auditory displays called the International Conference on Auditory Display (ICAD). This last year, there was a paper on locating objects in space using AR; this would help someone find the jar of pickles that is in the back of the fridge behind everything. Another paper used 3D audio to create an interface that could help nonvisual users drive. Another paper explores an XR performance between musicians and dancers who were in different locations. Another paper presents an XR instrument that allows the exploration of Kepler's Harmonies of the World.

These are the papers off the top of my head from the last two years of ICAD. I'll get a few other sonification researchers to give their input. Each of the above papers has an extensive literature review that gives even more examples.

Edit: Here is an article with more examples of sonification that is easier to read than the academic papers.

For AR:

[Here is a project using computer vision and other sensors to provide turn-by-turn navigation indoors using audio.](https://link.springer.com/chapter/10.1007/978-3-319-94274-2_13) This project is using Aruco markers to digitally annotate objects that (for some reason) only have nondigital visual labels.

Tactile

TouchX is an interactive XR tactile device. HaptX Gloves are another tactile display, in glove form. Prime Haptic is another tactile VR glove, and VRgluv is yet another glove.

If you do a search for "Haptic Gloves", you'll find hundreds of examples. Do another search for "Haptic VR" and you'll find displays such as:

Woojer Vest, a haptic vest, and BHAPTICS, a haptic body suit.

Haptic Only Experiences

Accessing the internet with only a haptic glove

Some other experiences that need to be done through XR touch include:

...

Even if you used your sight for most of the above activities, I guarantee someone, like me, will only use touch.

Reading in Modes Other than Visual

Often the question is: "How is one going to access the web with just an audio or haptic interface?" Braille is tactile, and speech is auditory. I'm looking for the language one of the big companies is attempting to make with just haptic feedback that's similar to this vibration code. There was a hand that could finger-spell in ASL, there is ELIA, a tactile alphabet, and there are new modes of multisensory symbolic communication being developed all the time.

Cherdyakov commented 4 years ago

This is a really good topic of discussion. Building on your last comment @frastlin, another application for auditory displays is data analysis and data exploration. Wanda L. Diaz Merced is a visually-impaired researcher who worked on auditory display of physics data with NASA. Her research was with low-dimensional data, but spatialization is a popular area of sonification research with benefits similar to 3D data visualization, allowing for the mapping of additional dimensions or data relationships to space. Sometimes the spatial relationship in the sound is related to actual spatial relationships in the data, as in this paper on meteorological data sonification, but it can also be used in a representational way, or merely to increase performance in audio process monitoring tasks.

For a significant chunk of this research, accessibility is an added benefit. Most of the research in this area is for enhancing data analysis and process monitoring for all users. Even users who take advantage of visual displays are researching audio-only and audio-first immersive technology. The accessibility benefits are significant of course, and sonification is a topic of research for equal-access in education (1), (2), which makes support for auditory display in immersive web technologies exciting as on-line education becomes a norm. It would be great for immersive tech on the web to go even further than traditional tech in this direction.

LJWatson commented 4 years ago

I'd caution against creating too hard a line between visual and audio XR. There will be times when both sighted and non-sighted people will want to experience the same XR space, and either to share that experience in real time or to be able to compare experiences later on. There will also be times when an XR space is entirely visual or entirely audio, of course.

The language in the spec (understandably) emphasises the visual, but I think @ddorwin is right in saying that some slight changes to the language could gently protect the spec from inadvertently restricting future possibilities.

frastlin commented 4 years ago

Exactly, I would like to see:

  1. Language in the general spec switched from visual to a-modal.
  2. Examples given of XR experiences in modalities other than just visual, including auditory only, visual and auditory, and maybe visual, auditory, and tactile.

I don't think this is extremely radical, and my hope is that 90% of the content will be multisensory. What I would like to see is a recognition that an XR experience could be visual, auditory, tactile, or any combination of the senses.

kearwood commented 4 years ago

For the WebIDL portion, perhaps a non-visual XRSession could be created without an XRLayer. It would also be interesting to explore what kind of XRLayer derivatives would support additional sensory feedback devices. This could also have some implications on modality, perhaps with an additional XRSessionMode to indicate the capability of new kinds of immersion with such devices. It would be of great benefit to have someone with such direct experience guide these kinds of choices.

I suspect that such changes could be made additively without breaking compatibility with the existing core WebXR spec. Would anyone be interested in (or opposed to) creating an incubation repo to continue this discussion more openly?

Cheers,


frastlin commented 4 years ago

Sure, another repo for this may be useful.

> For the WebIDL portion, perhaps a non-visual XRSession could be created without an XRLayer.

What do you mean without an XR layer? It would just be without the visuals. I have it on my short list of things to do to get a WebXR app working on my iPhone through WebXR Viewer (https://apps.apple.com/us/app/webxr-viewer/id1295998056). I will want access to all the APIs and tracking info a WebXR session gives; I just won't be using WebGL for anything, and instead will be making the UI out of the Web Audio API and aria-live regions, with an optional Web Speech API for those without a screen reader. After I make my first app, I can give more guidance on what could be changed.

cmloegcmluin commented 4 years ago

I can provide yet another example of an audio-only virtual reality experience using the API. I am working on an audio-only WebXR experience using Bose AR. It is a musical composition of mine which is "epikinetic", i.e., your body's movement causes effects besides merely translating and reorienting your avatar; in this case, the music progresses forward through its developments and changes properties depending on your motion.


kearwood commented 4 years ago

> What do you mean without an XR layer?

More specifically, I mean without an XRWebGLLayer, allowing you to use WebXR without having to create a WebGL context:

https://immersive-web.github.io/webxr/#xrwebgllayer-interface

The language of the spec says that we would not be able to get an active XRFrame if XRSession.baseLayer is not set to an XRWebGLLayer. I am proposing that we explore options of allowing non-visual XRSessions, without an XRWebGLLayer, that can still get poses from an XRFrame for use with WebAudio.

Of course, it would also be possible to use WebXR while rendering nothing but a black screen. I would like to know if allowing usage of WebXR without the rendering requirements would perhaps enable this to be used in more scenarios, such as on hardware that has no display or GPU present at all.

frastlin commented 4 years ago

Yes, removing the GL layer, or not requiring it, would be perfect. Is it possible to run a browser like Firefox or Chrome without a GPU? When I build a computer, I always add an inexpensive GPU because many things don't run unless you have one. But removing the GPU requirement would increase the accessibility of the WebXR spec for devices that may not have a GPU.

joshueoconnor commented 4 years ago

Great thread, thanks @ddorwin and also to @frastlin for the great links!

frastlin commented 4 years ago

Hello, has a separate repository been created for discussion of nonvisual usage? I've been studying the API, and to be honest, if it is possible to easily run a blank XRSession.requestAnimationFrame, then we'll be fine. XRSession.requestAnimationFrame feels like a main loop to me, which does much more than just schedule graphics. It can update the positions of objects, run events, and estimate delta time so everything is in sync with your application. I've not had problems with using setTimeout for any of these functions, but having an XR main loop seems useful, especially if XRSession.requestReferenceSpace requires it. Is there a plan to add a setTimeout that runs either at game speed or delta speed?
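For reference, a minimal sketch of treating the XR frame callback as a main loop with delta time; updateWorld is just a placeholder for application logic, not an existing API:

```js
// Sketch: the timestamp passed to XRSession.requestAnimationFrame() can be
// used to compute delta time, much like a game's main loop.
let lastTime = null;

function onXRFrame(time, frame) {
  const deltaSeconds = lastTime === null ? 0 : (time - lastTime) / 1000;
  lastTime = time;

  updateWorld(deltaSeconds);          // placeholder: move objects, run events...
  // frame.getViewerPose(referenceSpace) is also available here for tracking.

  frame.session.requestAnimationFrame(onXRFrame);
}

// Once a session exists: xrSession.requestAnimationFrame(onXRFrame);
```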

frastlin commented 4 years ago

Should I start going through the spec and pushing changes and adding examples?

toji commented 4 years ago

I've filed a new issue to track evaluating how the API can/should interact with audio-only devices at #892 so that this thread can stay focused on ensuring the spec doesn't require a visual component, which has gotten a lot more discussion thus far.

> I've been studying the API, and to be honest, if it is possible to easily run a blank XRSession.requestAnimationFrame, then we'll be fine. XRSession.requestAnimationFrame feels like a main loop to me, which does much more than just schedule graphics. It can update the positions of objects, run events, and estimate delta time so everything is in sync with your application.

Yes, this would be the right path for non-visual uses of the API today. For historical reasons there are a couple of points within the API that indicate that a baseLayer (which ties the XRSession to a WebGL context) must be set before the frame loop will run. This was because we were originally worried about the page spinning up a session that was then secretly used to track the user invisibly. Requiring that there be some sort of output to the page was intended to give a sense of security that the page couldn't track your movement without you noticing.

The API has evolved a fair amount since then, and that concern no longer really applies. The primary reason why is that we have some level of user consent baked into the API now for any scenario where the page might have otherwise been able to silently initiate device tracking. As such the requirement for a baseLayer can reasonably be seen as unnecessary and we can look at removing it.

In the meantime, it's pretty easy to create an XRWebGLLayer, set it as the base layer, and then never draw to it. (You could take it a step further and create the XRWebGLLayer with a framebufferScaleFactor of, say, 0.01 to get a really tiny WebGL buffer back if you were worried about the memory costs.) The headset will just output black, but the frame loop will run correctly. The key is to just query the XRViewerPose as usual each frame and then feed the viewer's transform into Web Audio's PannerNode. This will require converting WebXR's orientation quaternion into a direction vector each frame, but that's not hard or expensive.
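A rough sketch of that workaround might look like this (variable names and the feedPoseToWebAudio helper are illustrative, not from the spec):

```js
// Sketch: set up a tiny dummy layer that is never drawn to, then use the
// frame loop purely for poses.
async function startNonVisualSession() {
  const session = await navigator.xr.requestSession('immersive-vr');

  const canvas = document.createElement('canvas');
  const gl = canvas.getContext('webgl', { xrCompatible: true });
  const layer = new XRWebGLLayer(session, gl, { framebufferScaleFactor: 0.01 });
  session.updateRenderState({ baseLayer: layer });

  const refSpace = await session.requestReferenceSpace('local');
  session.requestAnimationFrame(function onFrame(time, frame) {
    const pose = frame.getViewerPose(refSpace);
    if (pose) {
      // The headset just shows black; the pose is only used for audio.
      feedPoseToWebAudio(pose);   // illustrative helper, defined elsewhere
    }
    frame.session.requestAnimationFrame(onFrame);
  });
}
```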

> I've not had problems with using setTimeout for any of these functions, but having an XR main loop seems useful, especially if XRSession.requestReferenceSpace requires it. Is there a plan to add a setTimeout that runs either at game speed or delta speed?

setTimeout() should not (cannot) be used with WebXR, as it will not provide the XRFrame object that XRSession.requestAnimationFrame() does, which is needed for tracking. It also won't properly sync with the XR device's display refresh rate, but obviously that's a lesser concern for a non-visual app. Here it would really be more about minimizing latency.

frastlin commented 4 years ago

OK, so starting the language discussion, what is an XR device? According to line 113 in explainer.md:

> The UA will identify an available physical unit of XR hardware that can present imagery to the user, referred to here as an "XR device".

To update that definition, here are a couple possibilities:

toji commented 4 years ago

I think something like the first definition you listed would be appropriate, and it probably deserves some further explanation as well. (I'm realizing now that we probably don't ever define the term "immersive"? Oops.) Maybe something like this:

> The UA will identify an available physical unit of XR hardware that can present immersive content to the user. Content is considered to be "immersive" if it produces visual, audio, haptic, or other sensory output that simulates or augments various aspects of the user's environment. Most frequently this involves tracking the user's motion in space and producing outputs that are synchronized to the user's movement.

frastlin commented 4 years ago

I like it! So would you like to make the change, or should I? I'm reading through explainer.md and see phrases that should be changed:

> With that in mind, this code checks for support of immersive VR sessions, since we want the ability to display imagery on a device like a headset.

Should be:

> With that in mind, this code checks for support of immersive VR sessions, since we want the ability to display content on a device like a headset.

> These should be treated as the locations of virtual "cameras" within the scene. If the application is using a library to assist with rendering, it may be most natural to apply these values to a camera object directly, like so

The pose should not be only for cameras, but for listener objects as well.

> These should be treated as the locations of virtual "cameras" or "listeners" within the scene. If the application is using a library to assist with rendering to a WebGL canvas, it may be most natural to apply these values to a camera object directly, like so

I'm wondering if this whole section should be under a subheading called "Viewer Tracking with WebGL", because the discussion should be focused on viewer tracking as a whole and not just on updating viewer tracking in a WebGL context. Many 3D libraries like Babylon also move the audio listener object along with the camera, so the developer is going to need to be aware of whether their library does that.

> Controlling rendering quality

Should be:

> Controlling rendering quality through WebGL

toji commented 4 years ago

Pull requests are definitely welcome regarding this issue! Any minor issues can be worked out in the review process.

A few specific comments:

> Line 188 says you are required to have a WebGL base layer, and I don't think it should be needed. Should we start another issue discussing the WebGL base layer requirement?

Yes, though be aware that it would be a (minor) backwards compat issue and there's not going to be much appetite for actively addressing it right away, especially since in the meantime the path of setting up a dummy layer offers a way forward for non-visual content.

> Line 275 should have a section on working with the Web Audio API, saying how to convert from the world position to the Web Audio API position.
>
> There should be an example around line 292 with the Web Audio API and the pose object.

Having a section regarding interop with WebAudio would be great for the explainer! (It's not going to be the type of thing that we'll be able to surface in the spec itself, though.)

> I think the Web Audio API should take the XR frame output, or the XR frame object should have a function to output the Web Audio positional arguments.
>
> Line 278, should the XR frame also make a copy of the audio listener object?

These sound like topics for discussion in #390.

> Line 323: ... The pose should not be only for cameras, but for listener objects as well.

Let's be careful here, because listeners should absolutely not be placed at the transforms described by the views array. Each view really truly does represent a visual perspective into the scene, defined through some combination of the user's physical characteristics (IPD) and the device optics. Placing listeners at any individual view would be akin to attaching ears to your eyes.

Instead, when integrating with audio APIs the views array can be ignored entirely and the transform of the XRViewerPose itself should be used to position listeners, as that is more likely to align with the center of the user's head. Standard HRTF modeling will handle the discrepancy between the position of each ear.
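A sketch of what that might look like (the quaternion math is written out inline rather than pulled from a library; helper names are illustrative):

```js
// Rotate a vector by a quaternion: v' = v + w*t + cross(q, t), t = 2*cross(q, v).
function rotateByQuaternion(v, q) {
  const tx = 2 * (q.y * v.z - q.z * v.y);
  const ty = 2 * (q.z * v.x - q.x * v.z);
  const tz = 2 * (q.x * v.y - q.y * v.x);
  return {
    x: v.x + q.w * tx + (q.y * tz - q.z * ty),
    y: v.y + q.w * ty + (q.z * tx - q.x * tz),
    z: v.z + q.w * tz + (q.x * ty - q.y * tx),
  };
}

// Position and orient the Web Audio listener from the XRViewerPose transform
// (not from the individual views). -Z is "forward" and +Y is "up" in XR space.
function updateListener(listener, viewerPose) {
  const { position, orientation } = viewerPose.transform;
  const forward = rotateByQuaternion({ x: 0, y: 0, z: -1 }, orientation);
  const up = rotateByQuaternion({ x: 0, y: 1, z: 0 }, orientation);

  listener.positionX.value = position.x;
  listener.positionY.value = position.y;
  listener.positionZ.value = position.z;
  listener.forwardX.value = forward.x;
  listener.forwardY.value = forward.y;
  listener.forwardZ.value = forward.z;
  listener.upX.value = up.x;
  listener.upY.value = up.y;
  listener.upZ.value = up.z;
}

// Each frame: updateListener(audioCtx.listener, frame.getViewerPose(refSpace));
```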

> Line 439: ... On iOS, there are games that only work with headphones; I'm not sure if one can make the same determination in the browser.

Not that I'm aware of, nor do I think the browser is particularly interested in communicating that due to fingerprinting concerns. I think a written/spoken disclaimer that the experience won't work as intended without headphones is the most reliable way forward here.

> Line 484, can we have an audio headset be a required or optional feature? What about vibration or force feedback?

There's ongoing discussions about what's appropriate to allow as a required/optional feature. I don't have a clear answer on that right now.

> Line 563 talks about the near and far planes. There should be attenuation for the audio objects talked about here.

These are not related concepts. The near and far plane are explicitly related to the projection matrix math done for WebGL and should have no effect on audio. Audio attenuation is wholly the responsibility of the WebAudio API and, to my knowledge, is a content-specific choice rather than a device intrinsic.

> It would be great if there was some feature of WebXR that dealt with object attenuation or filtering...

Again, this falls outside the scope of WebXR and should be facilitated by WebAudio (likely in conjunction with a library). WebXR has no concept of a 3D scene or rendered geometry or anything like that. It is a mechanism for surfacing the device's sensor data in a way that enables developers to present their content appropriately, and it facilitates outputting visuals to the hardware because that's not adequately covered by any existing web APIs. Anything beyond that is the responsibility of the developer. Libraries like A-Frame and Babylon are more opinionated about how their content is represented, and thus are a better place to define audio interactions like this.

frastlin commented 4 years ago

I submitted a PR for many of the changes we talked about for explainer.md. I did another PR for the spec itself to change the definition of "XR device". I submitted a PR with an example connecting WebXR and the Web Audio API, and I opened an issue about removing the WebGL requirement.

RealJoshue108 commented 3 years ago

@frastlin I'm reviewing this fascinating thread, and am wondering what came out of this discussion and the various PRs etc. that you suggested.

I'd appreciate it if you could post a brief update, or ping me on joconnor(at)w3.org. Many of your original points relate to work we are doing in the Research Questions Task Force. For example, what you are suggesting is in line with our thinking around the idea of 'modal muting', where visual modes may not be needed or consumed by a device or user agent but can still be kept in sync when expressed as functions of time, in terms of shared spaces used in immersive environments.

It would be also great to get your input into our work :-)

RealJoshue108 commented 3 years ago

While I know that @frastlin is well aware of this, something for others involved in the development of the WebXR specs to consider is that the term "screen reader" is a misnomer. Reading the screen is only part of what they do; they also facilitate navigation and interaction.

frastlin commented 3 years ago

@RealJoshue108 None of my PRs have been accepted. https://github.com/immersive-web/webxr/pull/925 was marked as unsubstantive, and I'm not sure what to do with https://github.com/immersive-web/webxr/pull/927 to determine affiliation. https://github.com/immersive-web/webxr/pull/930 still needs some testing.

Google's AR with animals is one example where not connecting the XR position and the listener position has been a major detriment to the experience. When I move around and turn my head, the sound never changes. This is exactly what I mean when I say that the AR and VR listeners need an easy way to sync with one another. I would love to interact with the AR animal that is in our room, but because it's too difficult to sync audio and visuals, it was not done. Please can we fix this before WebXR becomes more prevalent?

@toji can we add a head size argument somewhere? How is the distance between two eyes determined? For ears, there is a head-related transfer function (HRTF) that figures out how sounds should be changed based on head size. There is a default setting for this in the 3D audio spec, but to my knowledge there's no way to change head size in the spec. If the WebXR spec could give options for this, it would probably be useful for both visual and auditory output. For quick and easy results, we just need to position the audio listener to match the position and tilt of the head. I can't emphasize enough how important it is that the WebXR spec allow easy syncing of the visuals and audio. Otherwise it's not XR, it's visual XR.

frastlin commented 3 years ago

@RealJoshue108 Currently, there is no semantic method of interacting with XR content in the browser, so screen reader navigation functions are turned off. Another tool that XR could enable is a hidden aria-live region to send messages to a screen reader. There are two semantics that currently exist for screen readers in XR: an element that grabs focus, and an aria-live region.
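As a minimal sketch of that idea (the element creation and the announce helper are illustrative, not an existing API):

```js
// Sketch: a visually hidden aria-live region for screen reader announcements,
// with an optional Web Speech fallback for users not running a screen reader.
const liveRegion = document.createElement('div');
liveRegion.setAttribute('aria-live', 'polite');
liveRegion.style.position = 'absolute';
liveRegion.style.left = '-9999px';          // off-screen but still in the tree
document.body.appendChild(liveRegion);

function announce(message, { speak = false } = {}) {
  liveRegion.textContent = message;         // screen readers pick up the change
  if (speak && 'speechSynthesis' in window) {
    speechSynthesis.speak(new SpeechSynthesisUtterance(message));
  }
}

// e.g. announce('You bumped into the north wall.');
```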

RealJoshue108 commented 3 years ago

@frastlin Have you looked at the DOM Overlays API spec? This is promising, as it allows HTML and other code like ARIA attributes or potentially even personalization semantics to be embedded within an XR environment.

frastlin commented 3 years ago

@RealJoshue108 this is very useful for overlays, and if combobox or edit field HTML elements can be the overlay and grab the focus of the screen reader, then it would work really well for overlays. It's never ever a good idea to have non-screen-reader users mucking about in ARIA; it's like programming CSS without a screen. I would highly recommend either new elements, or new versions of the existing elements, for overlays.

There also needs to be some kind of access to the meshes that is nonvisual. Similar to how HTML declares elements on a page, there need to be similar elements for the XR space. That way, screen readers or other user agents can add their own tools for interacting with the XR meshes or objects that don't require the creator to know anything about their users' interaction patterns. A poignant example of how this would be extremely useful is the Mozilla Hubs project. I love the idea, but as a screen reader user, I'm so lost in this space.

  1. There's no indication letting me know I'm moving or hit a wall or other object.
  2. There's no way for me to find anyone unless they're talking.
  3. I have no idea what the environment looks like, or what objects are in the environment.
  4. I am unable to create or draw objects
  5. There is no way for me to turn off the screen rendering which slows down my computer and makes my screen reader lag.

To fix these problems, there needs to be a mesh and object DOM, along with collision events. There needs to be a name requirement for the objects, and there needs to be some kind of way to show the environment. This cannot be left up to engines like A-Frame; otherwise I can wave goodbye to any XR that's not built with nonvisual users in mind. The current browse mode for screen readers is not and will not be useful in XR, so another semantic language needs to be developed. Once tactile XR is more mainstream, almost every object will need to be accessible both visually and tactilely.

klausw commented 3 years ago

@frastlin wrote:

> @RealJoshue108 this is very useful for overlays, and if combobox or edit field HTML elements can be the overlay and grab the focus of the screen reader, then it would work really well for overlays. It's never ever a good idea to have non-screen-reader users mucking about in ARIA; it's like programming CSS without a screen. I would highly recommend either new elements, or new versions of the existing elements, for overlays.

The current DOM overlay specification doesn't have any restrictions on HTML elements, so form input elements should work as expected. Typically, the application would use a transparent DIV element as the overlay, and this would then contain other elements placed within that DIV.
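A brief sketch of what requesting such an overlay looks like (the element id is illustrative):

```js
// Sketch: request an AR session with the DOM Overlay feature, rooted at a
// transparent container DIV whose children are ordinary HTML controls.
async function startAROverlaySession() {
  const overlayRoot = document.getElementById('overlay-root');  // illustrative id
  const session = await navigator.xr.requestSession('immersive-ar', {
    requiredFeatures: ['dom-overlay'],
    domOverlay: { root: overlayRoot },
  });
  // Form elements inside overlayRoot keep their normal semantics, so
  // assistive technology can reach them while the session runs.
  return session;
}
```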

For example, this stock ticker experiment uses text input and select elements. And this model-viewer example with annotations has DOM nodes that move along with the model.

I don't have access to a dedicated screen reader, but Android's built-in "TalkBack" accessibility feature appears to work as expected for the content of the DOM layer.

This of course doesn't solve the overall problem of making applications accessible since the WebGL layer is separate.

> There also needs to be some kind of access to the meshes that is nonvisual. Similar to how HTML declares elements on a page, there need to be similar elements for the XR space. That way, screen readers or other user agents can add their own tools for interacting with the XR meshes or objects that don't require the creator to know anything about their users' interaction patterns.

This is unfortunately not easy. WebGL doesn't inherently have a concept of meshes or scene graphs. The API basically provides access to programmable shader pipelines that produce screen pixels, and there aren't any semantic hooks at the WebGL API level that seem suitable for annotating objects.

There have been multiple discussions of declarative 3D based on a DOM-style scene graph, but as far as I know this hasn't been getting much traction.

Manishearth commented 3 years ago

> @RealJoshue108 None of my PRs have been accepted. #925 was marked as unsubstantive and I'm not sure what to do with #927 to determine affiliation. #930 still needs some testing.

Yeah, we had milestoned them as Future since we didn't see them as necessary for CR. @toji and I can still review them, though.

> @toji can we add a head size argument somewhere?

I don't think devices typically surface this property. Fine control over HRTF parameters is something that would have to happen through the WebAudio API, I think.

> How is the distance between two eyes determined?

Devices often have a calibration for this; we take in this calibration information in the form of the viewer eye offsets.

> To fix these problems, there needs to be a mesh and object DOM, along with collision events. There needs to be a name requirement for the objects, and there needs to be some kind of way to show the environment. This cannot be left up to engines like A-Frame; otherwise I can wave goodbye to any XR that's not built with nonvisual users in mind.

There are rough plans for declarative XR, but they're probably something that will take a while to get to.

> I can't emphasize enough how important it is that the WebXR spec allow easy syncing of the visuals and audio. Otherwise it's not XR, it's visual XR.

As it stands this requires some heavy collaboration between WebXR and the WebAudio people. Our rough plan is to do this as a separate WebXR module, not as part of this one.

The hope is that XR frameworks make this easy to do, and in my understanding many of them do already.

frastlin commented 3 years ago

@Manishearth I am shocked that head-mounted displays don't surface a calibration for the headphones. The Web Audio API already has a default head size, so if you give the audio listener a position somewhere around the user's head, it will use that default. So if we know either the point at the top of the head, the nose, the point in the center of the head, or the point at the forehead, we could set that point as the audio listener position as it currently stands, and it would be good enough for most people. I think the audio API people need to add a way to set the head size, but if there is a way to obtain that information from a device, we should give that to the Web Audio API.
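For what it's worth, a minimal sketch of that default behavior with the Web Audio API as it exists today (node names are illustrative); a PannerNode in 'HRTF' mode spatializes from the listener's pose alone, with no explicit head-size input:

```js
// Sketch: spatialize a source with the built-in HRTF panning model.
// No head size is supplied anywhere; the listener pose is all it uses.
const audioCtx = new AudioContext();
const source = new OscillatorNode(audioCtx);        // stand-in for real content

const panner = new PannerNode(audioCtx, {
  panningModel: 'HRTF',                             // uses a generic head model
  distanceModel: 'inverse',
  positionX: 2, positionY: 0, positionZ: -1,        // sound source location
});

source.connect(panner).connect(audioCtx.destination);
source.start();

// Each XR frame, only the listener needs updating from the viewer pose, e.g.:
// audioCtx.listener.positionX.value = viewerPose.transform.position.x; // etc.
```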

Manishearth commented 3 years ago

> So if we know either the point at the top of the head, the nose, the point in the center of the head, or the point at the forehead

Unfortunately all we know is where the eyes are. There's a third point called the "viewer" that is typically roughly the nose (or midpoint of the eyes) but there's no requirement it be attached to any specific point on the face.

> I think the audio API people need to add a way to set the head size, but if there is a way to obtain that information from a device, we should give that to the Web Audio API.

Right, that's something that should be filed on the WebAudio API IMO. But it's not very useful unless there is a way of getting this data from devices.

frastlin commented 3 years ago

If you know the position of each eye, can't you figure out what is in the center of the two?

Manishearth commented 3 years ago

> If you know the position of each eye, can't you figure out what is in the center of the two?

Sure, but that's not going to tell you the head size.

frastlin commented 3 years ago

All you need with the current web audio API is that center point, and the size of the head is already estimated. If there is a way to obtain the head size from the unit, that is better, but not needed with the Audio API today.
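A sketch of that idea (whether you use the viewer transform directly or the midpoint of the two eye views, the result is roughly the center of the head; the helper name is illustrative):

```js
// Sketch: estimate the head "center point" from the two eye views, falling
// back to the viewer transform on mono devices. viewerPose is an XRViewerPose.
function headCenterFromViews(viewerPose) {
  const [left, right] = viewerPose.views;
  if (!left || !right) return viewerPose.transform.position;
  const a = left.transform.position;
  const b = right.transform.position;
  return { x: (a.x + b.x) / 2, y: (a.y + b.y) / 2, z: (a.z + b.z) / 2 };
}

// const c = headCenterFromViews(pose);
// audioCtx.listener.positionX.value = c.x;  // etc.
```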

Manishearth commented 3 years ago

> All you need with the current web audio API is that center point, and the size of the head is already estimated. If there is a way to obtain the head size from the unit, that is better, but not needed with the Audio API today.

Right, all I was saying was that there's not enough to know the size of the head.

Anyway, I've filed https://github.com/immersive-web/proposals/issues/59 for a potential module that integrates WebXR and WebAudio.