immersive-web / webxr

Repository for the WebXR Device API Specification.
https://immersive-web.github.io/webxr/

Consider hooking up sound source nodes in the API somehow #390

Open cwilso opened 5 years ago

cwilso commented 5 years ago

There have been requests to add sound to the scope of the WebXR API. There are two aspects to this - first, that we should manage the audio inputs/outputs associated with an XR device. This is already covered by https://github.com/immersive-web/webxr/issues/98. The second aspect is enabling developers to easily position sound sources in the virtual space, and use an HRTF (Head-Related Transfer Function) or multi-speaker setup to properly "position" the sound.

It is relatively straightforward to use Web Audio's PannerNode to hook up a posed sound source to the head pose - in fact, three.js does exactly this with its PositionalAudio source object. However, the problem lies in keeping the headpose (and the sound source pose) updated at a high enough frequency - ideally, letting the audio thread get headpose info directly, or the like.
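For reference, a rough sketch of that three.js pattern (this assumes the app already has a three.js camera and a mesh, and the audio file path is illustrative):

// Sketch of the existing three.js approach; `camera`, `mesh`, and the file path
// are placeholders for an app's own objects and assets.
const listener = new THREE.AudioListener();
camera.add(listener); // the underlying Web Audio listener follows the camera pose each render

const sound = new THREE.PositionalAudio(listener); // wraps a Web Audio PannerNode
new THREE.AudioLoader().load('sounds/ping.ogg', (buffer) => {
    sound.setBuffer(buffer);
    sound.setRefDistance(1);
    sound.play();
});
mesh.add(sound); // the PannerNode position tracks the mesh's world position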

(Note that I don't consider this a high-priority today - Issue #98 is more important, and even that is a future enhancement - but I wanted to capture it.)

blairmacintyre commented 5 years ago

Why wouldn't we use existing web APIs (getUserMedia, webAudio)? How are they lacking, in a way that couldn't be solved by updating them?

As with accessing video, it seems like enhancing existing web APIs, or perhaps somehow creating a binding between them, would be preferable to creating a new, different API. For audio, in particular, webAudio seems pretty good, and if the issue is synchronizing the headpose of the audio for spatialization, this seems like something that could be solved with a small additional feature in webAudio.

Manishearth commented 5 years ago

The problem is that you have to manually send the pose data to webAudio, which carries some overhead.

Ideally, WebAudio would have a mode where it could be told "the position of this panner node is to reflect the head pose" and then it uses realtime head pose data.

cwilso commented 5 years ago

@Manishearth hit the nail on the head. We would (presumably) use media streams and Web Audio. Web Audio even already has 3D positioning - and yes, the key missing piece is synchronizing the poses (it's not just headpose - it's also the pose of each individually-placed sound-producing object) - or more to the point, minimizing the latency of keeping those poses updated, and getting them updated in the audio thread on a regular basis.

It's possible this is just advice and best practice for web audio; it's possible we'd want a small feature tying web audio (PannerNode or a derivative) to an XR Session and some poses. It may not turn into an actual feature in XR - but that all needs to be explored, and this seems like the best place to track it to me.

blairmacintyre commented 5 years ago

@cwilso @Manishearth yes, exactly the approach I was imagining. Getting someone to explore this would be great.

Manishearth commented 5 years ago

@kearwood and I discussed this a bit, and he brought up that a nice API for this would be to allow attaching XRSpaces to the AudioListener and PannerNodes; implementors could then internally use the XRSpace reference from the render thread to quickly query (or request push updates for) positional information for the relevant objects.
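Purely as a hypothetical sketch (neither interface has anything like this today; setXRSpace is an invented name used only for illustration), the attachment might look like:

// Hypothetical only - neither AudioListener nor PannerNode has setXRSpace today.
// `soundSourceSpace` and `viewerSpace` stand in for an app's own XRSpaces.
const panner = new PannerNode(audioCtx, { panningModel: 'HRTF' });
panner.setXRSpace(xrSession, soundSourceSpace);       // hypothetical: panner tracks this XRSpace
audioCtx.listener.setXRSpace(xrSession, viewerSpace); // hypothetical: listener tracks the viewer
// The UA could then update both poses on the audio thread without round-tripping through JS.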

kearwood commented 5 years ago

It seems that for a V1 of this integration, it may be reasonable to implement a function that would be called within each XRSession.requestAnimationFrame callback to explicitly synchronize information about poses across to WebAudio.

This provides the benefit that there would be no need to manually copy members across, but it would not be a "set and forget" approach that updates the position continuously.

There may need to be some kind of smoothing or interpolation to avoid pops and clicks as tracking state is lost, regained, or operating at various sampling rates.

The focus could be on supporting headphone-based 3d spatialization for the majority of the cases, and additively support things such as speaker arrays in a CAVE system later.

Security implications of leaking poses are avoided by requiring state to be explicitly transferred during XRSession.requestAnimationFrame.
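On the smoothing point above, a rough sketch of one possible approach is to use AudioParam automation rather than direct .value writes (the 30 ms time constant here is an arbitrary illustrative choice, not from any spec):

// Sketch only: smooth listener position updates to reduce pops when tracking jumps.
function setListenerPositionSmoothed(listener, position, audioCtx) {
    const timeConstant = 0.03; // illustrative value
    listener.positionX.setTargetAtTime(position.x, audioCtx.currentTime, timeConstant);
    listener.positionY.setTargetAtTime(position.y, audioCtx.currentTime, timeConstant);
    listener.positionZ.setTargetAtTime(position.z, audioCtx.currentTime, timeConstant);
}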

kearwood commented 5 years ago

As a page will not be receiving pose updates while blurred (e.g., while a system dialog is displaying a permission request), content would be required to explicitly duck and/or mute directional audio that may be distracting and/or feel broken when pose updates are no longer being tracked.
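A rough sketch of one way content might handle that, assuming spatialized sources are routed through a shared GainNode (the gain values and time constant are illustrative):

// Sketch: duck spatialized audio while the XRSession is not fully visible.
const spatialBus = new GainNode(audioCtx);
spatialBus.connect(audioCtx.destination); // panner nodes would connect into spatialBus

xrSession.addEventListener('visibilitychange', () => {
    const ducked = xrSession.visibilityState !== 'visible';
    const target = ducked ? 0.1 : 1.0;
    spatialBus.gain.setTargetAtTime(target, audioCtx.currentTime, 0.1);
});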

kearwood commented 5 years ago

While it may be interesting to use XR device sensing and world awareness for features such as selecting an appropriate reverb impulse response to match the room shape, this would be a non-goal for v1 as the security model required has not been described.

cwilso commented 5 years ago

(Summing up lunchtime conversation between @kearwood, @Manishearth and myself)

I still think "v1" of audio-in-XR is what Kip described, and is possible today: developers can implement code within their XRSession.requestAnimationFrame callback to explicitly synchronize the headpose and XRReferenceSpace to WebAudio PannerNodes and AudioListener, respectively.

v2 of audio-in-XR is making this connection automatic - we can incubate this as partial interfaces off the Web Audio API's AudioListener and PannerNode interfaces, I expect. The implementation will raise more security and privacy concerns (e.g., Kip pointed out that the blurring and blocking of pose data that happens when prompts are on-screen would have to be applied to this too).

v3 is probably looking at more advanced world awareness (e.g. reverb based on room), which likely has even more concerning security and privacy implications.

frastlin commented 4 years ago

For V1, I would like to have some language in explainer.md under the "Viewer tracking" header that shows exactly how to connect the Web Audio API to WebXR today. I'm thinking there could be a heading level 4 for visual viewing, and another heading level 4 for auditory viewing. We need to convert WebXR's orientation quaternion into a direction vector in the frame; I think we need to apply the Rodrigues rotation formula. I'm not sure if we just take the first item in the views list, and I'm not sure of the exact formula needed. Here is an example with the location for the conversion code as a set of comments, because I'm not exactly sure what needs to be done:

// initialize the audio context
const AudioContext = window.AudioContext || window.webkitAudioContext;
const audioCtx = new AudioContext();

function onDrawFrame(timestamp, xrFrame) {
    // Do we have an active session?
    if (xrSession) {
        let listener = audioCtx.listener;

        let pose = xrFrame.getViewerPose(xrReferenceSpace);
        if (pose) {
            // Run imaginary 3D engine's simulation to step forward physics, PannerNodes, etc.
            scene.updateScene(timestamp, xrFrame);

            const view = pose.views[0];
            // Do something here to get the rotation and position in a direction vector from the view quaternion
            // set all the listener attributes to have a value of the vector.

        }
        // Request the next animation callback
        xrSession.requestAnimationFrame(onDrawFrame);
    }
}
Manishearth commented 4 years ago

You don't need any special math for this. Provided that you place your panner nodes appropriately based on your xrReferenceSpace, you can just use getViewerPose's transform position/orientation directly.

Ideally, though, we should have an API that allows for realtime linkage behind the scenes, where you "set and forget" an XRSpace on the listener node and the updates happen without going through JS.

frastlin commented 4 years ago

OK, what xrReferenceSpaces can translate directly to the vector in Web Audio? Also, what is the order of arguments? I would like to put an example in explainer.md that shows how to do this now, because any application with 3D/VR sound will need to use this algorithm.

Manishearth commented 4 years ago

OK, what xrReferenceSpaces can translate directly to the vector in Web Audio?

It doesn't matter, as long as its origin is stationary (so, not viewer). Just use local or something. Everything is relative in WebAudio, so as long as all numbers are in the same coordinate space it should be fine. I don't know what you mean by the order of arguments; the listener has setPosition and setOrientation methods. Just use those, and place the panner nodes appropriately. There's no algorithm here.
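As a concrete sketch of placing a panner node appropriately - assuming a sound source attached to a tracked object whose XRSpace is soundSourceSpace, and an existing source audio node (both placeholders for an app's own objects):

// Sketch: position a PannerNode from a tracked object's pose, expressed in the
// same xrReferenceSpace used for getViewerPose().
const panner = new PannerNode(audioCtx, { panningModel: 'HRTF' });
source.connect(panner).connect(audioCtx.destination);

function updatePannerFromPose(xrFrame) {
    const sourcePose = xrFrame.getPose(soundSourceSpace, xrReferenceSpace);
    if (sourcePose) {
        const p = sourcePose.transform.position;
        panner.positionX.value = p.x;
        panner.positionY.value = p.y;
        panner.positionZ.value = p.z;
    }
}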

frastlin commented 4 years ago

If the values are the same, then the example would look something like this?

const view = pose.views[0];
[ listener.positionX.value, listener.positionY.value, listener.positionZ.value, listener.forwardX.value, listener.forwardY.value, listener.forwardZ.value, listener.upX.value, listener.upY.value, listener.upZ.value ] = view

Will this work? The values in the view are a 4 by 4 array (16 values), and here we are looking for 9 values. This is what I mean by order of arguments.

Manishearth commented 4 years ago

Just use pose.transform.position and pose.transform.orientation with listener.setPosition() and listener.setOrientation(). In setPosition() make sure to normalize by dividing x, y, and z by w.

frastlin commented 4 years ago

Perfect, thank you! So the example would be:

// initialize the audio context
const AudioContext = window.AudioContext || window.webkitAudioContext;
const audioCtx = new AudioContext();

function onDrawFrame(timestamp, xrFrame) {
    // Do we have an active session?
    if (xrSession) {
        let listener = audioCtx.listener;

        let pose = xrFrame.getViewerPose(xrReferenceSpace);
        if (pose) {
            // Run imaginary 3D engine's simulation to step forward physics, PannerNodes, etc.
            scene.updateScene(timestamp, xrFrame);

            // Set the audio listener to face where the XR view is facing
            [ listener.forwardX.value, listener.forwardY.value, listener.forwardZ.value ] = pose.transform.orientation;
            // Set w to 1 as stated in the WebXR spec:
            const w = 1;
            // Set the audio listener to travel with the WebXR user position
            [ listener.positionX.value, listener.positionY.value, listener.positionZ.value ] = pose.transform.position.map(p=>p/w);

        }
        // Request the next animation callback
        xrSession.requestAnimationFrame(onDrawFrame);
    }
}
Manishearth commented 4 years ago

Oh if w is always 1 you don't need to divide, then.

Manishearth commented 4 years ago

But yeah, that's correct. You can use setPosition() and setOrientation() to do it atomically; different browsers handle checkpointing differently here.

frastlin commented 4 years ago

Those two functions are unfortunately deprecated.

klausw commented 4 years ago

[ listener.forwardX.value, listener.forwardY.value, listener.forwardZ.value ] = pose.transform.orientation;

That looks wrong. pose.transform.orientation is a quaternion which describes a 3D rotation, you can't just take its first three components and assign them to a direction vector. Instead, you'd need to take a forward vector, i.e. (0, 0, -1) assuming -z is forward, and apply the quaternion to it as a rotation operation.

Following https://en.wikipedia.org/wiki/Quaternions_and_spatial_rotation#Quaternion-derived_rotation_matrix , the result should be -1 (the Z component of the unrotated forward vector) times the third column of the rotation matrix.

fwd.x = -2 * (q.x*q.z + q.y * q.w);
fwd.y = -2 * (q.y*q.z - q.x * q.w);
fwd.z = 2 * (q.x * q.x + q.y * q.y) - 1;

This is untested and may be the wrong sign or transposed, but that's roughly how it should look, assuming the input quaternion is normalized. If you're using a JS framework, that should provide utility methods for such things.

Manishearth commented 4 years ago

Oh, I didn't realize WebAudio orientations weren't quaternions, my bad

klausw commented 4 years ago

If you don't want to deal with quaternions, using the matrix representation may be more useful. See https://immersive-web.github.io/webxr/#matrices for details.

The pose matrix's top left 3x3 elements provide unit column vectors in base space for the posed coordinate system's x/y/z axis directions, so you could use the negative of the third column directly as a forward vector corresponding to the -z direction:

let m = pose.transform.matrix;
let fwd = {x: -m[8], y: -m[9], z: -m[10]};
frastlin commented 4 years ago

So this would be the actual example:

// initialize the audio context
const AudioContext = window.AudioContext || window.webkitAudioContext;
const audioCtx = new AudioContext();

function onDrawFrame(timestamp, xrFrame) {
    // Do we have an active session?
    if (xrSession) {
        let listener = audioCtx.listener;

        let pose = xrFrame.getViewerPose(xrReferenceSpace);
        if (pose) {
            // Run imaginary 3D engine's simulation to step forward physics, PannerNodes, etc.
            scene.updateScene(timestamp, xrFrame);

            // Set the audio listener to face where the XR view is facing
            // First, convert from a quaternion to a forward vector: the top left 3x3 of
            // pose.transform.matrix holds unit column vectors in base space for the posed
            // coordinate system's x/y/z axes, so the negative of the third column is a
            // forward vector corresponding to the -z direction.
            const m = pose.transform.matrix;
            [ listener.forwardX.value, listener.forwardY.value, listener.forwardZ.value ] = [ -m[8], -m[9], -m[10] ];
            // Set the audio listener to travel with the WebXR user position
            [ listener.positionX.value, listener.positionY.value, listener.positionZ.value ] = [ pose.transform.position.x, pose.transform.position.y, pose.transform.position.z ];

        }
        // Request the next animation callback
        xrSession.requestAnimationFrame(onDrawFrame);
    }
}
klausw commented 4 years ago

I think you also need to set the listener "up" vector. Assuming you're using the usual convention that +Y is up, you can use the matrix's Y unit vector for that: (m[4], m[5], m[6])

Just for completeness, you could use (m[12], m[13], m[14]) for the position; it's the posed space's origin position in the base coordinate system. That should equal pose.transform.position's x/y/z, but it's an alternative if you don't want to mix matrix and decomposed values in a single snippet.

frastlin commented 4 years ago

OK, this looks as if it is pretty close to being an example we can put in explainer.md:

// initialize the audio context
const AudioContext = window.AudioContext || window.webkitAudioContext;
const audioCtx = new AudioContext();

function onDrawFrame(timestamp, xrFrame) {
    // Do we have an active session?
    if (xrSession) {
        let listener = audioCtx.listener;

        let pose = xrFrame.getViewerPose(xrReferenceSpace);
        if (pose) {
            // Run imaginary 3D engine's simulation to step forward physics, PannerNodes, etc.
            scene.updateScene(timestamp, xrFrame);

            // Set the audio listener to face where the XR view is facing.
            // pose.transform.orientation is a quaternion, not a forward vector, so it is
            // not used with Web Audio directly. Instead, the top left 3x3 of
            // pose.transform.matrix holds unit column vectors in base space for the posed
            // coordinate system's x/y/z axes, so the negative of the third column is a
            // forward vector corresponding to the -z direction.
            const m = pose.transform.matrix;
            // Set the forward-facing direction
            [ listener.forwardX.value, listener.forwardY.value, listener.forwardZ.value ] = [ -m[8], -m[9], -m[10] ];
            // Set the up vector (the direction the top of the listener's head is pointing)
            [ listener.upX.value, listener.upY.value, listener.upZ.value ] = [ m[4], m[5], m[6] ];
            // Set the audio listener to travel with the WebXR user position
            // (pose.transform.position equals [m[12], m[13], m[14]])
            [ listener.positionX.value, listener.positionY.value, listener.positionZ.value ] = [ m[12], m[13], m[14] ];

        }
        // Request the next animation callback
        xrSession.requestAnimationFrame(onDrawFrame);
    }
}
frastlin commented 4 years ago

OK, so the above example works for all the XRReferenceSpaces except for the basic "viewer". To make viewer work, we need to just remove the set position, as the position never moves with viewer. Will the pose matrix values be 0 in viewer? Or will the example need to check which XRReferenceSpace is being used? What caveats are there to setting the [0, 0, 0] listener pose at the native origin of WebXR?

Manishearth commented 4 years ago

Why do you want to use the viewer reference space? The whole point is to use a reference space whose origin is stationary, which is roughly true for all of them except "viewer". If your reference space isn't stationary you will have to keep updating the panner node coordinates to work in that space.

frastlin commented 4 years ago

I'm wondering if there needs to be a separate example for the viewer mode, or if the above will work for viewer as well.

Manishearth commented 4 years ago

What do you mean by "viewer mode"? Which reference space you pick is irrelevant provided you pick one which is roughly stationary, in all of these cases the code will have the same result provided you pick appropriate coordinates for all the panner nodes. In all of these cases the listener will be positioned where the viewer is, because you're using getViewerPose().

The "viewer" reference space isn't stationary, it follows the viewer, and getViewerPose(viewerSpace) returns usually constant values, making it useless for this.

frastlin commented 4 years ago

I submitted a PR with the example to explainer.md, please edit and comment: https://github.com/immersive-web/webxr/pull/930