immersive-web / webxr

Repository for the WebXR Device API Specification.
https://immersive-web.github.io/webxr/

XRFrameOfReference needed for Projection Matrices on some backends? #412

Closed toji closed 6 years ago

toji commented 6 years ago

From the OpenXR SIGGRAPH presentation (https://www.khronos.org/assets/uploads/developers/library/2018-siggraph/04-OpenXR-SIGGRAPH_Aug2018.pdf) on Page 43 - "Viewport Projections"

In this slide we can see that OpenXR (as it stands now) has a function called xrGetViewportProjections that returns more-or-less everything that we report in our XRFrame. That includes the information necessary to build view and projection matrices. Because it's returning view information it's easy to surmise that the equivalent of an XRFrameOfReference (what OpenXR calls an XrSpace) needs to be provided to the function in order for it to return the right information. This also means, however, that projection matrices aren't available until the point the XRFrameOfReference is passed in.

This is a point of incompatibility with WebXR as it stands today. Currently the XRFrame reports an array of XRViews prior to any frame of reference being specified, which in turn contains projection matrices as a property. With a lot of native backends that seems fine, as projection parameters are frequently treated as static. Despite that, it's not too surprising to see some APIs may want to combine it into the same function call that gets the other space-dependent frame rendering data. In any case, we definitely want to address a known incompatibility with a native API.

There are two straightforward ways that I can see to address this:

Move view info into the call that takes a FrameOfRef

Probably would justify a method rename, but in essence we'd simply be moving the views array under what is today the getDevicePose call:

IDL

interface XRFrame {
  readonly attribute XRSession session;
  // No more views array here

  XRDevicePose? getDevicePose(XRFrameOfReference frameOfRef);
  XRInputPose? getInputPose(XRInputSource inputSource, XRFrameOfReference frameOfRef);
};

interface XRView {
  readonly attribute XREye eye;
  readonly attribute Float32Array projectionMatrix;
  readonly attribute Float32Array viewMatrix; // View matrix just becomes another property of the XRView
};

interface XRDevicePose {
  readonly attribute boolean emulatedPosition;
  readonly attribute Float32Array poseModelMatrix;
  readonly attribute FrozenArray<XRView> views; // Views are now here
};

And the render loop we describe in the explainer doesn't change too much:

function onDrawFrame(timestamp, xrFrame) {
  let pose = xrFrame.getDevicePose(xrFrameOfRef);
  gl.bindFramebuffer(gl.FRAMEBUFFER, xrSession.baseLayer.framebuffer);

  for (let view of pose.views) { // Previously was xrFrame.views
    let viewport = xrSession.baseLayer.getViewport(view);
    gl.viewport(viewport.x, viewport.y, viewport.width, viewport.height);
    drawScene(view.projectionMatrix, view.viewMatrix); // Previously was pose.getViewMatrix(view)
  }

  xrSession.requestAnimationFrame(onDrawFrame);
}

And that actually strikes me as a decent clarification of the API (modulo maybe the function name changing). It also has the nice side effect of reducing the number of JS function calls you're making each frame, which is a minor but notable plus from an efficiency standpoint.
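As a rough, runnable illustration of that shape (not a real WebXR implementation): the mock below assumes the Option 1 IDL above, where a single `getDevicePose` call returns the views along with their matrices. `makeMockFrame` and `collectDraws` are hypothetical helpers invented for this sketch. It also shows that, because the return type is nullable, the loop should skip drawing when no pose comes back.

```javascript
// Hypothetical stand-ins for the Option 1 shapes; nothing here talks to a real device.
const IDENTITY = new Float32Array([1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]);

function makeMockFrame(trackingAvailable) {
  return {
    // Mirrors `XRDevicePose? getDevicePose(XRFrameOfReference)`: may return null.
    getDevicePose(frameOfRef) {
      if (!trackingAvailable) return null;
      return {
        emulatedPosition: false,
        poseModelMatrix: IDENTITY,
        views: ['left', 'right'].map((eye) => ({
          eye,
          projectionMatrix: IDENTITY,
          viewMatrix: IDENTITY,
        })),
      };
    },
  };
}

// Builds one draw record per view; returns an empty list when no pose is available.
function collectDraws(xrFrame, frameOfRef) {
  const pose = xrFrame.getDevicePose(frameOfRef);
  if (!pose) return [];
  return pose.views.map((view) => ({
    eye: view.eye,
    projection: view.projectionMatrix,
    view: view.viewMatrix,
  }));
}

console.log(collectDraws(makeMockFrame(true), {}).length);  // 2 (one per eye)
console.log(collectDraws(makeMockFrame(false), {}).length); // 0 (pose was null)
```

Each view carries both matrices, so the per-view `getViewMatrix` call from the current API has no counterpart here.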

But, I'm also wondering if there's some sense in taking it a bit further?

Switch to a single-Frame-of-Reference-at-a-time system

This is more radical, but I'm starting to think it may be possible now, whereas previously I didn't. Given the changes proposed in #409, along with conversations on similar topics at the AR F2F, it's becoming apparent that Anchors (or similar mechanisms in the future) won't be something you query user poses/view matrices against, and instead will be something that you query for their position in relation to the larger Frame of Reference. That means that even in large-scale AR scenarios you're likely to only have one Frame of Reference at any given time.

So... what would it look like if we fully embraced that idea for the sake of API clarity?

IDL

partial interface XRSession {
  // The exact mechanics of this could be handled a few different ways.
  void setActiveFrameOfReference(XRFrameOfReference frameOfRef);
};

interface XRView {
  readonly attribute XREye eye;
  readonly attribute Float32Array projectionMatrix;
  readonly attribute Float32Array viewMatrix; // View matrix just becomes another property of the XRView
};

interface XRFrame {
  readonly attribute XRSession session;

  // Information from the XRDevicePose lives here now?
  readonly attribute FrozenArray<XRView> views;
  readonly attribute Float32Array poseModelMatrix;
  readonly attribute boolean emulatedPosition;

  // No call to get the pose, and the input pose call is simplified to just taking the XRInputSource.
  XRInputPose? getInputPose(XRInputSource inputSource);
};

And now the frame loop is even simpler, since we don't have to poll the pose with the FoR every frame.

function onDrawFrame(timestamp, xrFrame) {
  gl.bindFramebuffer(gl.FRAMEBUFFER, xrSession.baseLayer.framebuffer);

  for (let view of xrFrame.views) {
    let viewport = xrSession.baseLayer.getViewport(view);
    gl.viewport(viewport.x, viewport.y, viewport.width, viewport.height);
    drawScene(view.projectionMatrix, view.viewMatrix); // Previously was pose.getViewMatrix(view)
  }

  xrSession.requestAnimationFrame(onDrawFrame);
}

This would also potentially have some efficiency/performance benefits for the browser, since we now only have to sync pose data for the known active Frame of Reference, in addition to in the future anchors knowing in advance which Frame of Reference they will be queried against. That makes the IPC some browsers need to do a lot less messy. Not to mention this happens to map a bit better to how systems like Oculus' PC SDK work and would probably reduce developer confusion as well, especially when learning the API for the first time.

In order to make this approach practical we'd want to ensure that any legitimate cases where you'd want to use multiple frames of reference at once were adequately addressed, but at this point the only one I'm really aware of is if you're using Nell's proposed 'unbounded' frame of reference and want to transition between your current one and a newly recentered one to avoid precision issues. If support for that case can be built into the XRFrameOfReference itself, though, it may become a non-issue.

Another potential issue that we'd have to work around is what happens to in-flight events during a Frame of Reference switch? (Thanks @NellWaliczek for pointing this issue out.) For example, if there's an input event that's been triggered by the native backend (which, again, may be a different process than the JS code) but the FoR is changed before the event is fired, what should we do? We could say that the change doesn't take effect until the next XRFrame fires, which may lead to developers misunderstanding a few events here and there, or we could force the browser to re-compute the event poses prior to firing? (Some systems may make that easy, others may not.) Or we could limit when the Frame of Reference could be changed somehow? I don't have a clear answer, but I do think it's a tractable problem.
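To make the "doesn't take effect until the next XRFrame" option concrete, here's a minimal sketch under stated assumptions. Only `setActiveFrameOfReference` comes from the proposal above; `MockSession`, `beginFrame`, and `dispatchInputEvent` are invented for illustration, and the strings stand in for real XRFrameOfReference objects.

```javascript
// Hypothetical mock of a deferred frame-of-reference switch; not a real WebXR API.
class MockSession {
  constructor(initialFrameOfRef) {
    this.active = initialFrameOfRef;
    this.pending = null;
  }

  // Records the request; it does not take effect until the next frame begins.
  setActiveFrameOfReference(frameOfRef) {
    this.pending = frameOfRef;
  }

  // Called by the (mock) UA at the start of each frame: apply any pending switch,
  // then snapshot the active frame of reference into an immutable frame object.
  beginFrame() {
    if (this.pending !== null) {
      this.active = this.pending;
      this.pending = null;
    }
    return Object.freeze({ frameOfRef: this.active });
  }

  // Events fired between frames report poses in the frame of reference
  // that was active for the most recent frame.
  dispatchInputEvent() {
    return { frameOfRef: this.active };
  }
}

const session = new MockSession('stage');
const frame1 = session.beginFrame();
session.setActiveFrameOfReference('eye-level');
const inFlightEvent = session.dispatchInputEvent();
const frame2 = session.beginFrame();

console.log(frame1.frameOfRef, inFlightEvent.frameOfRef, frame2.frameOfRef);
// stage stage eye-level
```

Under this rule the in-flight event still reports in the old frame of reference, which is exactly the "misunderstanding a few events here and there" risk: code that has already switched would need to know the event predates the change.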

Would love to hear opinions on this, especially from people who have worked with AR systems closely. Thanks!

NellWaliczek commented 6 years ago

One quick note to add... with option 2, calling setActiveFrameOfReference() probably should not change the current XRFrame object's properties. This then has a bit of impact on how we think about the lifetime of XRFrame objects, as mentioned in issue #403.

lincolnfrog commented 6 years ago

I support the proposal option (2), as I think it simplifies several of the APIs in addition to this one. For example, with AR's hit-test we were thinking we would need to supply a frame-of-reference to some of the calls, and we could instead use the current frame-of-reference.

One question I would have is that #409 calls for this API:

XRDevicePose? getDevicePose(optional XRFrameOfReference frameOfReference);

The idea behind that parameter being optional is to support inline sessions, I believe - so how would one express this in the new design? Does not setting a frame-of-reference imply an inline (frame-of-reference-free) session?

NellWaliczek commented 6 years ago

Through the work I've been doing on #396, I've been thinking about how we would properly support diorama-style experiences within a 2D page being viewed in a headset. I'm generally leaning in the direction of needing to add a new XRFrameOfReference type for this purpose - XRDioramaFrameOfReference if you will. The interesting thing about that is that the number of views would be different, not to mention that the projection matrices would be dependent on the FoR as well. (This is all related to issue #272, btw.)

Given those things, I'm inclined towards option B. We'll need to be very crisp about when the active FoR can be changed and when those changes are applied, but I think that's manageable.

RafaelCintron commented 6 years ago

I'm generally not a fan of APIs that require setting global state at the right times with complicated rules about when things take effect. Web applications composed of multiple sub-components typically struggle with state leakage.

In the future, we will likely have video and HTML layers. I expect applications that use these will go for long periods of time without having "frame" callback opportunities to set global state so that subsequent events return expected numbers. Forcing them to do so seems wrong to me.

To me, option A seems more straightforward to explain to people and reason about.

thetuvix commented 6 years ago

Very interesting! Option 2 seems like it's worth exploring as an API simplification.

We'd talked separately about adding some sort of setDefaultFrameOfReference call that provides a default value for the frame parameter in various other calls, so going a bit further to actually have a stateful "active frame" and implicitly delivering all poses within it seems like it could improve API ergonomics. If we presume that the active frame of reference is a fundamental app decision to be made, and that multiple middleware components should all just respect the app's decision rather than themselves fight over that global value, perhaps we could avoid part of the gotcha @RafaelCintron points out above. The key remaining gotcha will be the details about when you can call setActiveFrameOfReference and when it takes effect, to ensure we don't end up with net complexity or race conditions in the API.

@toji:

This would also potentially have some efficiency/performance benefits for the browser, since we now only have to sync pose data for the known active Frame of Reference, in addition to in the future anchors knowing in advance which Frame of Reference they will be queried against.

While there may be developer simplicity benefits here, I don't think this buys us out of pose syncs for inactive frame instances that the app is still keeping alive. A developer can still hold onto two XRFrameOfReference instances and call frame1.coordinateSystem.getTransformTo(frame2.coordinateSystem) (or frame1.getTransformTo(frame2) if we move back to the "is-a" design). This is the way that an app that is generally using an "unbounded" frame as its base coordinate system can reason about the "bounded" origin as a model transform for some subset of stage-bound content. (While an ever-shifting bounded origin isn't convenient for simple room-scale VR apps, we shouldn't make it harder to get the room origin's per-frame pose than to get per-frame poses for other plane/mesh/free-space anchors.) For the UA, this should hopefully just be another 2 poses to transfer around, so it shouldn't be a burden for a UA that would otherwise support transferring an arbitrary number of anchor poses.

Note that full correctness in the current API design requires a UA to support relating the position of two known anchors in a disjoint map fragment to one another (anchor1.getTransformTo(anchor2)), even if they can't be related to the user's position at the present moment. Therefore, a UA that internally pre-fetches all anchors as poses within the active frame of reference may not be able to fulfill getTransformTo calls in all cases. However, given the very limited anchor and hit-test support we are considering for WebXR 1.0, we could consider skipping support for anchor-to-anchor getTransformTo calls for now, replacing it with a similar poseModelMatrix approach on anything locatable such as plane/face/mesh anchors.

If we explore that path, we should see how we'd then extend the WebXR API to support the scenarios discussed in #384. For example, if we consider planes to always be static, XRPlaneAnchor.poseModelMatrix would be a nice simple pattern for developers. However, we know that other detected anchors like faces will move (and some platforms may choose to track dynamic planes). Would we then introduce XRFaceAnchor.getPose(XRFrame), which has the reverse pattern to our other time-indexed poses? Or would we then need to introduce XRFrame.getFacePose(XRFaceAnchor)? The latter does not feel like a sustainable pattern - over time, XRFrame will end up containing .get...Pose methods for every other dynamic entity that needs to be located.

An option there could be to double-down on an XRFrame.getPose(coordinateSystem, baseCoordinateSystem?) approach for all poses, including view poses, input poses, anchor poses and any other future poses. This approach may give us the benefits of all the approaches above, and can result in simple code if we move back to the "is-a" pattern for XRCoordinateSystem:

Perhaps then we'd rename XRCoordinateSystem to XRPosable or XREntity?
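As a rough sketch only of the unified getPose(coordinateSystem, baseCoordinateSystem?) idea, assuming the "is-a" pattern where every locatable thing is itself a coordinate system: `makeFrame` and the `origin` field are hypothetical, and poses are reduced to translation-only [x, y, z] offsets from a shared mock tracking origin to keep the math obvious. A real implementation would return full transforms.

```javascript
// Hypothetical mock of a single getPose entry point; not a real WebXR API.
function makeFrame(defaultBase) {
  return {
    // Returns the position of `entity` expressed relative to `base`,
    // falling back to the frame's default base when `base` is omitted.
    getPose(entity, base = defaultBase) {
      return entity.origin.map((value, i) => value - base.origin[i]);
    },
  };
}

// Under the "is-a" pattern, any of these can serve as either argument.
const stage = { origin: [0, 0, 0] };
const head = { origin: [0, 1.6, 0] }; // viewer, 1.6 m above the stage origin
const controller = { origin: [0.25, 1.0, -0.5] };
const anchor = { origin: [2, 0, -3] };

const frame = makeFrame(stage);
console.log(frame.getPose(head));               // [ 0, 1.6, 0 ]
console.log(frame.getPose(controller, anchor)); // [ -1.75, 1, 2.5 ]
```

View poses, input poses and anchor poses all go through the same call here, so XRFrame would not need a new .get...Pose method for each kind of entity.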

Lots to think about!

thetuvix commented 6 years ago

One related note that this change calls to mind - we should generally aim to avoid the term Device or DevicePose when referring to the frame's primary view pose. For a zSpace-style monitor, or for a CAVE system where the head tracker is external to the head itself, the view origin logically represented by getDevicePose() represents the location of the head, not any particular device.

One thing I like about option 2 is that it naturally removes the entire notion of DevicePose from the API. If we ended up going with option 1, we could make a similar change to refer to ViewPose or such (bikeshedding TBD) that still accomplishes that goal:

interface XRFrame {
  readonly attribute XRSession session;
  // No more views array here

  XRViewPose? getViewPose(XRFrameOfReference frameOfRef);
  XRInputPose? getInputPose(XRInputSource inputSource, XRFrameOfReference frameOfRef);
};

interface XRView {
  readonly attribute XREye eye;
  readonly attribute Float32Array projectionMatrix;
  readonly attribute Float32Array viewMatrix; // View matrix just becomes another property of the XRView
};

interface XRViewPose {
  readonly attribute boolean emulatedPosition;
  readonly attribute Float32Array poseModelMatrix;
  readonly attribute FrozenArray<XRView> views; // Views are now here
};

thetuvix commented 6 years ago

One interesting gotcha with option 2 is when we do get to multiple compositor layers. If a secondary WebGL layer is rendering a controller, it may wish to reproject based on predicted controller motion, rather than head motion. If so, it may want to render that second layer relative to a different root coordinate frame in some way. If we push too strongly on having a single active frame of reference without allowing apps to override that for a given API call, we may block off some paths for rendering multiple layers.

This leans me back towards preferring a "default frame of reference" model where functions do take an XRCoordinateSystem/XRFrameOfReference but it can often be omitted to use your base frame. This gives more flexibility than a "single active frame of reference" model where apps are always handed poses in the active frame, with no chance to override. Otherwise, we may see apps juggling a mutable "current active" global frame of reference to render different layers of their scene, which seems messier.
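A minimal sketch of what the "default frame of reference" model could look like, under stated assumptions: `makeSession`, `createFrame`, and `defaultFrameOfReference` are hypothetical names, and the strings stand in for real XRCoordinateSystem/XRFrameOfReference objects. None of this is an agreed API shape.

```javascript
// Hypothetical mock: a session-level default with a per-call override.
function makeSession(defaultFrameOfReference) {
  return {
    defaultFrameOfReference,
    createFrame() {
      const session = this;
      return {
        // Omitting the argument uses the session default; passing one overrides
        // it for this call only and leaves the default untouched.
        getViewPose(frameOfRef = session.defaultFrameOfReference) {
          return { relativeTo: frameOfRef };
        },
      };
    },
  };
}

const session = makeSession('stage');
const frame = session.createFrame();

console.log(frame.getViewPose().relativeTo);             // stage
console.log(frame.getViewPose('controller').relativeTo); // controller
console.log(session.defaultFrameOfReference);            // stage
```

A secondary layer can ask for poses relative to a different root for a single call, and nothing global has to be mutated and restored around it.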

NellWaliczek commented 6 years ago

Fixed by #422