immersive-web / webxr-ar-module

Repository for the WebXR Augmented Reality Module
https://immersive-web.github.io/webxr-ar-module

Expose handheld AR as a camera stream with pose metadata #78

Open tangobravo opened 2 years ago

tangobravo commented 2 years ago

There are a couple of different ways to view ARCore / ARKit sessions:

1. A camera stream with additional pose and anchor metadata along with each frame
2. An implementation detail for an abstract "AR" device

The native ARKit and ARCore APIs are closer to the former, and WebXR's immersive-ar session is more like the latter. The current immersive-ar session is a great fit for device-agnostic content, but doesn't cover all handheld AR use cases, as I argued in general terms in immersive-web/webxr-ar-module#77.

Although adding extensions to the current WebXR approach (such as DOM Overlay and Raw Camera Access) would cover some more use cases, an alternative API that is closer to (1) is perhaps the more straightforward way to enable all of the native app ARCore and ARKit use cases on the web.

One particular example unlikely to be well served by the device-agnostic + extensions approach is the discussion of "deferring camera output" in https://github.com/immersive-web/raw-camera-access/issues/7 - the WebXR immersive sessions are primarily motivated by controlling presentation of content and minimizing latency, and it would feel odd to add an extension that effectively side-steps those parts.

Exposing native ARCore / ARKit data as a camera stream would make it much easier to polyfill as a fallback and integrate well with the rest of the web platform on mobile.

There are a couple of options for how to go about this:

Directly as a MediaStream from getUserMedia

ARKit and ARCore both leverage one of the standard device cameras. They expose a different set of constraints and capabilities in terms of resolution and frame rate, so it probably makes sense that they would be treated as distinct cameras if they're exposed via getUserMedia() directly.
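As a rough illustration of what that could look like - purely a sketch, since no browser currently exposes the AR-tracked camera as a separate capture device, and the device id below is hypothetical:

```js
// Hypothetical: "arCameraDeviceId" would identify the AR-tracked camera if it
// were exposed as just another videoinput from enumerateDevices(), with its
// own (more limited) resolution / frame-rate capabilities.
const stream = await navigator.mediaDevices.getUserMedia({
  video: {
    deviceId: { exact: arCameraDeviceId }, // hypothetical AR camera device
    width: { ideal: 1280 },
    height: { ideal: 720 },
    frameRate: { ideal: 30 }, // ARCore / ARKit typically run the camera at 30fps
  },
});
```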

The main problem is that getting frame-synced metadata isn't particularly well specified on the web for MediaStreams.

There's requestVideoFrameCallback, which passes metadata to its callback, but it is only best-effort in terms of being frame-accurate.
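For reference, the shape of that API today (the callback and its timing metadata are real; any per-frame pose field would be a hypothetical addition):

```js
// Play the camera MediaStream from the sketch above in a <video> element and
// register a per-frame callback.
const video = document.createElement('video');
video.srcObject = stream;
await video.play();

function onFrame(now, metadata) {
  // Real fields include metadata.mediaTime, metadata.presentedFrames and
  // metadata.expectedDisplayTime - but delivery is best-effort, so frames can
  // be skipped under load and pose data could silently fall out of sync.
  video.requestVideoFrameCallback(onFrame);
}
video.requestVideoFrameCallback(onFrame);
```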

MediaStreamTrackProcessor is aimed at use cases requiring CPU-side access to the frame data, isn't very widely supported, and has quite a few outstanding spec issues where consensus is yet to be reached.
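A minimal sketch of that shape, for comparison (Chromium-only at the time of writing):

```js
// MediaStreamTrackProcessor exposes the track as a ReadableStream of
// VideoFrame objects, geared towards CPU-side processing.
const [track] = stream.getVideoTracks();
const reader = new MediaStreamTrackProcessor({ track }).readable.getReader();

for (;;) {
  const { value: frame, done } = await reader.read();
  if (done) break;
  // frame is a VideoFrame; it must be closed promptly or capture will stall.
  frame.close();
}
```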

ImageCapture.grabFrame() returns an ImageBitmap of the next frame, but as currently specified it doesn't carry any metadata. I did notice this has previously been suggested by @thetuvix in the raw-camera-access discussion here.
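For completeness, that shape looks like this today:

```js
// grabFrame() resolves with an ImageBitmap of an upcoming frame, but as
// specified there is no way to know which pose that frame corresponds to.
const imageCapture = new ImageCapture(stream.getVideoTracks()[0]);
const bitmap = await imageCapture.grabFrame();
```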

As a new WebXR session type

Given none of the existing MediaStream APIs are a great fit, it is perhaps best to design a new WebXR API for this.

As an initial suggestion, how about a camera-ar session type? It would vend XRCameraFrame objects containing some handle to the camera frame, probably just an ImageBitmap, along with camera model information. Poses would be obtained via the usual XRSpace mechanisms.

This session type wouldn't explicitly need to be hooked up to a WebGL canvas, as the whole mechanism of presentation is entirely up to the site. The common case would be to call texImage2D(imageBitmap) and render via WebGL into a canvas.
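To make the shape concrete, here's a hypothetical sketch - none of these names exist today, they just illustrate the proposal:

```js
// Everything below is hypothetical: 'camera-ar', XRCameraFrame, frame.cameraFrame
// and its fields are names invented to illustrate the idea, not existing API.
const session = await navigator.xr.requestSession('camera-ar');
const localSpace = await session.requestReferenceSpace('local');

session.requestAnimationFrame(function onXRFrame(time, frame) {
  const cameraFrame = frame.cameraFrame;         // hypothetical XRCameraFrame
  const imageBitmap = cameraFrame.imageBitmap;   // handle to the camera image
  const intrinsics = cameraFrame.camera;         // camera model information
  const viewerPose = frame.getViewerPose(localSpace); // usual XRSpace mechanism

  // Presentation is entirely up to the site, e.g. texImage2D(imageBitmap) into
  // an ordinary WebGL canvas, compositing 3D content on top.
  drawFrame(imageBitmap, intrinsics, viewerPose); // site-defined rendering

  session.requestAnimationFrame(onXRFrame);
});
```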

One area to consider is interaction with hit testing - XRTransientInputHitTestSource wouldn't be supported, as the session doesn't have an associated canvas. An XRHitTestSource from, say, a viewer reference space should still be fine though. One of the real-world geometry proposals (plane detection being the most relevant on mobile) would also allow the site to do synchronous hit testing.
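Continuing the sketch above, that could use the existing WebXR Hit Test module as-is; the only hypothetical part is running it inside a camera-ar session rather than immersive-ar:

```js
// Requires the 'hit-test' feature to have been requested with the session.
const viewerSpace = await session.requestReferenceSpace('viewer');
const hitTestSource = await session.requestHitTestSource({ space: viewerSpace });

session.requestAnimationFrame((time, frame) => {
  const results = frame.getHitTestResults(hitTestSource);
  if (results.length > 0) {
    // Pose of the closest hit, expressed in the local reference space.
    const hitPose = results[0].getPose(localSpace);
    // Place content at hitPose.transform via the site's own rendering.
  }
});
```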

I think overall this approach is significantly cleaner than the current proposal in https://github.com/immersive-web/raw-camera-access. It should also work for both handheld AR and headsets, which might want to run an immersive-ar session at 90 FPS and a camera-ar one at 30, for example.

The fact that it is a closer match to the underlying native mobile APIs means many more app-based ARKit and ARCore experiences could likely be supported on the web, even those wanting things such as deferred presentation of the camera frames.

Obviously this approach gives the site full camera access, so the permission policy would need to match that of getUserMedia() or a WebXR immersive-ar session with the "camera-access" feature requested.

In practice, for us at Zappar, most of our projects are likely to require full camera access, as they either want to offer a smooth UX for client-side capturing and sharing of photos and videos, or they want to additionally leverage our other tracking types in WebAssembly. Simply modelling the native tracking session as a camera stream and leaving it up to the user (and libraries) to handle presentation feels like the most straightforward solution to me.

haywirez commented 2 years ago

Hey all, maybe I'm misunderstanding the proposal, but shouldn't the camera feed primarily be exposed to the WebGL/WebGPU shader context as a (cubemap/sphere) texture or a video texture? A JavaScript-side API is useful, but too slow for drawing.

tangobravo commented 2 years ago

My proposal was to expose the frame as an ImageBitmap on the JS side. ImageBitmap is already part of the web platform and, despite the somewhat confusing name, is just a handle to an image rather than implying the data itself is on the CPU side. https://developer.mozilla.org/en-US/docs/Web/API/ImageBitmap - the description strongly suggests the data is already somewhere GPU-accessible, and that texImage is a primary use case (or 2D canvas drawing, but many implementations use the GPU for that too).

In an implementation I'd assume this would already be backed by a texture, so gl.texImage2D(imageBitmap) could be essentially free.
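For example, assuming a canvas and an imageBitmap from the session, the upload could be as simple as:

```js
// Upload the ImageBitmap to a WebGL texture; if the bitmap is already
// GPU-backed in the implementation, this can be close to free.
const gl = canvas.getContext('webgl2');
const texture = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, texture);
gl.pixelStorei(gl.UNPACK_FLIP_Y_WEBGL, true); // flip to WebGL's bottom-up convention
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, imageBitmap);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.LINEAR);
// Then draw a full-screen quad sampling this texture and composite 3D content on top.
```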