immersive-web / webxr

Repository for the WebXR Device API Specification.
https://immersive-web.github.io/webxr/

WebXR / WebRTC Overlaps and Integrations #1295

Open tangobravo opened 1 year ago

tangobravo commented 1 year ago

With TPAC coming up, there was a suggestion the Immersive Web and WebRTC groups could meet to discuss potential integrations of XR data into the WebRTC spec. WebRTC has come up in response to various discussions over the years, so it might be worth narrowing down a bit on what is practically being discussed.

There’s the use-case outlined here for remote assistance.

However, on top of that, getUserMedia has often been suggested as the general way to support access to a camera stream for local processing, either on CPU or GPU, as an alternative to the current raw-camera-access draft spec for "synchronous" access to ARCore frames.

Media Streams in general don't seem great for per-frame metadata, as I mentioned in https://github.com/immersive-web/webxr-ar-module/issues/78 - requestVideoFrameCallback is best-effort (not reliable enough for WebXR), and MediaStreamTrackProcessor isn't widely supported and is aimed at CPU processing, which may involve costly readback / format conversion.
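For concreteness, here's a rough sketch of the two MediaStream frame-access paths mentioned above, assuming a plain camera track from getUserMedia (nothing XR-aware). Both APIs are real but support varies; MediaStreamTrackProcessor is Chromium-only at the time of writing.

```ts
// Obtain an ordinary camera track - no pose/intrinsics metadata anywhere here.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [track] = stream.getVideoTracks();

// Path 1: requestVideoFrameCallback - best-effort and tied to an HTMLVideoElement,
// so frames can be skipped and callback timing isn't guaranteed.
const video = document.createElement('video');
video.srcObject = stream;
video.muted = true;
await video.play();

const onFrame = (_now: number, metadata: any /* VideoFrameCallbackMetadata */) => {
  // metadata.mediaTime / metadata.captureTime carry timing information, but
  // there is no slot for camera pose or intrinsics.
  video.requestVideoFrameCallback(onFrame);
};
video.requestVideoFrameCallback(onFrame);

// Path 2: MediaStreamTrackProcessor - yields VideoFrame objects aimed at
// CPU/WebCodecs processing; GPU use may imply readback or format conversion.
const processor = new MediaStreamTrackProcessor({ track });
const reader = processor.readable.getReader();
for (;;) {
  const { value: frame, done } = await reader.read();
  if (done || !frame) break;
  // ... process the frame (e.g. copyTo an ArrayBuffer for CPU work) ...
  frame.close();
}
```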

Using ImageCapture.grabFrame() might be the best bet (although that is also Chromium-only at the moment, it doesn't feel like a massive spec to implement).

grabFrame returns a promise that resolves to an ImageBitmap of the next frame - which, despite the name, is just a handle to the data and can remain on the GPU. As the spec is currently written, there's no metadata associated with the ImageBitmap. I did notice this approach was previously suggested by @thetuvix in the raw-camera-access discussion here.
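A minimal sketch of that flow, using the existing ImageCapture API as specced today:

```ts
// ImageCapture wraps a video track; grabFrame() resolves to an ImageBitmap,
// which is just a handle to the pixel data and may stay GPU-side.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [track] = stream.getVideoTracks();
const capturer = new ImageCapture(track);

const bitmap: ImageBitmap = await capturer.grabFrame();

// The bitmap can be consumed without an explicit CPU readback, e.g.:
// gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, bitmap);
// ...but per the current spec there is no metadata (timestamp, intrinsics,
// pose) attached to the returned ImageBitmap.
bitmap.close();
```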

In that proposal from @thetuvix, the ImageBitmap from grabFrame() is passed into an XRSession to retrieve the pose data for that specific frame, which feels like a reasonable suggestion to me.
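Purely as an illustration of the shape of that proposal (the lookup function below is hypothetical - no such API exists in the WebXR spec today):

```ts
// Hypothetical: ask the session for the camera metadata matching a grabbed frame.
declare function hypotheticalGetCameraFrameData(
  session: XRSession,
  image: ImageBitmap
): Promise<{ pose: XRPose; intrinsics: Float32Array } | null>;

async function processFrame(session: XRSession, capturer: ImageCapture) {
  const image = await capturer.grabFrame();
  const frameData = await hypotheticalGetCameraFrameData(session, image);
  if (frameData) {
    // frameData.pose would be the camera's pose at capture time, expressed in
    // an agreed reference space, letting the app align CV results spatially.
  }
  image.close();
}
```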

One downside of requiring a running XRSession to access the pose data is that on mobile platforms immersive-ar would still be required, which has some practical issues. An inline-tracked session type combined with an XR MediaStream would address that, though, so it could certainly be tackled down the line.
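To make the separate-lifecycle idea concrete, a hypothetical sketch only - neither an "inline-tracked" session mode nor any XR-aware MediaStream integration exists in any spec today:

```ts
// Hypothetical: a pose-capable, non-immersive session alongside a camera stream.
const session = await navigator.xr!.requestSession(
  'inline-tracked' as XRSessionMode // hypothetical session mode
);
const stream = await navigator.mediaDevices.getUserMedia({
  video: { facingMode: 'environment' },
});
// The two objects would have independent lifecycles: the page could stop the
// camera track while keeping the session, or end the session while keeping the
// stream for, say, a WebRTC call.
```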

tangobravo commented 1 year ago

The main reasons I proposed a separate camera-ar session (here) were the frame-accurate metadata challenges, and the desire to avoid a separate MediaStream and XRSession with independent life-cycles.

Using grabFrame is a reasonable solution to accessing per-frame data, and I can see there are some potential advantages in separate lifecycles too:

There are also some general advantages of the getUserMedia / MediaStream approach:

So in general I'd be happy to throw my (admittedly insignificant!) weight behind this approach rather than camera-ar, if this is the direction that has implementor interest. Just to note a few potential drawbacks:

tangobravo commented 1 year ago

One item to raise, though I haven't thought about it much personally, is whether users would need a timestamp for the current camera image.

For spawning anchors into a tracked local space, just having the correct spatial relationship feels sufficient (ie the camera's pose would be the pose it was in when the frame was captured). I think that would allow correctly mapping poses from the camera into the local space without needing timestamps.
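A sketch of that anchor-spawning case, assuming some XRSpace (here `cameraSpace`, hypothetical) representing the camera's pose at the moment the frame was captured; createAnchor is from the WebXR Anchors Module:

```ts
// Spawn a world-locked anchor from a detection made in the camera image.
// No timestamp is needed: expressing the detection relative to the capture-time
// camera space is enough for the UA to place it correctly in the world.
function placeAnchorFromDetection(
  frame: XRFrame,
  cameraSpace: XRSpace,                // hypothetical space tied to the captured frame
  detectionInCamera: XRRigidTransform  // e.g. a marker pose computed by CV code
): Promise<XRAnchor> | undefined {
  return frame.createAnchor?.(detectionInCamera, cameraSpace);
}
```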

One area where timestamp knowledge is likely required is when there is also a tracked controller and you want to know where it appears in the camera image. I assume that, by default, using XRFrame getPose to request the pose of the controller in the camera space would just perform the spatial conversion, ie it would return the controller's predicted pose for rendering the next frame relative to the pose the camera was in when it captured the latest camera frame. That's not the same as asking where the controller was relative to the camera at the timestamp the image was captured.
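A small sketch of that distinction, again assuming a hypothetical `cameraSpace` for the captured frame:

```ts
// getPose() answers "where is the controller's predicted render-time pose,
// expressed relative to where the camera was when it captured the latest
// frame" - a purely spatial conversion, not a query at the capture timestamp.
function controllerInCameraImage(
  frame: XRFrame,
  controllerSpace: XRSpace, // e.g. inputSource.gripSpace
  cameraSpace: XRSpace      // hypothetical space for the captured camera frame
): XRPose | undefined {
  // If the app instead wants "where was the controller at the moment the image
  // was captured", it would need the frame's capture timestamp and historical
  // pose data - which today's API doesn't expose.
  return frame.getPose(controllerSpace, cameraSpace) ?? undefined;
}
```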

I can't think of a huge number of reasons people would want to know that, but it serves as a thought experiment for whether the difference in timestamps needs to be made explicit.

I would note that a separate camera-ar session would simplify timestamp management - everything reported in that session would relate to the latest camera frame time, and updates would be triggered by new camera frames. The other immersive session types are driven by the need to render a new frame, and all poses reported in them are predictions for the expected display time of the frame being rendered.