immersive-web / webxr

Repository for the WebXR Device API Specification.
https://immersive-web.github.io/webxr/

WebXR / WebRTC Overlaps and Integrations #1295

Open tangobravo opened 1 year ago

tangobravo commented 1 year ago

With TPAC coming up, there was a suggestion the Immersive Web and WebRTC groups could meet to discuss potential integrations of XR data into the WebRTC spec. WebRTC has come up in response to various discussions over the years, so it might be worth narrowing down a bit on what is practically being discussed.

There’s the use-case outlined here for remote assistance.

However, on top of that, getUserMedia has often been suggested as the general way to support access to a camera stream for local processing, either on CPU or GPU, as an alternative to the current raw-camera-access draft spec for "synchronous" access to ARCore frames.

Media Streams in general don't seem great for per-frame metadata, as I mentioned in https://github.com/immersive-web/webxr-ar-module/issues/78 - requestVideoFrameCallback is best-effort (not reliable enough for WebXR), and MediaStreamTrackProcessor isn't widely supported and is aimed at CPU processing, which may involve costly readback / format conversion.
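For concreteness, here's a rough sketch of the two MediaStream frame-access paths mentioned above, assuming a plain camera track from getUserMedia (nothing XR-aware). Both APIs are real but support varies; MediaStreamTrackProcessor is Chromium-only at the time of writing.

```ts
// Obtain an ordinary camera track - no pose/intrinsics metadata anywhere here.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [track] = stream.getVideoTracks();

// Path 1: requestVideoFrameCallback - best-effort and tied to an HTMLVideoElement,
// so frames can be skipped and callback timing isn't guaranteed.
const video = document.createElement('video');
video.srcObject = stream;
video.muted = true;
await video.play();

const onFrame = (_now: number, metadata: any /* VideoFrameCallbackMetadata */) => {
  // metadata.mediaTime / metadata.captureTime carry timing information, but
  // there is no slot for camera pose or intrinsics.
  video.requestVideoFrameCallback(onFrame);
};
video.requestVideoFrameCallback(onFrame);

// Path 2: MediaStreamTrackProcessor - yields VideoFrame objects aimed at
// CPU/WebCodecs processing; GPU use may imply readback or format conversion.
const processor = new MediaStreamTrackProcessor({ track });
const reader = processor.readable.getReader();
for (;;) {
  const { value: frame, done } = await reader.read();
  if (done || !frame) break;
  // ... process the frame (e.g. copyTo an ArrayBuffer for CPU work) ...
  frame.close();
}
```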

Using ImageCapture.grabFrame() might be the best bet (although that is also Chromium-only at the moment, it doesn't feel like a massive spec to implement).

grabFrame returns a promise that resolves to an ImageBitmap of the next frame - which, despite the name, is just a handle to the data and can remain on the GPU. As the spec is currently written, there's no metadata associated with the ImageBitmap. I did notice this approach was previously suggested by @thetuvix in the raw-camera-access discussion here.
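A minimal sketch of that flow, using the existing ImageCapture API as specced today:

```ts
// ImageCapture wraps a video track; grabFrame() resolves to an ImageBitmap,
// which is just a handle to the pixel data and may stay GPU-side.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [track] = stream.getVideoTracks();
const capturer = new ImageCapture(track);

const bitmap: ImageBitmap = await capturer.grabFrame();

// The bitmap can be consumed without an explicit CPU readback, e.g.:
// gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, bitmap);
// ...but per the current spec there is no metadata (timestamp, intrinsics,
// pose) attached to the returned ImageBitmap.
bitmap.close();
```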

In that proposal from @thetuvix, the ImageBitmap from grabFrame() is passed into an XRSession to retrieve the pose data for that specific frame, which feels like a reasonable suggestion to me.
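Purely as an illustration of the shape of that proposal (the lookup function below is hypothetical - no such API exists in the WebXR spec today):

```ts
// Hypothetical: ask the session for the camera metadata matching a grabbed frame.
declare function hypotheticalGetCameraFrameData(
  session: XRSession,
  image: ImageBitmap
): Promise<{ pose: XRPose; intrinsics: Float32Array } | null>;

async function processFrame(session: XRSession, capturer: ImageCapture) {
  const image = await capturer.grabFrame();
  const frameData = await hypotheticalGetCameraFrameData(session, image);
  if (frameData) {
    // frameData.pose would be the camera's pose at capture time, expressed in
    // an agreed reference space, letting the app align CV results spatially.
  }
  image.close();
}
```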

One downside of requiring a running XRSession to access the pose data is that on mobile platforms immersive-ar would still be required, which has some practical issues. An inline-tracked session type combined with an XR MediaStream would address that, though, so it could certainly be tackled down the line.
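To make the separate-lifecycle idea concrete, a hypothetical sketch only - neither an "inline-tracked" session mode nor any XR-aware MediaStream integration exists in any spec today:

```ts
// Hypothetical: a pose-capable, non-immersive session alongside a camera stream.
const session = await navigator.xr!.requestSession(
  'inline-tracked' as XRSessionMode // hypothetical session mode
);
const stream = await navigator.mediaDevices.getUserMedia({
  video: { facingMode: 'environment' },
});
// The two objects would have independent lifecycles: the page could stop the
// camera track while keeping the session, or end the session while keeping the
// stream for, say, a WebRTC call.
```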

tangobravo commented 1 year ago

The main reasons I proposed a separate camera-ar session (here) were the frame-accurate metadata challenges, and the desire to avoid a separate MediaStream and XRSession with independent life-cycles.

Using grabFrame is a reasonable solution to accessing per-frame data, and I can see there are some potential advantages in separate lifecycles too:

There are also some general advantages of the getUserMedia / MediaStream approach:

So in general I'd be happy to throw my (admittedly insignificant!) weight behind this approach rather than camera-ar, if this is the direction that has implementor interest. Just to note a few potential drawbacks:

tangobravo commented 1 year ago

One item to raise, though I haven't thought about it much personally, is whether users would need a timestamp for the current camera image.

For spawning anchors into a tracked local space, just having the correct spatial relationship feels sufficient (ie the camera's pose would be the pose it was in when the frame was captured). I think that would allow correctly mapping poses from the camera into the local space without needing timestamps.
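A sketch of that anchor-spawning case, assuming some XRSpace (here `cameraSpace`, hypothetical) representing the camera's pose at the moment the frame was captured; createAnchor is from the WebXR Anchors Module:

```ts
// Spawn a world-locked anchor from a detection made in the camera image.
// No timestamp is needed: expressing the detection relative to the capture-time
// camera space is enough for the UA to place it correctly in the world.
function placeAnchorFromDetection(
  frame: XRFrame,
  cameraSpace: XRSpace,                // hypothetical space tied to the captured frame
  detectionInCamera: XRRigidTransform  // e.g. a marker pose computed by CV code
): Promise<XRAnchor> | undefined {
  return frame.createAnchor?.(detectionInCamera, cameraSpace);
}
```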

One area where timestamp knowledge is likely required is when there is also a tracked controller and you want to know where it appears in the camera image. I assume that, by default, using XRFrame getPose to request the pose of the controller in the camera space would just perform the spatial conversion, ie it would return the controller's predicted pose for rendering the next frame relative to the pose the camera was in when it captured the latest camera frame. That's not the same as asking where the controller was relative to the camera at the timestamp the image was captured.
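A small sketch of that distinction, again assuming a hypothetical `cameraSpace` for the captured frame:

```ts
// getPose() answers "where is the controller's predicted render-time pose,
// expressed relative to where the camera was when it captured the latest
// frame" - a purely spatial conversion, not a query at the capture timestamp.
function controllerInCameraImage(
  frame: XRFrame,
  controllerSpace: XRSpace, // e.g. inputSource.gripSpace
  cameraSpace: XRSpace      // hypothetical space for the captured camera frame
): XRPose | undefined {
  // If the app instead wants "where was the controller at the moment the image
  // was captured", it would need the frame's capture timestamp and historical
  // pose data - which today's API doesn't expose.
  return frame.getPose(controllerSpace, cameraSpace) ?? undefined;
}
```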

I can't think of a huge number of reasons people would want to know that, but it serves as a thought experiment for whether the difference in timestamps needs to be made explicit.

I would note that a separate camera-ar session would simplify timestamp management - everything reported in that session would relate to the latest camera frame time, and updates would be triggered by new camera frames. The other immersive session types are driven by the need to render a new frame, and all poses reported in them are predictions for the expected display time of the frame being rendered.