immersive-web / raw-camera-access

Spec draft: https://immersive-web.github.io/raw-camera-access/
Repository for experimentation around exposing raw camera access through the WebXR Device API.
Feature leads: Piotr Bialecki, Alex Turner, Nicholas Butko

Explainers should lead with sample code, should not include IDL #14

Open slightlyoff opened 2 years ago

slightlyoff commented 2 years ago

Via the I2S thread, I started reading the explainer, and it seems like several best practices are missing and design directions are not enunciated:

alcooper91 commented 2 years ago

Thanks! Piotr is out for a couple of weeks, so I'm posting the planned explainer changes here as well to help facilitate this conversation while I work on the PR and figure out whether there's anything I need to do to merge them while he's out.

  • the end-user problem/benefit is not front-and-center

Edit: The PR below should have addressed this.

  • the explainer includes proposed IDL, which should be relegated to a spec draft

Easy enough to remove. I feel like in the past this was a required portion of explainers, but maybe just for early-stage ones?

  • the explainer does not identify problems related to XR camera access that are not going to be addressed by the design (non-goals)

Edit: The PR below should've addressed this.

  • the explainer is hand-wavey about why this is not being proposed as an extension to getUserMedia()

This is not being proposed as an extension to getUserMedia because it is actually a goal of the spec to deliver the camera image alongside the rest of the WebXR frame data. Going through getUserMedia() would not allow this synchronization.

  • why does the explainer not provide a way to also upload "raw" image data to a WebGL texture?

I'm not sure I understand the question here; the API shape already exposes the camera image directly as a WebGL texture.
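For concreteness, here is a minimal sketch of the intended usage, based on the shape in the current spec draft (`XRView.camera`, `XRWebGLBinding.getCameraImage()`, and the `"camera-access"` feature descriptor); treat it as illustrative rather than normative:

```js
// Minimal sketch: per-frame camera access synchronized with WebXR rendering.
const session = await navigator.xr.requestSession("immersive-ar", {
  requiredFeatures: ["camera-access"],
});

const canvas = document.createElement("canvas");
const gl = canvas.getContext("webgl2", { xrCompatible: true });
const glBinding = new XRWebGLBinding(session, gl);
session.updateRenderState({ baseLayer: new XRWebGLLayer(session, gl) });
const refSpace = await session.requestReferenceSpace("local");

function onXRFrame(time, frame) {
  frame.session.requestAnimationFrame(onXRFrame);
  const pose = frame.getViewerPose(refSpace);
  if (!pose) return;

  for (const view of pose.views) {
    if (view.camera) {
      // The camera image for this specific XRFrame, already aligned with the
      // view's pose. It is an ordinary WebGLTexture on the same GL context.
      const cameraTexture = glBinding.getCameraImage(view.camera);
      // ...sample cameraTexture in a shader alongside normal WebXR rendering.
    }
  }
}
session.requestAnimationFrame(onXRFrame);
```

Because the texture is handed out inside the rAF callback for a specific `XRFrame`, it is inherently synchronized with that frame's poses, which is the property a separate getUserMedia() stream cannot guarantee.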

  • for a "raw" image, the API seems to lack color space controls/hints/output, and does not document the format of the resulting texture. Why not?

Color space is an unresolved issue for WebXR overall, see for example https://github.com/immersive-web/webxr/issues/1140 and https://github.com/immersive-web/webxr/issues/988

  • how does this feature handle multiple cameras? Stereo cameras? Why are views and cameras always linked 1:1?

I’ve touched on this a bit in my proposed language for the new non-goals section, but these are out of scope for this particular API. Future alternative APIs may support multiple cameras that aren't aligned with a single view, but that requires applications to handle asynchronous poses and has different privacy concerns; for example, the user would need to be aware that cameras may be capturing parts of the environment that are not visible in the XR view.

  • how will this work in the context of Offscreen Canvas?

WebXR immersive sessions can use an XRWebGLRenderingContext created from an OffscreenCanvas; see for example this sample and its source. In that case, both the opaque framebuffer used for output and the raw camera image GL texture would correspond to the OffscreenCanvas-based XRWebGLRenderingContext.
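As a rough sketch under the same assumptions as the earlier snippet, only the source of the GL context changes:

```js
// Sketch: the XR-compatible GL context is created from an OffscreenCanvas
// rather than a regular <canvas> element.
const offscreen = new OffscreenCanvas(1, 1);
const gl = offscreen.getContext("webgl2", { xrCompatible: true });

const session = await navigator.xr.requestSession("immersive-ar", {
  requiredFeatures: ["camera-access"],
});
session.updateRenderState({ baseLayer: new XRWebGLLayer(session, gl) });
const glBinding = new XRWebGLBinding(session, gl);

// Both the opaque framebuffer used for output and any texture returned by
// glBinding.getCameraImage(view.camera) belong to this OffscreenCanvas-based
// context, and are used in the same per-frame loop shown earlier.
```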

  • the considered alternatives section only contains a single other design, when we can easily imagine many different designs

I think a lot of the alternatives mentioned fall into some non-goals that will now explicitly be called out (mainly that this API is explicitly targeting smartphone AR which has a 1:1 Frame:Camera relationship, and that a broader API is needed for Headsets). Please see the discussion at the end of the explainer for more context. Even with plans for this more general API, there’s still developer interest in a simplified API for this specific 1:1 synchronized use case.

  • the spec doc does not link to the explainer
  • the explainer does not link to the spec

These two are simple enough edits to make, though not something I've seen in other specs/explainers in the past.

Edit: There didn't appear to be an obvious place to add the link back to the explainer; I did what seemed best, but if you have a preferred mechanism, please let me know.

slightlyoff commented 2 years ago

Hey Alex,

Thanks for the updates to the explainer and the responses here.

So I'm pretty comfortable with how XR rendering works as I helped review the original WebVR APIs while serving on the TAG. With that background, I am a bit surprised that the Explainer doesn't do a fuller job of explaining why we don't use getUserMedia() via extensions. E.g., getUserMedia() already has a system for camera toggling, and color space information from those cameras will also need to be sorted at some point.

At an architectural level, those solutions should have a uniform API, if not an identical mechanism. E.g., do we need to design a new delegation mechanism for permission requests here? Are these cameras iterable via enumerateDevices()? If they are, and the user selects one via gUM but also invokes it via this API, what happens?

My question about the GL texture was badly stated but boils down to what the internal format is. Is it "raw" (something new)? Is it ImageData? What makes it "raw"?

All of these questions suggest that we want to keep types, API surface, and other attributes in close alignment now and in the future, so a good design goal will be to at least outline how that would be done. So a fuller discussion of considered alternatives would be expected to speak to this, and I'm interested in seeing this doc grow in that way.

On moving IDL out of explainers, this comes back to long-standing guidance, both from my time as Chrome's Standards TL and the TAG's Explainer Explainer. The audience for an explainer is, primarily, a developer that's trying to think about how this design can work in their system, rather than how the design will be expressed in our system.

Crosslinks between documents help readers find the details that don't naturally belong in one document or the other. For example, IDL should not be in explainers, and non-normative discussion can often feel out of place in a spec.

alcooper91 commented 2 years ago

So I'm pretty comfortable with how XR rendering works as I helped review the original WebVR APIs while serving on the TAG. With that background, I am a bit surprised that the Explainer doesn't do a fuller job of explaining why we don't use getUserMedia() via extensions. E.g., getUserMedia() already has a system for camera toggling, and color space information from those cameras will also need to be sorted at some point.

I’m not sure I understand the question/ask here. The primary reason is that we want the camera frame to be directly associated and synchronized with the relevant WebXR data. Exposing this through getUserMedia() would require massive changes (likely exposing XR data through the getUserMedia streams, which would be a huge developer burden) and still would not guarantee this synchronization. I feel the explainer describes this reasonably well, albeit not in explicit detail, and I'm not sure how the approaches you suggest help with that guarantee. We don't really care about camera toggling, since we can't change the camera during an AR session, and color space information is an open question for WebXR that we don't want to expose yet. Adding a simple property to the existing WebXR data is also more ergonomic for developers.

At an architectural level, those solutions should have a uniform API, if not an identical mechanism. E.g., do we need to design a new delegation mechanism for permission requests here? Are these cameras iterable via enumerateDevices()? If it is, and the user selects it via gUM, but also invokes it via this API, what happens?

I wouldn’t say we have a new delegation mechanism for permission requests; we check the same camera permission and prompt the same way as any other request to use the camera. The AR camera is not iterable via enumerateDevices(). While I believe it may be technically possible to expose it there, the AR session would have to be started in that mode and would have decreased performance. Further, WebXR crops the camera image to match the screen aspect ratio, and we ensure that the frame delivered from this API is cropped the same way; this is not something we can do via gUM. It would be a privacy concern if the application received a larger field of view from the camera than what the user sees on their screen.
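To make the permission flow concrete, here is a hedged sketch; `"camera-access"` is the feature descriptor from the spec draft, and the exact prompt UI is left to the User Agent:

```js
// Camera access is requested as a WebXR session feature; requesting the
// feature triggers the same camera permission used elsewhere on the platform,
// not a parallel getUserMedia()-style device-selection flow, and the AR
// camera is not listed by enumerateDevices().
const session = await navigator.xr.requestSession("immersive-ar", {
  optionalFeatures: ["camera-access"],
});

// If the permission is declined, the session can still start (the feature was
// only optional), and views simply won't expose a camera image: view.camera
// stays null, so applications should feature-detect it per view.
```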

My question about the GL texture was badly stated but boils down to what the internal format is. Is it "raw" (something new)? Is it ImageData? What makes it "raw"?

The "raw" naming isn't related to "raw images" as used in DSLR photography. In this context it refers to the distinction between "raw camera access" in the sense of "here's the camera image as a pixel grid" and indirectly provided higher-level abstractions such as poses, planes, or hit test results, which are the result of extensive processing done internally on the source camera data. There's nothing special about the image texture; it's equivalent to the kind of data you'd get by taking a snapshot of a video stream. In hindsight, "direct camera access" might have been a better name to avoid confusion with other concepts such as getting unprocessed sensor data.

All of these questions suggest that we want to keep types, API surface, and other attributes in close alignment now and in the future, so a good design goal will be to at least outline how that would be done. So a fuller discussion of considered alternatives would be expected to speak to this, and I'm interested in seeing this doc grow in that way.

I’m not sure I agree. I think this API and the one that would properly expose headset cameras or support devices with multiple cameras solve two different problems, and the latter will likely be a vastly different API. Part of why we’re trying to ship this API separately now is to address the hand-held AR scenarios without needing to solve all of the open questions for the other space. Since that discussion is ongoing, I think a pointer to the relevant open issue is the best way to handle the alternatives. I think this goal is now clear in the explainer given the new non-goals section.

On moving IDL out of explainers, this comes back to long-standing guidance, both from my time as Chrome's Standards TL and the TAG's Explainer Explainer. The audience for an explainer is, primarily, a developer that's trying to think about how this design can work in their system, rather than how the design will be expressed in our system.

Ack, I think I was confusing this with the early stage of an explainer, which often includes proposed IDL before a formal spec is written.

Crosslinks between documents help readers find the details that don't naturally belong in one document or the other. For example, IDL should not be in explainers, and non-normative discussion can often feel out of place in a spec.

Sure, definitely understand that. I just hadn’t seen (and couldn’t find) any other examples of this, so let me know if you think I did the cross-linking in a reasonable way, or if there is established guidance on what to do here.

slightlyoff commented 2 years ago

Sorry for the slow follow-up here, @alcooper91.

Let me back up a bit. We should expect AR devices to be all sorts of things (cameras, displays, sensor platforms for other types of input, etc.), which means developers addressing the same hardware through different API surfaces is more than likely.

That argues, at a minimum, for APIs that are designed for long-term future alignment between the ways in which we configure, enumerate, and trigger capabilities about those sensors. I'm a little frustrated that the TAG didn't raise this sort of consistency in their review (perhaps @torgo can speak to the discussion there?), but it's the kind of thing we've hit in many other areas of platform development over the years and now understand needs to be handled affirmatively rather than being left as an exercise to further API evolution.

To that end, I need to see this explainer improve in a few dimensions:

This might be the ideal design, but from what I've read here, it's hard to make a case either way. Hopefully a fuller explainer will quash doubts.

alcooper91 commented 2 years ago

Thanks Alex.

While I agree that many sensors may make up an AR session, I don’t think that precludes an AR session from “locking” the state or capabilities of those sensors and/or requiring essentially exclusive access. The relevant specs for the APIs that access the various hardware separately from the WebXR Device API already have (or should have) algorithms for what to do if no device can be found that satisfies the constraints. In the case of a device that is already in use by the WebXR Device API, I’d imagine that its capabilities, resolution, etc. would already be locked (or the device marked as in-use/unavailable). However, as you mention, this doesn’t just apply to the camera but to other pieces of AR hardware as well, so I don’t think this explainer/module is the appropriate place to dive into that discussion. After discussing this internally, @klausw has filed https://github.com/immersive-web/webxr-ar-module/issues/87 on the core AR Module, which I think is the appropriate venue for that discussion. I can add some clarifying text somewhere stating that the camera image returned from this API will be configured based on the User Agent’s requirements for compositing the AR session, if that would help add clarity to this explainer.

  • considered alternatives should outline the tradeoffs in play WRT this design and getUserMedia(), with particular attention paid to the varied configuration options gUM takes and how we will ensure that developers are not confused in future when trying to access the same devices through both interfaces. That is, if this API configures the device in a lower-resolution, higher-frame-rate mode, but gUM is called with different configuration options, what happens?

Apologies if this is just an elaboration of the point I’ve already addressed, but the second line of the explainer states why getUserMedia() is insufficient for the use cases we’re describing:

Additionally, alternative ways for sites to obtain raw camera access (getUserMedia() web API) are not going to provide the application pose-synchronized camera images that could be integrated with WebXR Device API.

The purpose of this API is explicitly “I am using the WebXR Device API on a device that internally uses camera data to derive poses and environment features, and I want synchronized access to that camera image to integrate/use it with that pose data”. The target audience is developers who have already decided that they want to use the WebXR Device API for the benefits that it provides and explicitly do not want to use getUserMedia().

If you’re still unhappy with the alignment/discussion of the alignment between this API and getUserMedia after this explanation/proposed changes, I think it may be more beneficial to have a video chat, as I may be misunderstanding your concerns. Unfortunately the regularly scheduled immersive web meeting was yesterday and so the next one will be two weeks from now, so please reach out to me via the email I sent to blink-dev to schedule something.

  • discussion of concurrent and persistent access, permissions, and delegation unification given that this design creates a parallel surface area

What do you think about adding a line to this section of the spec like:

User Agents SHOULD also re-use permissions UI (or indeed permissions granted to the origin already) developed for the powerful feature named “[camera](https://www.w3.org/TR/mediacapture-streams/#dfn-camera)”.

Access is already scoped to the duration of the XR Session (because we’re leveraging many WebXR Device API concepts).

  • discussion of device enumeration vs. the existing APIs

See above discussion for why I think this is not the appropriate place to discuss that.

  • a clearer discussion of why this is not related to "RAW" camera data (and if other names are/were considered)

To my knowledge, other names were not considered; I don’t think any of us who have worked on this specification have a photography background. Would a note like the following at the start of the explainer be sufficient?

Note: “Raw” in the context of this specification should be taken to mean “direct”, not “RAW” as in the camera format.

It’s worth noting that “Raw” doesn’t appear in any of the APIs themselves, so if you know the process to rename the Module/specification, we’d also be happy to pursue that and update references to “raw” within the explainer and specification.

  • moving IDL out to the spec and leading with example code for each of the problems the design asserts it's going to solve

I think this is addressed in my PR #15; I was waiting to merge it to see if there were further updates I needed to make as a result of our discussion. Let me know if there’s something else you’d like changed with that.