immersive-web / proposals

Initial proposals for future Immersive Web work (see README)

How to enable Custom Computer Vision for AR in WebXR #4

Closed blairmacintyre closed 5 years ago

blairmacintyre commented 6 years ago

As AR-capable WebXR implementations come online, it will hopefully be possible to do custom computer vision in the browser. WebAssembly is plenty fast; the limitation is efficient access to the image/depth data, camera intrinsics, and camera extrinsics from any cameras on the device, expressed in a form compatible with the WebXR device data, along with easy synchronization with other sensors (e.g., accelerometers).
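
For concreteness, here is a rough sketch of the per-frame data such an API would need to surface. This is entirely hypothetical; none of these names are specced, they just illustrate the requirements listed above:

```js
// Hypothetical per-frame payload for custom CV; names and layout are illustrative only.
const cvFrame = {
  timestamp: 0,                              // on the same clock as XRFrame / other sensor events
  width: 1920, height: 1080,
  buffer: new ArrayBuffer(1920 * 1080 * 4),  // raw image (or depth) data, ideally zero-copy
  intrinsics: new Float32Array(9),           // focal lengths, principal point, skew (3x3 matrix)
  extrinsics: new Float32Array(16),          // camera pose relative to the WebXR coordinate system
};
```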

It seems there are two ways to approach this:

As AR-capable devices (existing ones exposing APIs like Windows MR, ARKit, etc., or new devices with new APIs) become more pervasive, it should be reasonable to assume that an AR-capable device has the necessary data (camera and other sensor specs in relation to the display, an API that makes efficient access possible) to power such capabilities.

While it will not be necessary for all web-based AR applications to do custom CV, a wide variety of applications will need to. This is especially true if WebXR does not provide a standard set of capabilities that all platforms must implement.

blairmacintyre commented 6 years ago

I would like to "bump" this issue, and suggest we move it to a repo to work on. I have implemented a proof of concept in the Mozilla WebXR Viewer (plus our older webxr-polyfill that supports it), and it looks quite promising. The implementation is pretty clean and simple, but there are a number of ways this could evolve.

Essentially, my implementation does this (some of which should probably change):

Open questions include

TrevorFSmith commented 6 years ago

@blairmacintyre I'd like to see more conversation and agreement in this Issue about specific approaches (e.g. your work in WebXR Viewer) before spinning out a repo. It might help to write up a summary of the API you used and then ask other members to weigh in.

blairmacintyre commented 6 years ago

ok

blairmacintyre commented 6 years ago

Can I also suggest people read the blog post I did on implementing a CV API in the WebXR Viewer? https://blog.mozvr.com/experimenting-with-computer-vision-in-webxr/

Over the course of building the samples and writing the post, my thinking evolved. I think we need to embrace some of the proposed WebRTC extensions that I linked to in that post as part of the solution. It would be great if we could get the WebRTC folks involved.

robman commented 6 years ago

/me waves hi 8)

We've done a lot of testing lately comparing the new gl.FRAMEBUFFER -> .readPixels() pipeline to the more traditional HTMLVideoElement -> canvas.ctx.drawImage() -> canvas.ctx.getImageData() pipeline and were surprised to find the WebGL path generally seems slower.
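
For reference, a minimal sketch of the two pixel-access paths being compared, assuming a playing `video` element carrying the camera stream:

```js
// CPU path: HTMLVideoElement -> ctx.drawImage() -> ctx.getImageData()
const canvas2d = document.createElement('canvas');
canvas2d.width = video.videoWidth;
canvas2d.height = video.videoHeight;
const ctx = canvas2d.getContext('2d');
ctx.drawImage(video, 0, 0);
const rgba = ctx.getImageData(0, 0, canvas2d.width, canvas2d.height).data;

// WebGL path: upload the frame to a texture, render it, then read it back with readPixels()
const gl = document.createElement('canvas').getContext('webgl');
const tex = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, tex);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, video);
// ... draw a textured quad into the default framebuffer or an FBO ...
const pixels = new Uint8Array(gl.drawingBufferWidth * gl.drawingBufferHeight * 4);
gl.readPixels(0, 0, gl.drawingBufferWidth, gl.drawingBufferHeight,
              gl.RGBA, gl.UNSIGNED_BYTE, pixels); // synchronous readback, stalls the GPU pipeline
```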

Plus we've done quite a bit of research into performance of computer vision processing in workers and found there's a range of surprising performance penalties there too.
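
One obvious cost is moving frame data across the main-thread/worker boundary; transferring the underlying buffer at least avoids a copy. A minimal sketch (`cv-worker.js` is a hypothetical worker script):

```js
// main thread: hand each frame's pixels to the worker by transferring the buffer (no copy)
const worker = new Worker('cv-worker.js');
function sendFrame(imageData) {
  worker.postMessage(
    { width: imageData.width, height: imageData.height, buffer: imageData.data.buffer },
    [imageData.data.buffer]  // after this, imageData is detached on the main thread
  );
}

// cv-worker.js: rebuild a typed-array view and run the (e.g. WASM) tracker on it
self.onmessage = (event) => {
  const { width, height, buffer } = event.data;
  const pixels = new Uint8ClampedArray(buffer);
  // ... feed pixels into the tracking code ...
};
```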

However, we've been able to deliver full-featured Natural Feature Tracking using WASM and the more traditional HTMLVideoElement -> canvas.ctx.drawImage() -> canvas.ctx.getImageData() pipeline, running at about 20-30fps in mobile browsers and about 50-60fps in desktop browsers.

We've also released support for #WebXR in our SaaS content creation platform (https://awe.media) and are very interested in supporting any work that extends the API further into the computer vision space. We're also interested in a more "Extensible Web Manifesto" approach that makes the underlying features and points available in as efficient and raw a format as possible to foster experimentation, plus anything that helps improve the efficiency of accessing pixels.

It would definitely be great to get @anssiko, @huningxin & @astojilj involved - they're my other co-authors on the Media Capture Depth Stream Extensions spec. Plus they all work with Moh (at Intel) who has been doing a lot of work on OpenCV.js, etc. https://pdfs.semanticscholar.org/094d/01d9eff739dce54c73bba06e097029e6f47a.pdf

cwilso commented 6 years ago

Incidentally, I'd make a strong, strong recommendation that use cases and scenarios be used to drive the explainers for new features/proposals, too. Before choosing an API shape, you need to narrow down what problem you're trying to solve.

ddorwin commented 5 years ago

Work in this area should include specifying the Feature Policy controls associated with access to the camera data. The existing "camera" policy combined with other XR policies (https://github.com/immersive-web/webxr/issues/308) may be sufficient.
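
With the existing "camera" policy, that could look roughly like the following (the URL is a placeholder, and the exact combination with the XR policies is still open per the linked issue):

```html
<!-- HTTP response header on the embedding page: Feature-Policy: camera 'self' -->
<!-- Delegating camera access to an embedded AR experience: -->
<iframe src="https://example.com/ar-experience" allow="camera"></iframe>
```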

nbutko commented 5 years ago

From a vision perspective, the important things are low-level access to sensors and (on mobile phones) to drawing. The key is not to make these applications easy, but to make them possible.

Here are a few example use cases that cover several different requirements:

Here are some things that need to be possible to satisfy these applications:

blairmacintyre commented 5 years ago

Hi @nbutko, I demonstrated much of what you are describing in the WebXR Viewer (see https://blog.mozvr.com/experimenting-with-computer-vision-in-webxr/).

There are a few more things I would add to your discussion:

nbutko commented 5 years ago

@blairmacintyre Do you have a link to the webrtc discussion you noted?

blairmacintyre commented 5 years ago

@nbutko alas, sorry, it was in email. I can fold you into a conversation there, if you like.

huningxin commented 5 years ago

@nbutko @blairmacintyre, FYI, the WebRTC Next Version Use Cases include a computer-vision-based "funny hats" use case that requires new capabilities, including raw media access, processed frame insertion, and off-main-thread processing.
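
As a point of comparison, the closest one can get with today's primitives is roughly the sketch below: pull frames off a `getUserMedia` video, process them on a canvas on the main thread, and re-emit the result via `canvas.captureStream()`. The use case asks for exactly what this approach lacks: raw frame access without extra copies, insertion of processed frames back into the original stream, and processing off the main thread.

```js
async function funnyHats() {
  const camera = await navigator.mediaDevices.getUserMedia({ video: true });
  const video = document.createElement('video');
  video.srcObject = camera;
  await video.play();

  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext('2d');

  (function process() {
    ctx.drawImage(video, 0, 0);
    // ... detect the face and draw the hat here (the actual CV step) ...
    requestAnimationFrame(process);
  })();

  return canvas.captureStream(30); // processed stream, e.g. for an RTCPeerConnection
}
```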

cwilso commented 5 years ago

Also, would like to note the nascent Shape Detection API here: https://wicg.github.io/shape-detection-api/.
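
A minimal sketch of what it exposes (at the time of writing it is only available behind flags/origin trials in some Chromium builds):

```js
// Detect QR codes in a playing <video> element; detect() also accepts canvases and ImageBitmaps.
async function findQRCodes(video) {
  if (!('BarcodeDetector' in window)) return [];
  const detector = new BarcodeDetector({ formats: ['qr_code'] });
  const barcodes = await detector.detect(video);
  return barcodes.map(b => ({ value: b.rawValue, box: b.boundingBox }));
}
```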

blairmacintyre commented 5 years ago

Based on yesterday's CG call, I would propose that there are five potential use cases we should consider working on, intertwined but sufficiently separate that we can discuss them individually. We may want to make these one new repo, with five explainers/sections.

  1. Extending WebRTC to support access to cameras during XR device sessions, and streaming/recording of video from XR devices. The scenario here is "a worker wants to let a remote expert see what they are seeing, with augmentations overlaid in their view." This applies to enterprise as well as consumer apps; for consumers, it's "a homeowner wants to show a Home Depot consultant something, get guidance, and order parts for a repair." This work may need to be done over in the WebRTC WG, but we should track it here in a document with pointers, to avoid these scenarios coming up over and over. Create webrtc-remote-video.md
  2. Synchronous access, on the GPU, to video frames from the camera on video-mixed-reality devices (e.g., phones running ARKit/ARCore). The scenario here is to do graphics effects by having access to the video in GPU memory, so the simple "overlay graphics on video" can be augmented with shadows, distortion, and other effects. There is potential to address privacy separately from (3) if we can arrange for it not to be possible to access the video data in JS. Create video-effects.md
  3. Asynchronous access to video frames on the CPU. Asynchronous because most non-video-mixed devices do not run the camera and display at the same rate. The scenario here is to do real-time computer vision (e.g., SLAM like 6d.ai, 8thWall, etc. are working on; CV tracking algorithms like Vuforia) in a platform-independent way. Some of what is done here might eventually make it into platforms (and already exists in some), such as image detection. Other uses would be custom algorithms that need to work everywhere, for art, advertising, games, and so on. This is the scenario I talked about in the blog post mentioned above (a hypothetical sketch contrasting (2) and (3) appears after this list). Create cv-in-page.md
  4. Exposing some native, cross-platform CV algorithms. The browser can expose entire algorithms running on the camera video, as suggested by the Shape Detection API, which could start with very basic capabilities (like detecting barcodes in 3D, images, perhaps faces). Some of the specific algorithms could be optional, but it would be nice if there were very straightforward things (like barcodes, discussed in the Shape Detection API) that could be implemented everywhere. Here we could talk about what it would be like to have some common capabilities, and how platform-specific ones might be exposed (like ARKit/ARCore features, or perhaps for browsers like Argon4 that want to embed something like Vuforia). Create cv-in-browser.md
  5. There has been discussion of exposing some computer vision algorithm components, to allow native processing of video frames before they are sent into the app (either to the GPU or CPU, in 2 and 3 above), perhaps leveraging a library like Khronos' OpenVX. Essentially, we can start by thinking about this as WebVX. Being able to leverage optimized platform capabilities for well-known basic algorithms (image pyramids, simple feature extraction, image conversion, blur, etc.) could speed up in-app CV, and also allow some of the effects that might be done in the synchronous GPU case to be done faster. Like the WebRTC discussion, if we wanted to pursue this, we would want to do it elsewhere, but the scenario has been brought up multiple times, so we should create webvx.md to summarize and record it, and point elsewhere if we pursue it.
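
To make the distinction between (2) and (3) concrete, here is a purely hypothetical sketch; none of these names exist in any spec, they only illustrate the two access patterns:

```js
// (2) Synchronous GPU access: the camera frame is available as a texture inside rAF,
//     so effects can sample it without the pixels ever being visible to JavaScript.
session.requestAnimationFrame((time, xrFrame) => {
  const cameraTexture = xrFrame.getCameraTexture(glBinding); // hypothetical
  drawSceneWithVideoEffects(cameraTexture);                  // hypothetical app code
});

// (3) Asynchronous CPU access: frames arrive on the camera's own cadence and are handed
//     off (ideally zero-copy) to CV code, e.g. in a worker.
session.addEventListener('cameraframe', (frame) => {         // hypothetical event
  cvWorker.postMessage(
    { buffer: frame.buffer, intrinsics: frame.intrinsics, pose: frame.pose },
    [frame.buffer]
  );
});
```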

That's my proposal. (1) and (5) would not be worked on in depth, but would capture those use cases and point elsewhere. (2) and (3) are the most important, and share a common need to have the available cameras exposed and to let the developer request access to them. They can likely be done together (e.g., request cameras, direct data to the CPU and/or GPU, and guarantee that if the camera is in sync with rAF the data will be available before rAF, making this known if so), but they are separable if we only want to tackle one first. (4) is orthogonal and can be added if we build (2) and/or (3).

NellWaliczek commented 5 years ago

This is fantastic, @blairmacintyre! I totally support the creation of this repo and the authoring of the documents you've outlined. And I'm very much looking forward to reading the proposals!

TrevorFSmith commented 5 years ago

I agree with @blairmacintyre's proposal and will now create a new feature repo, with the goal of writing explainers for each topic in Blair's list and then working mainly on topics 2 and 3.

Thanks to everyone who helped us reach clarity about the topics and goals! It took a while, but we'll make better progress with this in mind.

TrevorFSmith commented 5 years ago

The feature repo has been created: https://github.com/immersive-web/computer-vision

@blairmacintyre I've made you a repo admin with the assumption that you'll take the lead in putting together the initial structure and explainers as described above. If that's not a good assumption then let me know!

TrevorFSmith commented 5 years ago

NOTE: Future conversations on the topic of CV for XR should happen in the computer-vision Issues and when helpful should refer to this Issue, which will remain in the proposals repo.