I would like to "bump" this issue, and suggest we move it to a repo to work on. I have implemented a proof of concept in the Mozilla WebXR Viewer (plus our older webxr-polyfill that supports it), and it looks quite promising. The implementation is pretty clean and simple, but there are a number of ways this could evolve.
Essentially, my implementation does this (some of which should probably change), and open questions remain about each piece:

- `requestVideoFrame`, analogous to `requestAnimationFrame`, to control frame rate. I'm not sending the function in each time, since that wouldn't make sense with the Worker. If there is no advantage to providing the Worker directly, we could just use the same structure as rAF and let the page pass the method off to the worker.
- `getVideoFramePose(videoFrame, poseOut)`, which transforms the frame's pose into the current coordinate system for rendering. The idea is that video frames may be old and thus no longer in the same coordinate frame, so we need a method to make them valid again (platforms like ARKit, ARCore and HoloLens say you should ask, during the rendering frame, for the pose of the camera and all anchors, and these may not be valid from frame to frame). Internally, I create platform Anchors over time and express the camera relative to one of them, and this method just adjusts based on the current Anchor pose.
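For concreteness, here is a rough, hypothetical sketch of how those two calls might fit together. It assumes the methods hang off an active XR `session` object and that the worker posts back frame metadata; the real WebXR Viewer / webxr-polyfill code may differ in all of these details.

```js
// Rough, hypothetical sketch only; names beyond requestVideoFrame and
// getVideoFramePose are assumptions, not the actual polyfill API.
const cvWorker = new Worker('cv-worker.js');            // runs the actual CV code
let latestVideoFrame = null;                            // most recent frame reported by the worker
cvWorker.onmessage = (e) => { latestVideoFrame = e.data; };

// Analogous to requestAnimationFrame, but paced to the camera: frames are
// delivered to the Worker rather than via a callback passed in each time.
session.requestVideoFrame(cvWorker);

// During the normal rendering rAF, re-express an older video frame's camera
// pose in the current coordinate system before using results computed from it.
function onXRFrame(time, frame) {
  if (latestVideoFrame) {
    const poseOut = new Float32Array(16);               // assumed: column-major 4x4 matrix
    session.getVideoFramePose(latestVideoFrame, poseOut);
    // ... render content registered against the corrected camera pose ...
  }
  session.requestAnimationFrame(onXRFrame);
}
session.requestAnimationFrame(onXRFrame);
```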
@blairmacintyre I'd like to see more conversation and agreement in this Issue about specific approaches (e.g. your work in WebXR Viewer) before spinning out a repo. It might help to write up a summary of the API you used and then ask other members to weigh in.
ok
Can I also suggest people read the blog post I did on implementing a CV API in the WebXR Viewer? https://blog.mozvr.com/experimenting-with-computer-vision-in-webxr/
Over the course of building the samples and writing the post, my thinking evolved. I think we need to embrace some of the proposed WebRTC extensions that I linked to in that post as part of the solution. It would be great if we could get the WebRTC folks involved.
/me waves hi 8)
We've done a lot of testing lately comparing the new `gl.FRAMEBUFFER` -> `gl.readPixels()` pipeline to the more traditional `HTMLVideoElement` -> `ctx.drawImage()` -> `ctx.getImageData()` pipeline, and were surprised to find it generally seems slower.
Plus we've done quite a bit of research into performance of computer vision processing in workers and found there's a range of surprising performance penalties there too.
However, we've been able to deliver full-featured Natural Feature Tracking using WASM and the more traditional `HTMLVideoElement` -> `ctx.drawImage()` -> `ctx.getImageData()` pipeline, running at about 20-30fps on mobile browsers and about 50-60fps on desktop browsers.
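For readers who haven't built this, here is a minimal sketch of that traditional pipeline, including handing the pixel buffer off to a Worker as a transferable. The element setup, worker filename, and frame pacing are illustrative, not anyone's production code.

```js
// Minimal sketch of the HTMLVideoElement -> drawImage -> getImageData pipeline.
const video = document.querySelector('video');   // camera stream already attached elsewhere
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
const cvWorker = new Worker('cv-worker.js');     // runs the WASM tracking code

function grabFrame() {
  if (video.videoWidth) {                        // skip until the stream is ready
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    ctx.drawImage(video, 0, 0);                  // copy the current video frame
    const { data, width, height } = ctx.getImageData(0, 0, canvas.width, canvas.height);
    // Transfer (not copy) the RGBA buffer to the worker to avoid an extra copy.
    cvWorker.postMessage({ pixels: data.buffer, width, height }, [data.buffer]);
  }
  requestAnimationFrame(grabFrame);
}
requestAnimationFrame(grabFrame);
```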
We've also released support for #WebXR in our SaaS content creation platform (https://awe.media) and are very interested in supporting any work that extends the API more into the computer vision space. We're also interested in a more "Extensible Web Manifesto" approach that makes the underlying Features & Points available in as efficient and raw a format as possible to foster experimentation. Plus anything that helps improve the efficiency of accessing Pixels.
It would definitely be great to get @anssiko, @huningxin & @astojilj involved - they're my other co-authors on the Media Capture Depth Stream Extensions spec. Plus they all work with Moh (at Intel) who has been doing a lot of work on OpenCV.js, etc. https://pdfs.semanticscholar.org/094d/01d9eff739dce54c73bba06e097029e6f47a.pdf
Incidentally, I'd make a strong, strong recommendation that use cases and scenarios be used to drive the explainers for new features/proposals, too. Before choosing an API shape, you need to narrow in on what problem you're trying to solve.
Work in this area should include specifying the Feature Policy controls associated with access to the camera data. The existing "camera" policy, combined with other XR policies (https://github.com/immersive-web/webxr/issues/308), may be sufficient.
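As a small sketch of what relying on the existing "camera" policy could look like from script, assuming the Feature Policy JS API is available (it isn't in every browser):

```js
// Sketch: check that the existing "camera" Feature Policy allows camera access
// in this browsing context before wiring up any CV processing.
if (document.featurePolicy && document.featurePolicy.allowsFeature('camera')) {
  // safe to request camera frames for computer vision
}
```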
From a vision perspective, the important things are low-level access to sensors and (on mobile phones) to drawing. The key is not to make these applications easy, but to make them possible.
Here are a few example use cases that cover several different requirements:
Here are some things that need to be possible to satisfy these applications:
Hi @nbutko I demonstrated much of what you are describing in the WebXR Viewer (see https://blog.mozvr.com/experimenting-with-computer-vision-in-webxr/ ).
There are a few more things I would add to your discussion:
@blairmacintyre Do you have a link to the webrtc discussion you noted?
@nbutko alas, sorry, it was in email. I can fold you into a conversation there, if you like.
@nbutko @blairmacintyre, FYI, the WebRTC Next Version Use Cases include a computer-vision-based "funny hats" use case that requires new capabilities, including raw media access, processed-frame insertion, and off-main-thread processing.
Also, I'd like to note the nascent Shape Detection API here: https://wicg.github.io/shape-detection-api/.
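For reference, the shape of that API looks roughly like the sketch below (it is experimental and not available in all browsers, so feature-detect before using it):

```js
// Sketch using the Shape Detection API's BarcodeDetector.
async function detectQRCodes(imageBitmap) {
  if (!('BarcodeDetector' in window)) return [];            // not supported here
  const detector = new BarcodeDetector({ formats: ['qr_code'] });
  return detector.detect(imageBitmap);  // resolves to DetectedBarcode[] (rawValue, boundingBox, ...)
}
```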
Based on yesterday's CG call, I would propose that there are 4 potential use cases we should consider working on, intertwined but sufficiently separate that we can talk about them separately. We may want to make these one new repo, with 4 explainers/sections.
That's my proposal. (1) and (5) would not be worked on in depth, but would capture those use cases and point elsewhere. (2) and (3) are the most important, and share a common need to have the available cameras exposed and to let the developer request access to them. They can likely be done together (e.g., request cameras, direct data to the CPU and/or GPU, and guarantee that if the camera is synced with rAF, the data will be available before rAF, making this known if so), but they are separable if we only want to tackle one first. (4) is orthogonal and can be added if we build (2) and/or (3).
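To make (2) and (3) concrete, here is a purely hypothetical sketch of the kind of shape such an API could take; none of these method or property names exist in any spec or implementation.

```js
// Purely hypothetical sketch of topics (2) and (3) above; not a real API.
async function startCustomCV(session, gl) {
  const cameras = await session.requestCameras();   // developer requests access to device cameras
  const cam = cameras[0];
  const cpuFrames = cam.requestCPUAccess();         // (2) raw pixels for WASM / Worker CV
  const gpuTexture = cam.requestGPUTexture(gl);     // (3) zero-copy path into WebGL
  session.requestAnimationFrame(function onFrame(time, frame) {
    if (cam.isSynchronizedWithRAF) {
      // camera data is guaranteed to have arrived before this rAF callback,
      // so CV results (from cpuFrames or gpuTexture) can be applied to the
      // frame being rendered
    }
    session.requestAnimationFrame(onFrame);
  });
}
```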
This is fantastic, @blairmacintyre! I totally support the creation of this repo and the authoring of the documents you've outlined. And I'm very much looking forward to reading the proposals!
I agree with @blairmacintyre's proposal and will now create a new feature repo, with the goal of writing explainers for each topic in Blair's list and then working mainly on topics 2 and 3.
Thanks to everyone who helped us reach clarity about the topics and goals! It took a while, but we'll make better progress with this in mind.
The feature repo has been created: https://github.com/immersive-web/computer-vision
@blairmacintyre I've made you a repo admin with the assumption that you'll take the lead in putting together the initial structure and explainers as described above. If that's not a good assumption then let me know!
NOTE: Future conversations on the topic of CV for XR should happen in the computer-vision Issues and when helpful should refer to this Issue, which will remain in the proposals repo.
As AR-capable WebXR implementations come online, it will hopefully be possible to do custom computer vision in the browser. WebAssembly is plenty fast; the limitation is efficient access to the image/depth data, camera intrinsics, and camera extrinsics from any cameras on the device, expressed in a form that is compatible with the WebXR device data, along with easy synchronization with other sensors (e.g., accelerometers).
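As one small example of why the intrinsics matter to custom CV code, here is the pinhole projection from camera space to pixel coordinates; coordinate conventions vary by platform, so treat this as illustrative only.

```js
// Illustrative only: pinhole projection using camera intrinsics.
// Assumes a point (X, Y, Z) already expressed in camera space with Z > 0 in
// front of the camera, and intrinsics {fx, fy, cx, cy} in pixel units.
function projectToPixel({ fx, fy, cx, cy }, [X, Y, Z]) {
  const u = fx * (X / Z) + cx;   // horizontal pixel coordinate
  const v = fy * (Y / Z) + cy;   // vertical pixel coordinate
  return [u, v];
}
```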
It seems there are two ways to approach this:
As AR-capable devices (existing ones exposing APIs like Windows MR, ARKit, etc., or new devices with new APIs) become more pervasive, it should be reasonable to assume that an AR-capable device has the necessary data (camera and other sensor specs in relation to the display, an API that makes efficient access possible) to power such capabilities.
While it will not be necessary for all web-based AR applications to do custom CV, a wide variety of applications will need to. This is especially true if WebXR does not provide a standard set of capabilities that all platforms must implement.