TrevorFSmith opened this issue 7 years ago
Chatting with NingXin Hu, he suggested mediacapture-worker (https://w3c.github.io/mediacapture-worker/) might be a good starting point for how to structure a vision worker.
That is essentially what I have imagined: a worker-like setup where WebXR can execute custom CV code (perhaps in WebAssembly, JavaScript, or even on the GPU) for each video frame. We also have the opportunity to provide whatever necessary data we want (e.g., camera intrinsics, the pose of the camera relative to some frame of reference, other time-synchronized sensor data such as accelerometers/gyros, etc.).
Some of this (the sensor data) might be best provided via a separate sensor API (assuming we can leverage shared memory to share it between workers). Looking at modern camera APIs, I think we should consider assuming we have things like intrinsics for each camera; at a minimum, make this an optional field.
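To make that concrete, here is a minimal sketch of what the worker side of such a setup might look like. Everything here is hypothetical: the frame message shape (`pixels`, `intrinsics`, `pose`, `timestamp`) and `detectMarkers` are placeholders for whatever WebXR would actually deliver and whatever CV code the app supplies.

```javascript
// vision-worker.js — hypothetical worker script; the message shape below
// is an assumption for discussion, not part of any spec.

// Placeholder for the app's own CV code (wasm, JS, or a GPU dispatch).
function detectMarkers(pixels, width, height, intrinsics) {
  return []; // a real implementation would return detected features
}

self.onmessage = (event) => {
  const { pixels, width, height, intrinsics, pose, timestamp } = event.data;
  // intrinsics: e.g. { fx, fy, cx, cy } per camera, possibly optional.
  // pose: 4x4 matrix of the camera relative to some frame of reference,
  // valid at `timestamp`; other synchronized sensor data could ride along.
  const results = detectMarkers(pixels, width, height, intrinsics);
  self.postMessage({ timestamp, results });
};
```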
Having the camera video not just be assumed to be "the video we are overlaying AR onto" is essential, I think: we want to support see-through devices with cameras (like the Hololens), multi-camera devices, and devices (like the Vive) whose cameras don't align with or cover the displayed view.
We should assume that we can provide the camera pose relative to some "display" frame of reference. CV for AR has generally been hampered by not knowing the calibrated structure of the display and sensor package, but on real devices with cameras (e.g., Hololens, Vive), the relationship between the device coordinate system and the camera, along with the camera intrinsics, is pre-calibrated. ARKit and ARCore will also provide this information on mobile, and I assume any custom HMD will be able to provide it for any attached devices.
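As a worked illustration of why calibrated intrinsics matter: with a pinhole camera model, projecting a camera-space point into pixel coordinates is a couple of multiplies, but without per-device `fx/fy/cx/cy` values a CV library has to guess them. This is textbook projection math, not any particular device API; the numeric values are illustrative only.

```javascript
// Pinhole projection: map a camera-space point to pixel coordinates using
// calibrated intrinsics (focal lengths fx/fy and principal point cx/cy,
// all in pixels). Example values, not from a real device.
const intrinsics = { fx: 1430.0, fy: 1430.0, cx: 960.0, cy: 540.0 };

function projectToPixels([x, y, z], { fx, fy, cx, cy }) {
  // Assumes the point is already in the camera's frame, z > 0 in front.
  return [fx * (x / z) + cx, fy * (y / z) + cy];
}

// A point 2m in front of the camera and 0.5m to the right:
projectToPixels([0.5, 0, 2], intrinsics); // => [1317.5, 540]
```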
The MediaCapture Worker doc is marked as inactive, so it's probably not going to help on the implementation side, but I agree that the pattern is one that could work for this.
Yes, we need to handle camera data of varying types and FOV coverage with intrinsics to inform the CV algorithms.
Yes, I don't mean "use it": we don't want to use WebRTC directly at all. What I envision, eventually, might be a way to "add in" WebRTC sources to the worker structure, but for now I think the video sources should be accessed and configured via WebXR, because we only want sources that really have the information we need and that we can access efficiently.
I was thinking of the patterns, yes.
Agreed. I don't suggest taking the MediaCapture Worker spec as is.
We (with Mozilla folks) previously tried to bring CV to the web. We made some progress on MediaCapture Worker for off-main-thread processing, the ImageBitmap extension for efficient access to captured image data, the MediaCapture depth extension for depth camera access, and OpenCV.js for CV algorithms on the web (asm.js at the time; it now supports wasm).
I think we can leverage the experience gained from that previous work to benefit the CV use cases in WebXR.
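For reference, the OpenCV.js wasm build mentioned above can already process frames in a worker today. The `importScripts('opencv.js')` path assumes you host the build yourself, and the grayscale + Canny pipeline is just a stand-in for real CV work.

```javascript
// Inside a worker, after importScripts('opencv.js') has loaded the wasm
// build and cv is ready. Runs a trivial grayscale + edge-detection pass.
function processFrame(imageData) {
  const src = cv.matFromImageData(imageData); // RGBA ImageData -> cv.Mat
  const gray = new cv.Mat();
  const edges = new cv.Mat();
  cv.cvtColor(src, gray, cv.COLOR_RGBA2GRAY);
  cv.Canny(gray, edges, 50, 100);
  // ... hand `edges` to the rest of the pipeline here ...
  src.delete(); gray.delete(); edges.delete(); // OpenCV.js Mats are manual
}
```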
I am thinking of two use scenarios for camera data:

1. Rendering: drawing the camera image on screen, e.g. as the background of an AR scene.
2. Processing: accessing the camera pixels, e.g. to run CV algorithms.

The first case can be handled on the main thread; the second needs to be handled by a worker thread.
This requires representing the camera data by an opaque handle. The handle supports uploading the image data to the GPU when the data is in CPU memory, or skipping that step when the data is already in GPU memory. The handle also supports copying the camera data into the WebAssembly heap for the CPU-processing case. It should avoid the unnecessary color conversions and memory copies of the current MediaStream -> video -> canvas pipeline. The ImageBitmap extension is a good fit here.
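As a point of comparison, here is what the two consumption paths look like today with ImageBitmap as the handle, including the canvas round-trip that the proposal is meant to eliminate on the CPU path. `Module._malloc`/`Module.HEAPU8` follow Emscripten conventions and are an assumption about the wasm module's shape.

```javascript
// Path 1: GPU — an ImageBitmap can be uploaded straight into a WebGL
// texture with no CPU-side pixel access.
function uploadToGPU(gl, bitmap) {
  gl.bindTexture(gl.TEXTURE_2D, gl.createTexture());
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, bitmap);
}

// Path 2: CPU — today this still requires a canvas round-trip to reach
// the raw pixels; the proposed handle would let us skip these copies.
function copyToWasmHeap(bitmap, Module) {
  const canvas = new OffscreenCanvas(bitmap.width, bitmap.height);
  const ctx = canvas.getContext('2d');
  ctx.drawImage(bitmap, 0, 0);
  const { data } = ctx.getImageData(0, 0, bitmap.width, bitmap.height);
  const ptr = Module._malloc(data.byteLength); // Emscripten-style allocator
  Module.HEAPU8.set(data, ptr);
  return ptr; // caller is responsible for Module._free(ptr)
}
```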
Initiated an API sketch (https://github.com/mozilla/webxr-api/pull/18) for discussion.
Right now, there is no way to request the Reality camera data in order to do computer vision tasks like marker detection.
- Stub out an API on Realities to request access to the camera data.
- Stub out an API on Realities so that CV libs that detect markers can integrate them as XRAnchors.
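A rough sketch of how those two stubs might be used together. Every name here (`requestVideoFrames`, `addAnchor`, the frame fields, `MyMarkerDetector`) is a placeholder for discussion, not an existing API.

```javascript
// Hypothetical usage: request camera frames from the Reality, run a
// marker detector, and feed detections back as anchors.
const markerDetector = new MyMarkerDetector(); // app-supplied CV library

reality.requestVideoFrames((frame) => {
  // frame.buffer / frame.intrinsics / frame.pose are assumed fields.
  const detection = markerDetector.detect(frame);
  if (detection) {
    // Anchor the detected marker into the Reality's coordinate system;
    // assume this returns an XRAnchor the app can track.
    reality.addAnchor(detection.transform);
  }
});
```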
Eventually, the UA should display the camera data without giving the JS app access; only when the app requests direct access to the camera data should the security prompt and permission check be triggered. For now, UAs that provide the camera data via WebRTC media streams use that security prompt and check.
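For completeness, the WebRTC path that triggers today's prompt is the standard getUserMedia flow:

```javascript
// The existing prompt-and-check path: getUserMedia shows the camera
// permission prompt before any video data is handed to the page.
navigator.mediaDevices.getUserMedia({ video: true }).then((stream) => {
  const [track] = stream.getVideoTracks();
  // Frames then flow from this track, e.g. via a <video> element painted
  // to a canvas — exactly the copy-heavy pipeline criticized above.
});
```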