w3c / mediacapture-depth

The Media Capture Depth Stream Extensions specification
https://w3c.github.io/mediacapture-depth/

Proposal for FrameGrabber, FrameData, DepthMap #77

Closed: anssiko closed this 9 years ago

anssiko commented 9 years ago

This PR contains an early proposal with contributions from @huningxin, @ds-hwang, @robman, and @anssiko. The IDL is in place so that people can review the API shape, but the prose around the new interfaces is still missing, to let us iterate more easily on the API design based on wider feedback.

HTML preview: https://rawgit.com/anssiko/mediacapture-depth/framegrabber/index.html

New interfaces added in this PR:

The following interfaces and definitions were obsoleted by the new ones, and were removed:

Examples:

Editorial:

anssiko commented 9 years ago

This fixes #76, #73, #72, and #66.

ds-hwang commented 9 years ago

looks nice! lgtm

huningxin commented 9 years ago

Thanks, @anssiko! It looks nice. I've provided my comments.

anssiko commented 9 years ago

@huningxin @robman @ds-hwang Thanks for the review and comments. I've addressed them in an update to this PR.

Some questions on the new proposed DepthMap attributes:

robman commented 9 years ago

Hi @anssiko

For your second point, I don't believe we should drop near/far/format. The near and far values are required to reconstruct depth if you receive a normalised depth value, and format defines which algorithm to use.

For your last point, measureType defines whether the length is measured along the optical axis (depth down the X3 axis in the pinhole camera model) or as a ray (along the effective hypotenuse). Again, this tells us which algorithm to use.
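
For concreteness, here is a minimal sketch of the difference (not part of the proposed API; the pinhole intrinsics fx, fy, cx, cy are hypothetical and would come from calibration):

```js
// Minimal sketch, not part of the proposed API. The intrinsics
// fx, fy, cx, cy are hypothetical and would come from calibration.
function rayToAxialDepth(rayDistance, u, v, fx, fy, cx, cy) {
  // Direction of the ray through pixel (u, v) in the pinhole model.
  const x = (u - cx) / fx;
  const y = (v - cy) / fy;
  // The ray is sqrt(x*x + y*y + 1) times longer than the axial depth,
  // so divide to recover the depth along the optical axis.
  return rayDistance / Math.sqrt(x * x + y * y + 1);
}
```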

anssiko commented 9 years ago

@robman Thanks. What are the key use cases for which the raw depth map data would be preferred over the normalized data? And conversely, what are the key use cases that require normalized data? I'm interested in figuring out how much of the detail we can defer to the implementation, and how much we must expose to address the requirements. We should not expose metadata that has no supporting use case; we should minimize the API surface instead, even if exposing the metadata would be cheap.

Do the known implementations support both measureTypes?

huningxin commented 9 years ago

Hi @anssiko

Thanks for the update.

> Why wouldn't we provide the data in mm units directly? With Uint16Array 0..65535 we could represent dexels in the range of 0..655.35 metres which is probably enough. No need for units attribute.

It sounds good to me. The web platform can stick to one "native" unit and data format, and implementations figure out how to fit into it. There are two flavours:

  1. Uint16 in mm (BTW, the range would be 0..65.535 m, correct?)
  2. Float32 in m

Which do you prefer?

> This would allow us to drop format, near, and far since they are only needed by the implementation to implement the encoding pipeline.

To encode the depth value, e.g. to implement https://developers.google.com/depthmap-metadata/encoding on the web, web developers need to know near and far.

> What is the use case for measureType?

I agree with @robman: it tells the web developer the depth measurement model, and some algorithms depend on it.

near, far, and measureType can be in the MediaStreamTrack's Settings.

anssiko commented 9 years ago

@ds-hwang pointed out that Uint16 maps better to GPU when uploading this data to a WebGL texture. I'll let @ds-hwang expand.

@huningxin Yes, using Uint16 with mm units would represent the range of 0..65.535 m as you say.

If we stuck with a single format (say, range linear) and unit (say, mm), what use cases would we miss? At least that would make the API simpler for the web developer. IOW, what are the key use cases that require the web developer to be able to convert the normalised depth values back?

ds-hwang commented 9 years ago

Uint16 has enough accuracy. Let me explain.

First of all, we need to support a RangeLinear mode and a RangeInverse mode, like https://developers.google.com/depthmap-metadata/encoding.

Users will calculate the real distance from the depth camera output. Let's say the depth value is 10000 on [0, 65535], near is 1 m, and far is 5 m.

The RangeLinear formula is RealDistance = dn * (far - near) + near, where dn is the depth value normalised to [0, 1]. So the real distance is 10000/65535 * (5 m - 1 m) + 1 m ≈ 1.61 m; 65535 steps is accurate enough.

The RangeInverse formula is RealDistance = (far * near) / (far - dn * (far - near)), from which the user can likewise calculate the real distance.

As you can see, the depth value is just a scale; it has no unit. Only near and far carry units.

RangeLinear is used for things like dance games, and RangeInverse is used for face recognition.
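
For reference, a minimal JavaScript sketch of both decodings (not part of the proposal; d is the raw 16-bit value, near and far are in metres):

```js
// Decode a raw 16-bit depth value d (0..65535) to metres,
// for the two encodings discussed above.
function decodeRangeLinear(d, near, far) {
  const dn = d / 65535; // normalise to [0, 1]
  return dn * (far - near) + near;
}

function decodeRangeInverse(d, near, far) {
  const dn = d / 65535; // normalise to [0, 1]
  return (far * near) / (far - dn * (far - near));
}

// With d = 10000, near = 1, far = 5:
// decodeRangeLinear(10000, 1, 5)  ≈ 1.61 m
// decodeRangeInverse(10000, 1, 5) ≈ 1.14 m
```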

On the other hand, Chromium keeps the camera video in a texture, and in the same way it would keep the depth values in a texture. 32-bit float textures are supported only on extremely modern GPUs; IMO a 32-bit float texture is overkill.

huningxin commented 9 years ago

IMO, with Uint16 in mm units (or Float32 in m units), we won't need to support an encoding format such as RangeLinear or RangeInverse. The Uint16 depth value represents the real distance, e.g. 1 means 1 mm and 65535 means 65535 mm. It is straightforward for web developers: just use the value without any calculations. The only concern is whether the range and accuracy are too limited: will 65.535 m be too small, or mm units too coarse, for some use cases with new depth cameras in the future?

I propose to keep near and far, as we may need to support web apps encoding the depth values into other formats. Just as in https://developers.google.com/depthmap-metadata/encoding, if a web developer wants to encode the depth metadata into XMP properties, they need the near and far values: they would write JavaScript code to implement either the RangeLinear or the RangeInverse encoding mechanism.
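
As an illustrative sketch (the helper name is mine, and it assumes the page can read the depth map in millimetres and has chosen near/far in metres):

```js
// Re-encode a metric depth value (mm) into the normalised
// RangeInverse representation used by the XMP depth-map metadata.
function encodeRangeInverse(depthMm, near, far) {
  const depth = depthMm / 1000; // metres
  // Inverse of: depth = (far * near) / (far - dn * (far - near))
  const dn = (far * (depth - near)) / (depth * (far - near));
  return Math.round(dn * 65535); // 16-bit code, 0..65535
}
```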

ds-hwang commented 9 years ago

I trust your judgement. However, I have to say there are drawbacks to a value with a unit.

Let's assume near is 0.1 m and far is 5 m. The code ranges (0, 100) and (5000, 65535) are useless in that case. That's why the Tango encoding format uses a scale value instead of a real value.

Some depth sensor in the future may be able to detect a >65 m range, and some applications will want more precision than 1 mm.

If the value carries a unit, that looks inflexible.
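
A quick back-of-the-envelope comparison makes the trade-off concrete (numbers only; nothing here is proposed API):

```js
// For near = 0.1 m and far = 5 m:
const near = 0.1, far = 5;

// Metric Uint16 in mm: only codes 100..5000 can ever occur,
// so roughly 4900 of the 65536 codes are used, at fixed 1 mm steps.

// Scaled (RangeLinear) encoding: all 65536 codes span [near, far].
const step = (far - near) / 65535; // ≈ 0.000075 m, i.e. ~0.075 mm steps
```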

What is the real output of RealSense and Kinect?

anssiko commented 9 years ago

I updated the PR. Please review and comment. The use case for measureType was unclear, so I dropped it for now. It is also unclear whether all implementations are able to support it.

I'd like to get resolution on the question of whether we should just hardcode the format and unit, or allow flexibility. We should investigate the capabilities of the existing implementations and hardware to make that call.

If we keep format, I'd guess it would make sense to allow a web developer to configure it, right? Hypothetical API:

```webidl
partial dictionary MediaStreamConstraints {
    // ...
    (DOMString or MediaTrackConstraints) depthFormat = "linear";
};
```
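
And a sketch of how a page might use it; the depth member here is as hypothetical as the IDL above:

```js
// Hypothetical usage of the depthFormat constraint sketched above;
// neither `depth` nor `depthFormat` is settled API.
navigator.mediaDevices.getUserMedia({
  depth: true,
  depthFormat: "linear"
}).then(function (stream) {
  // ... hand the depth track to a FrameGrabber from here ...
}).catch(function (e) {
  console.error("getUserMedia failed:", e);
});
```
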
huningxin commented 9 years ago

> What is the real output of RealSense and Kinect?

RealSense SDK:

| Format | Description |
| --- | --- |
| PIXEL_FORMAT_DEPTH | The depth map data in 16-bit unsigned integer. The values indicate the distance from an object to the camera's XY plane, or the Cartesian depth. The value precision is in millimeters. |
| PIXEL_FORMAT_DEPTH_RAW | The depth map data in 16-bit unsigned integer. The value precision is device specific. The application can get the device precision via the QueryDepthUnit function. |
| PIXEL_FORMAT_DEPTH_F32 | The depth map data in 32-bit floating point. The value precision is in millimeters. |

Kinect SDK: https://msdn.microsoft.com/en-us/library/microsoft.kinect.kinect.idepthframe.aspx The data for this frame is stored as 16-bit unsigned integers, where each value represents the distance in millimeters. The maximum depth distance is 8 meters, although reliability starts to degrade at around 4.5 meters. Developers can use the depth frame to build custom tracking algorithms in cases where the IBodyFrame isn’t enough.

Project Tango: https://developers.google.com/project-tango/overview/depth-perception#point_clouds The Project Tango APIs provide a function to get depth data in the form of a point cloud. This format gives (x, y, z) coordinates for as many points in the scene as are possible to calculate. Each dimension is a floating point value recording the position of each point in meters in the coordinate frame of the depth-sensing camera.

> Some depth sensor in the future may be able to detect a >65 m range, and some applications will want more precision than 1 mm.

This is why Float32 with meter units seems promising.

huningxin commented 9 years ago

Point Cloud Library (PCL) is using float in RangeImage (depth map): http://docs.pointclouds.org/trunk/classpcl_1_1_range_image.html

with millimeter units: http://docs.pointclouds.org/trunk/classpcl_1_1_image_grabber_base.html#a32ae91b66b415213ec3b7c29d7e61e49

anssiko commented 9 years ago

@huningxin Thanks for sharing information on the implementations. What is your guesstimate of the performance (and memory) implications of using Float32Array over Uint16Array? I think we should expect frame rates of >=30 Hz. Some benchmark data would help us make an informed decision; @ds-hwang had some concerns, but we don't have benchmark data at hand right now.

If there are performance concerns, one approach worth considering might be to go with the lowest common denominator (e.g. Uint16Array, mm units) first, while allowing future extensions. For example, a new MediaTrackConstraints member could be used to indicate that higher precision is preferred, and the type of the data could be updated to (Uint16Array or Float32Array) while keeping the API backwards compatible. Not optimal for interoperability, but I think that might be a reasonable tradeoff to make.
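
For instance, a page could then branch on the array type behind a hypothetical data attribute (the name is illustrative), assuming the two flavours discussed earlier, Uint16 in mm and Float32 in m:

```js
// Illustrative sketch: accept either array type and always return
// metres, assuming Uint16 means mm and Float32 means m.
function depthInMetres(depthMap, i) {
  const d = depthMap.data[i]; // `data` is a hypothetical attribute
  return depthMap.data instanceof Float32Array ? d : d / 1000;
}
```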

Re units, mm sounds like the best choice regardless of the type.

huningxin commented 9 years ago

@anssiko, thanks for the comments. I agree with you that Uint16Array seems to be closer to the hardware.

Because: in the RealSense SDK, the depth map data is in 16-bit unsigned integers in the PIXEL_FORMAT_DEPTH_RAW format, and the unit is device-defined.

The Kinect SDK uses 16-bit unsigned integers with mm units.

And according to the Tango depth camera implementation in Chromium (https://code.google.com/p/chromium/codesearch#chromium/src/media/base/android/java/src/org/chromium/media/VideoCaptureTango.java&q=tango&sq=package:chromium&l=141):

    // Depth is composed of 16b samples in which only 12b are
    // used.

It is also uint16.
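
If so, a consumer would presumably mask off the unused high bits; a sketch, assuming the upper four bits can simply be discarded:

```js
// Sketch based on the Chromium comment above: only the low 12 bits
// of each 16-bit sample carry depth, assuming the rest can be masked off.
function tangoDepthSample(raw16) {
  return raw16 & 0x0fff;
}
```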

So Uint16Array with mm units looks good to me.

For the units, I suggest we add a long units member to MediaTrackConstraints.

anssiko commented 9 years ago

@huningxin Thanks again for your suggestions. I updated the spec to address your comments, and did some further refactoring. I'd like to merge this after your final review.

huningxin commented 9 years ago

Hi @anssiko, thanks for your efforts and explanations. They make the spec pretty good.

LGTM, with one open comment.

anssiko commented 9 years ago

@huningxin @robman @ds-hwang I'll merge this PR now and craft a mail to the group to get wider feedback. Thanks for your contributions and review!