Closed: anssiko closed this 9 years ago.
This fixes #76, #73, #72, and #66.
looks nice! lgtm
Thanks, @anssiko! It looks nice. I've provided my comments.
@huningxin @robman @ds-hwang Thanks for the review and comments. I've addressed them in an update to this PR.
Some questions on the new proposed DepthMap attributes:

- data: why wouldn't we provide the data in mm units directly? With Uint16Array 0..65535 we could represent dexels in the range of 0..65.535 metres, which is probably enough. No need for a units attribute.
- Providing data in mm units would also allow us to drop format, near, and far, since they are only needed by the implementation to implement the encoding pipeline.
- What is the use case for measureType? The distance measured along the optical axis is longer, so I assume with that information a web developer can get a more precise depth measurement. What other data would the web developer need in order to make use of measureType for real (is what we have defined in Settings enough)?

Hi @anssiko
For your second point, I don't believe we should drop near/far/format. The near and far variables are required to reconstruct depth if you receive a normalised depth value; basically, format defines which algorithm to use.
For your last point, measureType defines whether the length is measured along the optical axis (depth down the X3 axis in the pinhole camera model) or as a ray (along the effective hypotenuse). Again, this tells us which algorithm to use.
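To illustrate why the distinction matters, here is a rough sketch of converting a ray-measured distance to an axial depth under a simple pinhole model. All names below are illustrative and not part of the proposed API; f is the focal length in pixels and (u, v) is the pixel offset from the principal point.

```javascript
// Illustrative only: convert a ray-measured distance to an axial (optical-axis) depth
// under a pinhole camera model. The viewing ray direction is proportional to (u, v, f),
// so the z component of the unit ray is f / sqrt(u*u + v*v + f*f).
function rayToAxialDepth(rayDistance, u, v, f) {
  return rayDistance * f / Math.sqrt(u * u + v * v + f * f);
}
```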
@robman Thanks. What are the key use cases for which it would be preferable to use the raw depth map data over the normalized data? And respectively, what are the key use cases that require normalized data? I'm interested in figuring out how much of the details we can defer to the implementation, and how much we must expose to address the requirements. We should not expose metadata that has no supporting use case, and should minimize the API surface instead, even if exposing the metadata would be cheap.
Do the known implementations support both the measureTypes?
Hi @anssiko, thanks for the update.
> Why wouldn't we provide the data in mm units directly? With Uint16Array 0..65535 we could represent dexels in the range of 0..65.535 metres which is probably enough. No need for units attribute.
It sounds good to me. The web platform can stick to one "native" unit and data format, and implementations figure out how to fit into it. There are two flavors: Uint16Array with mm units, or Float32Array with m units. Which do you prefer?
> This would allow us to drop format, near, and far since they are only needed by the implementation to implement the encoding pipeline.
To encode the depth value, e.g. to implement https://developers.google.com/depthmap-metadata/encoding on the web, web developers need to know near and far.
> What is the use case for measureType?
I agree with @robman: it tells the web developer the depth measurement model, and some algorithms depend on it.
near, far, and measureType can be in MST's Settings.
@ds-hwang pointed out that Uint16 maps better to the GPU when uploading this data to a WebGL texture. I'll let @ds-hwang expand.
@huningxin Yes, using Uint16 with mm units would represent the range of 0..65.535 m as you say.
If we'd stick with a single format (say, range linear) and units (say, mm), what use cases would we miss? At least that'd make the API simpler for the web developer. IOW, what are the key use cases that require the web developer to be able to convert the normalised depth values back?
Uint16 has enough accuracy. Let me explain.
First of all, we need to support the RangeLinear mode and the RangeInverse mode, like https://developers.google.com/depthmap-metadata/encoding.
Users will calculate the real distance from the depth camera output. Let's say the depth value d is 10000 on [0, 65535], near is 1 m, and far is 5 m.
The RangeLinear formula is RealDistance = d_n * (far - near) + near, where d_n = d / 65535, so the real distance is 10000/65535 * (5 m - 1 m) + 1 m ≈ 1.61 m. 65535 steps is accurate enough.
The RangeInverse formula is RealDistance = (far * near) / (far - d_n * (far - near)), so the user can calculate the real distance the same way.
As you see, the depth value is just a scale; it doesn't have a unit. Only near and far have units.
RangeLinear is used for e.g. dance games, and RangeInverse is used for face recognition.
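To make that decoding concrete, here is a minimal JavaScript sketch. The function and parameter names are illustrative, not part of the proposed API; the formulas follow the Google depth map encoding page linked above.

```javascript
// Decode a normalised Uint16 depth value (0..65535) into metres, given near/far in metres.
function decodeDepth(value, near, far, format) {
  var dn = value / 65535;                           // normalise to [0, 1]
  if (format === "linear") {                        // RangeLinear
    return dn * (far - near) + near;
  }
  return (far * near) / (far - dn * (far - near));  // RangeInverse
}

// Example from the discussion: value 10000, near 1 m, far 5 m.
decodeDepth(10000, 1, 5, "linear"); // ≈ 1.61 m
```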
On the other hand, Chromium keeps the camera video in a texture, and in the same way Chromium will keep the depth values in a texture. 32-bit float textures are supported only on very modern GPUs; IMO a 32-bit float texture is overkill.
IMO, with Uint16 with mm units (or Float32 with m units), we won't need to support an encoding format, say RangeLinear and RangeInverse. The Uint16 depth value represents the real distance, e.g. 1 means 1 mm and 65535 means 65535 mm. It is straightforward to web developers: just use the value without any calculations. The only concern is whether the range and accuracy are too limited: will 65.535 m be too small, or will mm units be too coarse, for some use cases with new depth cameras in the future?
I propose to keep near and far, as we may need to support web apps that encode the depth values into other formats. Just like https://developers.google.com/depthmap-metadata/encoding, if a web developer wants to encode the depth metadata into XMP properties, they need the near and far values. They would write JavaScript code to implement either the RangeLinear or the RangeInverse encoding mechanism.
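For the encoding direction, that JavaScript could look roughly like the sketch below, assuming the track delivers raw mm values. The function and parameter names are illustrative only; the formulas are the inverses of the decoding formulas above.

```javascript
// Encode a depth value given in mm into a normalised [0, 1] value, e.g. when writing
// XMP-style depth metadata. near/far are also in mm here. Illustrative only.
function encodeDepth(depthMm, nearMm, farMm, format) {
  if (format === "linear") {                        // RangeLinear
    return (depthMm - nearMm) / (farMm - nearMm);
  }
  // RangeInverse
  return (farMm * (depthMm - nearMm)) / (depthMm * (farMm - nearMm));
}
```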
I trust your judgement. However, I have to say there are drawbacks to a value with a unit.
Let's assume near is 0.1 m and far is 5 m. The ranges (0, 100) and (5000, 65535) are then useless. That's why the Tango encoding format uses a scale value instead of a real value.
Some depth sensors in the future could detect a >65 m range. Some applications may want precision finer than 1 mm.
If the value has a unit, it is not flexible.
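To put rough numbers on that (an illustrative back-of-the-envelope calculation, not from any implementation): with near = 0.1 m and far = 5 m, raw mm values only ever fall in 100..5000, so only about 4900 of the 65536 possible Uint16 codes are used and the resolution is fixed at 1 mm. A scale encoding over the same [near, far] range would instead give steps of (5000 - 100) / 65535 ≈ 0.075 mm.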
What is the real output of RealSense and Kinect?
I updated the PR. Please review and comment. The use case for measureType was unclear, so I dropped it for now. It is also unclear whether all implementations are able to support it.
I'd like to get resolution on whether we should just hardcode format and unit, or allow flexibility. We should investigate the capabilities of the existing implementations and hardware to make that call.
If we keep format, I'd guess it'd make sense to allow a web developer to configure the format, right? Hypothetical API:
partial dictionary MediaStreamConstraints {
  // ...
  (DOMString or MediaTrackConstraints) depthFormat = "linear";
};
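A sketch of how such a constraint might be requested from a page; everything below follows the hypothetical IDL above, and only getUserMedia itself is an existing API.

```javascript
// Hypothetical usage of the depthFormat constraint sketched above.
navigator.mediaDevices.getUserMedia({
  video: true,            // assumption: the depth track arrives alongside a video track
  depthFormat: "linear"   // hypothetical; could also be e.g. "inverse"
}).then(function (stream) {
  // ... grab depth frames from the stream
}).catch(function (error) {
  console.error(error);
});
```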
> What is the real output of RealSense and Kinect?
RealSense SDK:
| Format | Description |
|---|---|
| PIXEL_FORMAT_DEPTH | The depth map data in 16-bit unsigned integers. The values indicate the distance from an object to the camera's XY plane, or the Cartesian depth. The value precision is in millimeters. |
| PIXEL_FORMAT_DEPTH_RAW | The depth map data in 16-bit unsigned integers. The value precision is device specific. The application can get the device precision via the QueryDepthUnit function. |
| PIXEL_FORMAT_DEPTH_F32 | The depth map data in 32-bit floating point. The value precision is in millimeters. |
Kinect SDK: https://msdn.microsoft.com/en-us/library/microsoft.kinect.kinect.idepthframe.aspx The data for this frame is stored as 16-bit unsigned integers, where each value represents the distance in millimeters. The maximum depth distance is 8 meters, although reliability starts to degrade at around 4.5 meters. Developers can use the depth frame to build custom tracking algorithms in cases where the IBodyFrame isn’t enough.
Project Tango: https://developers.google.com/project-tango/overview/depth-perception#point_clouds The Project Tango APIs provide a function to get depth data in the form of a point cloud. This format gives (x, y, z) coordinates for as many points in the scene as are possible to calculate. Each dimension is a floating point value recording the position of each point in meters in the coordinate frame of the depth-sensing camera.
> Some depth sensors in the future could detect a >65 m range. Some applications may want precision finer than 1 mm.
This is why Float32 with meter units seems promising.
The Point Cloud Library (PCL) uses float in RangeImage (depth map): http://docs.pointclouds.org/trunk/classpcl_1_1_range_image.html
with millimeter units: http://docs.pointclouds.org/trunk/classpcl_1_1_image_grabber_base.html#a32ae91b66b415213ec3b7c29d7e61e49
@huningxin Thanks for sharing information on implementations. What is your guesstimate re the performance implications (also memory) of using Float32Array over Uint16Array? I think we should expect frame rates of >=30 Hz. I guess some benchmark data would help to make an informed decision. @ds-hwang had some concerns, but we don't have benchmark data at hand now.
If there are performance concerns, one approach worth considering might be to go with the lowest common denominator (e.g. Uint16Array, mm units) first, while allowing future extensions. For example, a new MediaTrackConstraints member could be used to indicate that higher precision is preferred, and the type of data could be updated to (Uint16Array or Float32Array) while keeping the API backwards compatible. Not optimal considering interoperability, but I think that might be a reasonable tradeoff to make.
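If the type of data were ever widened like that, consuming code could stay agnostic with a check along these lines. This is a sketch assuming a DepthData-like object with a data attribute; the mm-vs-metre convention is the assumption discussed in this thread, not a decided behaviour.

```javascript
// Return the depth at a given index in metres, regardless of the backing array type.
// Assumption: a Uint16Array carries millimetres, a Float32Array would carry metres.
function depthInMetres(depthData, index) {
  var value = depthData.data[index];
  return (depthData.data instanceof Float32Array) ? value : value / 1000;
}
```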
Re units, mm sounds like the best choice regardless of the type.
@anssiko, thanks for the comments. I agree with you that Uint16Array seems to be closer to the hardware, because:

- In the RealSense SDK, the depth map data of the PIXEL_FORMAT_DEPTH_RAW format is in 16-bit unsigned integers, and the units are device defined.
- The Kinect SDK uses 16-bit unsigned integers with mm units.
- The Tango depth camera implementation in Chromium (https://code.google.com/p/chromium/codesearch#chromium/src/media/base/android/java/src/org/chromium/media/VideoCaptureTango.java&q=tango&sq=package:chromium&l=141) notes that "Depth is composed of 16b samples in which only 12b are used", so it is also uint16.

So Uint16Array with mm units looks good to me.
For the units, I suggest we add long units into MediaTrackConstraints.
@huningxin Thanks again for your suggestions. I updated the spec to address your comments, and I also did further refactoring. I'd like to merge this after your final review.
Hi @anssiko, thanks for your efforts and explanation. It makes the spec pretty good.
LGTM, with one open issue.
@huningxin @robman @ds-hwang I'll merge this PR now and craft a mail to the group to get wider feedback. Thanks for your contributions and review!
This PR contains an early proposal with contributions from @huningxin @ds-hwang @robman and @anssiko. IDL is in place to allow people to review the API shape, but the prose around the new interfaces is still missing, to allow us to more easily iterate on the API design based on wider feedback.
HTML preview: https://rawgit.com/anssiko/mediacapture-depth/framegrabber/index.html
New interfaces added in this PR:

- FrameGrabber
- FrameData
- DepthData

The following interfaces and definitions were obsoleted by the new ones, and were removed:

- CanvasImageSource typedef
- ImageData interface (or to be precise, the extensions to it)

Examples:
Editorial: