Define standard for depth-camera and kinect-like devices

barbalberto commented 8 years ago

Working with depth camera and kinect like devices we found out the lack of a clear and shared definition of data type for those sensors.

An internal meeting will be of help on monday 23/11/15 afternoon to define one and a tentative schedule for code refactoring will be set.

This issue will follows the updates on topic and host feedbacks.

barbalberto commented 8 years ago

During the meeting we agreed on the following points:

Wrapper and client devices have to be independent from any external library (NiTE, openNI ...).
Different interfaces will be created, abstracting specific functionalities.
IFrameGrabber interface will be used for extracting RGB images.
- If possible Framegrabber device will be instantiated by depthCamera wrapper in order to reuse an existing device and have access to RGB camera image customization like saturation, brightness and so on (if available from device).
New IDepth interfaces will be created to get depth images.
- IDepth interface must have a way to get RGB and depth camera synchronized; how to achieve this could be either by doing some magic on client side or by asking the server to open a special port sending both together.
- float values will be used to express depth in [meters].
- max and min depth will be set to indicate preferred range of use of device and to convert a float data to int16 or int8 data type. This can be useful to compare HW depth sensors with result of disparity map algorithm or different HW sensors together.
- This device will take care of one RGB + depth camera pair, if multiple RGB cameras are available on the robot, no synchronization between them will be provided by this device.
RGB port has to be compliant with yarpview.
skeleton information will be available through another interface (e.g. something like IHumanTracking); this interface may or may not be implemented by device driver since it is concerned with software computation and usually not implemented directly by the HW.
Eventually a IPointCloud interface can be created.

As first step, the new interface for depth camera will be created and a new pair of Wrapper/Client will be created with the idea that, in the long run, this will be the official YARP depth camera wrapper and all others will be discontinued.

Other interfaces like IPointCloud will be added later on, or a new wrapper/client will be created integrating them.

SKELETON TRACKING

Ideally, the IHumanTracking interface should not be implemented by a device driver if the skeleton identification and tracking are not provided by the HW. Similarly, it should not be done by the wrapper, because it is not the purpose of the wrapper doing any elaboration on the data but just make them available through the network to a remote client.

As long as the identification and tracking is done by software, a separated software module should implement the algorithm based on the data read from the wrapper. This is for code re-usability, to allows many identification libraries to be used on same data and to avoid adding unnecessary dependencies on the wrapper code.

barbalberto commented 8 years ago

About skeleton, one issue is still pending: We should find a YARP definition of what a skeleton is, if possible, to allow code written by different user to share data between them and work together. From code I saw, it looks like the basic definition of a skeleton includes a mix of the following information:

A vector / list of:
- 3d position (x,y,z) referred to "real world". It is not clear which reference frame those coordinate are referred to.
- 2d position (w,h). Pixel coordinate in the image
- orientation
- optional: point name
confidence

Questions to anyone working with human identification and skeleton: what could be a shared deifinition of a skeleton?

@tanismar @Tobias-Fischer @kt10aan

Tobias-Fischer commented 8 years ago

Hi @barbalberto, First of all thanks for taking the initiative for defining a standard for depth cameras in yarp. I think it's long overdue. In our project, we use https://github.com/robotology/kinect-wrapper There, a skeleton is defined by exactly the above mentioned attributes, except of that the confidence is defined per joint rather than a general confidence value. I personally would purely work with the real world coordinates, and provide an interface to retrieve the image coordinates from world coordinates if needed. Furthermore, the standard should probably provide a method to get a list of all possible skeleton nodes, and edges between the nodes. Then they can be properly visualized, and things like angles between the joints can be calculated.

Best, Tobias PS: The point cloud interface would be a huge enhancement to yarp, and is probably the biggest chunk missing compared to ROS when it comes to robot perception. And an alternative to rviz ..

lornat75 commented 8 years ago

IDepth is a bit ambiguous, why not IRGBD or IRgbd?

traversaro commented 8 years ago

A skeleton interface could also be used to interface with mocap systems, perhaps @claudia-lat could be interested in reading about this discussion.

One comment on the discussion: it could be a good idea to avoid saying things are expressed in the "world" or "real world" frame. If you have multiple sensors (IMU, Kinect, Mocap) having multiple "real worlds" could be confusing. : )

Sensor frame could be an alternative, but even that is prone to confusion (for an IMU, its "world" frame is totally different from its "sensor" frame).

drdanz commented 8 years ago

IDepth is a bit ambiguous, why not IRGBD or IRgbd?

What about IDepthMap? I believe that IRGBD is not a good name here, since if I understand correctly, @barbalberto is talking about a depth only interface, not rgbd.

RGB port has to be compliant with yarpview.

IMO also the depth map port should be compatible

Ideally, the IHumanTracking interface should not be implemented by a device driver if the skeleton identification and tracking are not provided by the HW.

What about drivers that implement skeleton for the hand? IHumanTracking does not sound good here.

lornat75 commented 8 years ago

From what i have understood the proposal is for an interface that groups methods to get RGB and Depth (in an atomic call) therefore it is an RGBD sensor. Depth images can come from many sensors that do not have RGB.

barbalberto commented 8 years ago

All names of interfaces are just tentative, so open to discussion. Actually my proposal was not clear enough, hence the misunderstanding, sorry... I'll try to explain better.

My idea at first was to have interface just for depth, so that it could be used also for other kind of sensors like lasers (now hokuyo uses genericSensor). Then a rgbd device will simply implement both rgb (current framegrabber) + depth.

Problem: In this case there will be no way for the client to get the pair rgb+depth images synchronized because they are acquired through 2 different interfaces.

Since synchronization seems to be a good plus, then we thought that the depth interface would be a fitting place for adding a method to get both of them; so the depth interface will have both getDepth() and getRGBD() (like @lornat75 says), which is not so clean and not so reusable anymore for other sensors like laser.

The alternative would be to have yet another interface like iDepth and iRGBD, so that iDepth will have olny detDepth(), while iRGBD will inherit from rgb and iDepth and add this getRGBD() method.

This will create more granularity allowing each devide to reflect exactly what it does, on the other hand it'll create a pollution of interfaces with just one or two methods each, which I don't like too much (but I admitt it is more clear and probably avoid confusion).

In any case, since rgb image and depth image will be broadcasted though 2 different ports, synch issues are a problem of the client which have to check somehow if images are in synch or not.

IMO also the depth map port should be compatible

yes I agree, the point here is that depth images will be in float, is the yarpview able to plot a float image? The idea of adding a max/min range was to rescale data into a integer number also for this purpose. If this is the case, who shall carry out this convertion, the wrapper, the yarpview or a carrier?

What about drivers that implement skeleton for the hand? IHumanTracking does not sound good here.

Why not? the hand is still a piece of human body :-P The idea was to differentiate between tracking of body and objects. If the definition of skeleton is good enough there will be no ambiguity while receiving data.

drdanz commented 8 years ago

I think we should split data representation from the device, and add the required classes to yarp::sig. Depending on what we get, we should think about extending the yarp::dev::FrameGrabber, or creating a new interface (yarp::dev::DepthGrabber? yarp::dev::PointCloudGrabber? yarp::dev::RGBDGrabber? We can think about this later). What are the ways to represent 3D data? Maybe I'm totally wrong here, since I've never worked with depth cameras, but I suppose something like this is required (together with a set of functions to switch from one format to another)

color only -> grid (x, y) -> rgb : Image
depth sensor only -> grid (x, y) -> d : DepthMap
depth sensor + color -> grid (x, y) -> rgbd : RGBD
unstructured list of points: PointCloud (xyz, xyzrgb).

An important thing (and that I believe it is currently missing in yarp::sig::Image) is that each of them should have a viewpoint associated (translation + rotation). As an alternative, we could use the envelope to transmit this information, like we do for the oculus, but we should spend some more time to define a class that contains both the timestamp and the viewpoint and that can be read as a timestamp in order not to break compatibility.

yarp::sig::ImageOf should be easy to extend to support both DepthMap and RGBD. I don't see a good reason here for forcing float data at data representation level. Also RGB automatically forces the format for the color information, maybe ColorDepthMap is a more appropriate name.

The yarp::sig::PointCloud class should map to pcd files and pcl::PointCloud classes, see http://www.pointclouds.org/documentation/tutorials/pcd_file_format.php and http://docs.pointclouds.org/trunk/classpcl_1_1_point_cloud.html . Probably ImageOf can still be used here with a meaning similar to PCL, i.e. height = 1 => unorganized point cloud

tanismar commented 8 years ago

Hi! I'm not so knowledgeable at the low level on how to define the interfaces to properly and compatibly read depth data from different devices, but I've done some work to ease working with pointclouds on YARP: On the one hand, there is a module initially developed by Ugo and further enhanced by myself which reads data from color + depth images (rgb+d), and given a desired crop, sends a list of xyzrgb points, and optionally saves it also as ply (for pcl) or .off (for meshlab). This module does NOT depend on pcl library, and so the output is a bottle of bottles, where each sub-bottle is made up of the 6 values (xyzrgb) as doubles. The module might require some extra work, but the working can be found here: https://github.com/tanismar/obj3Drec

On the other hand, I did a small library to transform between those bottles and the pcl format, and which functions might be useful for the PointCloudGrabber (or whatever the final name is. The library is here: https://github.com/tanismar/objects3DModeler/tree/master/libYarpCloud

jgvictores commented 8 years ago

1) There are new actors in the field we shouldn't forget about: Intel Realsense F200/SR300/R200, with recent more official linux/osx support (apart from previously existing Windows SDK): https://software.intel.com/en-us/blogs/2016/01/26/realsense-linux-osx-drivers. I still do not physically have this type of device, but expect to have one soon. I'd issue the driver as a YARP enhancement once the OpenNI2 or similar drivers are more or less stable.

2) There is extra information that must be transmitted:

Regarding if the image is mirrored (see #683 and #690)
Regarding registration (whether if rgb/depth sensor misalignment has been dealt with).
Regarding viewpoint (any kind of 3d pose representation).

The envelope solution proposed by @drdanz seems like a good idea; it's a matter of taking the right design decisions to not induce too much overhead. Am I missing any other information that may prove valuable??

3) Regarding overhead/efficiency, I'd vote for vector types (as opposed to bottles as @tanismar, which were a good first approximation). cc @drdanz, @lornat75, @pattacini could we have your opinion??

lornat75 commented 8 years ago

I agree that vectors are better than bottles. Assuming vectors an store all the required information. We can also come up with a specific type and write it as an (efficient) portable.

barbalberto commented 8 years ago

@jgvictores

1) Good to know: better to complete the migration before buying one of that kind, so if some work needs to be done for a 'custom' driver it'll start already with the right foot :smile: 2) How to insert addictional information in the depth/rgb image is open to discussion. We could either send a data type pairOf<image, something> or use a different port for extra data. If we start using server/client the addictional port will be transparent, but synch issue will increase. 3) When you say vector type instead of bottles are you talking about depth/rgb images or point clouds?

barbalberto commented 7 years ago

New version of interface is defined in the IRGBD_interface_v2 branch

drdanz commented 7 years ago

@barbalberto #974 was merged. Can we close this now?

barbalberto commented 7 years ago

Yes :+1:

robotology / yarp

Define standard for depth-camera and kinect-like devices #643

SKELETON TRACKING