microsoft / Azure-Kinect-Sensor-SDK

A cross platform (Linux and Windows) user mode SDK to read data from your Azure Kinect device.
https://Azure.com/Kinect
MIT License

How to merge two or more body tracking skeletons into one? #1460

Open knat opened 3 years ago

knat commented 3 years ago

a) Given the code below, how do I merge subordinateSkeleton into masterSkeleton based on the joint confidence levels?

k4abt_tracker_t masterTracker = ...;      // tracker for the master Kinect
k4abt_tracker_t subordinateTracker = ...; // tracker for the subordinate Kinect

while (true) {
    k4abt_frame_t masterFrame;
    if (K4A_WAIT_RESULT_SUCCEEDED == k4abt_tracker_pop_result(masterTracker, &masterFrame, 0)) {
        k4abt_frame_t subordinateFrame;
        if (K4A_WAIT_RESULT_SUCCEEDED == k4abt_tracker_pop_result(subordinateTracker, &subordinateFrame, 0)) {
            if (k4abt_frame_get_num_bodies(masterFrame) > 0 && k4abt_frame_get_num_bodies(subordinateFrame) > 0) {
                k4abt_skeleton_t masterSkeleton, subordinateSkeleton;
                // simply assume there is only one person
                k4abt_frame_get_body_skeleton(masterFrame, 0, &masterSkeleton);
                k4abt_frame_get_body_skeleton(subordinateFrame, 0, &subordinateSkeleton);
                // todo: how to merge subordinateSkeleton into masterSkeleton based on joint confidence level?
                // if a subordinate joint's confidence level > the master's, perform a coordinate
                // transform and replace the master's joint
            }
            k4abt_frame_release(subordinateFrame); // release each frame once its skeletons have been read
        }
        k4abt_frame_release(masterFrame);
    }
}

b) How about three Kinects? A working sample would be welcome, thank you.

diablodale commented 3 years ago

I might approach this in several ways. The core challenge I see first is a shared coordinate system. If you want a "merged" skeleton, then the coordinate system must be shared: there needs to exist a transformation that maps the local coordinate system of each Kinect into a coordinate system shared by all the Kinects.
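To make that concrete, here is a minimal sketch of applying such a transformation to a single joint position, assuming the subordinate-to-master extrinsics are already known as a 3x3 rotation plus a translation in millimeters. The rigid_transform_t type and transform_point helper are illustrative, not part of the SDK.

#include <k4a/k4a.h>

// Hypothetical extrinsics from the subordinate camera's space into the shared
// (master) space, e.g. recovered from a chessboard calibration.
typedef struct {
    float rotation[9];     // row-major 3x3 rotation matrix
    float translation[3];  // translation in millimeters (k4abt joint positions are in mm)
} rigid_transform_t;

// Map a point from the subordinate camera's coordinate system into the shared one.
static k4a_float3_t transform_point(const rigid_transform_t *T, k4a_float3_t p)
{
    k4a_float3_t out;
    out.xyz.x = T->rotation[0] * p.xyz.x + T->rotation[1] * p.xyz.y + T->rotation[2] * p.xyz.z + T->translation[0];
    out.xyz.y = T->rotation[3] * p.xyz.x + T->rotation[4] * p.xyz.y + T->rotation[5] * p.xyz.z + T->translation[1];
    out.xyz.z = T->rotation[6] * p.xyz.x + T->rotation[7] * p.xyz.y + T->rotation[8] * p.xyz.z + T->translation[2];
    return out;
}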

The green screen example (https://github.com/microsoft/Azure-Kinect-Sensor-SDK/tree/develop/examples/green_screen) has somewhat related code, and it describes how to calibrate the Kinects to create the coordinate-system transformations. It isn't exactly what you need; treat it as a source of learning and inspiration.

To date, the Azure Body Tracking SDK does not accept raw data from more than one sensor directly into its AI/DNN stage. Therefore, the stage at which you can merge skeletons is either: 1) before the capture is given to the Body Tracking API, or 2) after the x,y,z joints are output.

Option 1: it is possible to generate your own capture and pass it to the Body Tracking API. Technically, you could receive all the raw data from the multiple sensors, transform it into a shared coordinate system, add/remove/change data points in that raw data, create a merged "better" capture, and send that custom capture into the Body Tracking API. This feels somewhat in the topic area of multi-camera SLAM. I suspect this is high difficulty and high compute cost, with unknown benefit. A rough sketch of the mechanics follows.
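If you try option 1, handing a hand-built capture to the tracker would look roughly like the sketch below. The merged depth/IR buffers (merged_depth, merged_ir) and their dimensions are placeholders for whatever your fusion step produces; they are assumptions, not something the SDK provides, and the calibration used to create the tracker must describe the virtual camera those buffers are expressed in.

#include <k4a/k4a.h>
#include <k4abt.h>

// Assumed inputs: W x H uint16_t buffers produced by re-projecting all sensors into one virtual camera.
int W = 640, H = 576;
uint16_t *merged_depth = ...;  // placeholder: fused depth buffer
uint16_t *merged_ir = ...;     // placeholder: fused IR buffer
k4abt_tracker_t tracker = ...; // created with the virtual camera's calibration

k4a_image_t depth_image = NULL, ir_image = NULL;
k4a_capture_t merged_capture = NULL;

k4a_image_create_from_buffer(K4A_IMAGE_FORMAT_DEPTH16, W, H, W * (int)sizeof(uint16_t),
                             (uint8_t *)merged_depth, (size_t)W * H * sizeof(uint16_t),
                             NULL, NULL, &depth_image);
k4a_image_create_from_buffer(K4A_IMAGE_FORMAT_IR16, W, H, W * (int)sizeof(uint16_t),
                             (uint8_t *)merged_ir, (size_t)W * H * sizeof(uint16_t),
                             NULL, NULL, &ir_image);

k4a_capture_create(&merged_capture);
k4a_capture_set_depth_image(merged_capture, depth_image); // body tracking needs a depth image...
k4a_capture_set_ir_image(merged_capture, ir_image);       // ...and a matching IR image

// Feed the synthetic capture to the tracker exactly like one from k4a_device_get_capture().
k4abt_tracker_enqueue_capture(tracker, merged_capture, K4A_WAIT_INFINITE);

k4a_capture_release(merged_capture);
k4a_image_release(depth_image);
k4a_image_release(ir_image);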

Option 2: do the same calibration to obtain a shared coordinate system, and use the typical single-Kinect approach to feed captures into the Body Tracking API. When each Kinect outputs the skeletons it sees, transform all the joint coordinates into the shared coordinate system. Now you have joints/points that align (or not). You can use whatever math, logic, DNN, etc. you want or invent to choose which "left shoulder" point is the most accurate. It is a small set of points: only Kinects × bodies × 32 joints. It could be as simple as averaging the x,y,z values for confidence=high, with fallback logic for lower confidence. This is much easier than option 1 and has a lower computation cost, with unknown benefit. A minimal sketch follows.
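As a concrete illustration of option 2, a per-joint merge could be as small as the sketch below. It reuses the hypothetical sub_to_master transform and transform_point helper from the earlier sketch, and the selection policy (prefer the higher-confidence joint, average when both are equally confident) is just one possible choice, not something prescribed by the SDK.

// Merge subordinateSkeleton into masterSkeleton joint by joint.
rigid_transform_t sub_to_master = ...; // hypothetical subordinate-to-master extrinsics

for (int j = 0; j < (int)K4ABT_JOINT_COUNT; j++)
{
    k4abt_joint_t *m = &masterSkeleton.joints[j];
    const k4abt_joint_t *s = &subordinateSkeleton.joints[j];

    // Bring the subordinate joint into the shared (master) coordinate system first.
    k4a_float3_t s_pos = transform_point(&sub_to_master, s->position);

    if (s->confidence_level > m->confidence_level)
    {
        // The subordinate camera saw this joint better: take its estimate.
        m->position = s_pos;
        m->confidence_level = s->confidence_level;
    }
    else if (s->confidence_level == m->confidence_level &&
             s->confidence_level >= K4ABT_JOINT_CONFIDENCE_MEDIUM)
    {
        // Both cameras are reasonably confident: average to reduce noise.
        m->position.xyz.x = 0.5f * (m->position.xyz.x + s_pos.xyz.x);
        m->position.xyz.y = 0.5f * (m->position.xyz.y + s_pos.xyz.y);
        m->position.xyz.z = 0.5f * (m->position.xyz.z + s_pos.xyz.z);
    }
    // Joint orientations need their own policy (e.g. keep the master's, or slerp between them).
}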

Chris45215 commented 3 years ago

I want to add a precaution about putting body tracking captures from multiple sensors into the same body tracking API instance: it will create unexpected latency and desynchronization issues, no matter how good your graphics card is. To do body tracking with multiple cameras in one program, create a separate instance of the body tracking class/object for each sensor your system uses.

This doesn't create much additional overhead. It just uses a bit more RAM, and graphics cards have much more RAM than body tracking needs.
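A minimal sketch of that per-device pattern, assuming both devices are already opened and config holds the shared device configuration (the variable names are illustrative):

#include <k4a/k4a.h>
#include <k4abt.h>

// One tracker per device, each created from that device's own calibration.
k4a_calibration_t master_calibration, subordinate_calibration;
k4a_device_get_calibration(master_device, config.depth_mode, config.color_resolution, &master_calibration);
k4a_device_get_calibration(subordinate_device, config.depth_mode, config.color_resolution, &subordinate_calibration);

k4abt_tracker_t master_tracker = NULL, subordinate_tracker = NULL;
k4abt_tracker_configuration_t tracker_config = K4ABT_TRACKER_CONFIG_DEFAULT;
k4abt_tracker_create(&master_calibration, tracker_config, &master_tracker);
k4abt_tracker_create(&subordinate_calibration, tracker_config, &subordinate_tracker);

// Captures from master_device go only into master_tracker, captures from subordinate_device
// only into subordinate_tracker; the resulting skeletons are merged afterwards.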

szekelyisz commented 2 years ago

It would be very nice to see option 1 implemented in the SDK itself rather than essentially having to hack the captures. I would guess it would give more accurate and reliable tracking than option 2. I'm thinking of an API call that takes an array of captures and the corresponding transformations, and the SDK does the rest. The body detection algorithm would be pretty much the same but would work on a bigger data set (but that's just my speculation). This would make it possible to handle an arbitrarily large area (limited by the maximum number of participating devices) as one space, rather than putting the burden of stitching the captures together on the user. I understand that synchronization could be an issue, but that's probably something the user should handle. Is this likely to happen anytime in the future?

crystal-butler commented 2 years ago

I'm working with exactly the 3-Kinect example given by @knat, and have been using solution 2 as described by @diablodale. One of the main reasons for not attempting solution 1 has been that synchronizing multiple Kinects requires a time offset between devices.

In my experience, this offset can vary quite a bit across captures. While the offset value is generally small, it's not small enough to guarantee that bodies in motion won't be in slightly different positions across a set of synchronized captures. I've never tested combining multiple point clouds and running k4abt_simple_3d_viewer or other body tracking against the result, but I wonder how well the body tracking model would identify bodies if a) the point clouds aren't aligned perfectly (I've tried aligning them manually and with ICP, and so far haven't had highly accurate results when bodies are not the same distance from all 3 sensors), and b) spatial differences in the bodies due to the sync offset create blurry, uncertain body boundaries.
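For reference, the offset being discussed here is configured through the wired-sync fields of k4a_device_configuration_t. Below is a minimal sketch, assuming a master and one subordinate device are already opened; the 160 µs spacing is the interference-avoidance value recommended in Microsoft's multi-device synchronization docs (please double-check against the current documentation for your setup).

#include <k4a/k4a.h>

// Master device: drives the sync chain.
k4a_device_configuration_t master_config = K4A_DEVICE_CONFIG_INIT_DISABLE_ALL;
master_config.depth_mode = K4A_DEPTH_MODE_NFOV_UNBINNED;
master_config.camera_fps = K4A_FRAMES_PER_SECOND_30;
master_config.wired_sync_mode = K4A_WIRED_SYNC_MODE_MASTER;

// Subordinate device: fires its depth capture 160 us after the master so the two
// time-of-flight lasers do not interfere. This is also the residual offset you see
// between "synchronized" captures when merging results afterwards.
k4a_device_configuration_t sub_config = K4A_DEVICE_CONFIG_INIT_DISABLE_ALL;
sub_config.depth_mode = K4A_DEPTH_MODE_NFOV_UNBINNED;
sub_config.camera_fps = K4A_FRAMES_PER_SECOND_30;
sub_config.wired_sync_mode = K4A_WIRED_SYNC_MODE_SUBORDINATE;
sub_config.subordinate_delay_off_master_usec = 160;

// Subordinates are started first, the master last, so no trigger is missed.
k4a_device_start_cameras(subordinate_device, &sub_config);
k4a_device_start_cameras(master_device, &master_config);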

Alessandro-Minerba commented 2 years ago

Hi everyone, I'm working on the same issue. I'm using 3 Azure Kinects and have found the transformation matrices between the two subordinates and the master device. Then I tried to: 1) create a merged point cloud to give to the Body Tracking SDK by creating a new Capture object, but that was not possible; 2) transform the point cloud image of one camera point-by-point using the transformation matrices and then convert it back to a depth image or a point cloud image, but it seems neither can then be used to create a new Capture object to give to the tracker. I've been stuck on this problem for weeks without coming up with a solution.

P.S. I don't think roto-translating every single joint resulting from a single camera's tracking would be a good option, because if one camera estimates a wrong body pose, it will damage all the other estimates in the merging process.