Hey @marchingcubes, sorry for the delay replying. Really happy to hear that you were brave enough to try this out! Of course the first thing @cwiiis and I did after pushing our work out the door was to take some Xmas + New Year holiday ;-p
Both of us have been developing and testing with 14" Razer Blade 2017 laptops (Kaby Lake i7-7700HQ at ~2.8GHz) and running with interactive performance (approx 15fps, I think). Although you have an older-gen CPU, it's clocked higher than the systems we're using. I do have a Haswell laptop here, so I could give that a try too.
Certainly we know we have lots of work left to do on the optimization side of things, and @cwiiis has been experimenting quite a bit with the segmentation approach, which is currently what we're creating the voxel grid for. Anything taking 600ms suggests something has gone badly wrong, though; that's not the performance we'd expect even at this early stage :/
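(For context, the voxel grid step is PCL's standard downsampling filter, conceptually something like the sketch below; the point type and the 1cm leaf size there are only illustrative, not necessarily what we actually use.)

#include <pcl/filters/voxel_grid.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

// Illustrative voxel-grid downsample: bucket points into 1cm cubes and keep
// one centroid per occupied cube. Leaf size and point type are examples only.
static pcl::PointCloud<pcl::PointXYZ>::Ptr
downsample(const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &cloud)
{
    pcl::PointCloud<pcl::PointXYZ>::Ptr out(new pcl::PointCloud<pcl::PointXYZ>);
    pcl::VoxelGrid<pcl::PointXYZ> grid;
    grid.setInputCloud(cloud);
    grid.setLeafSize(0.01f, 0.01f, 0.01f);
    grid.filter(*out);
    return out;
}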
The first thing to check is whether you're running a release build (I guess you are, though). Even if you built with --buildtype=release, it could also be worth building with CFLAGS="-march=native -mtune=native" (and the same for CXXFLAGS) if you didn't.
Something else that might help ensure we're comparing the same thing would be to pass -Duse_system_libs=false when you run meson to configure the build. Building with that option will take longer, since it also compiles most of our dependencies (including libpcl, which takes a while), but if it's still taking 600ms to generate the voxel grid then we'll at least know we're talking about the same version of libpcl that we've been testing with.
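Concretely, I mean a configure/build sequence roughly like this (the "build" directory name is just an example here):

CFLAGS="-march=native -mtune=native" \
CXXFLAGS="-march=native -mtune=native" \
meson build --buildtype=release -Duse_system_libs=false
ninja -C build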
For reference, these are comparable timings I just captured on my laptop:
ctx: Requesting new frame for skeletal tracking
ctx: Waiting for new frame to start tracking
ctx: Started resizing frame
ctx: Frame scaled to 1/2 size on CPU in 311.267us
ctx: Starting tracking iteration (473480931)
ctx: Projection (307200 points) took (2.217ms)
ctx: Reprojection of colour information took (2.024ms)
ctx: Cloud has 67369 points after depth filter (1.996ms)
ctx: Cloud has 1835 points after voxel grid (3.890ms)
ctx: Cloud possible floor subset has 1507 points
ctx: Looking for ground plane took (174.989us)
ctx: Ground plane has 892 points (66.497us)
ctx: Euclidean clustering took (1.303ms)
ctx: People detection took 3.719ms
ctx: Person reconstruction (943 points) took (12.160ms)
ctx: depth intrinsics: w=640, h=480
ctx: People Detector: re-projected point cloud in 301.157us, w=172, h=224
ctx: People Detector: starting label inference: n_trees=3, w=172, h=224, data=0x7f1420717f20
ctx: People Detector: ran label probability inference in 8.696ms
ctx: People Detector: calculated pixel weights in 912.440us
ctx: People Detector: inferred joints in 1.548ms
ctx: Created RGB label map in 1.575ms
ctx: Reprojected RGB label map in 557.319us
ctx: Finished skeletal tracking
Certainly curious to get to the bottom of this, but apologies if there might still be a little delay in replies for a few weeks.
It is possible I am running a debug build; this is the only package I have ever built that used meson, so I am just using whatever it defaults to. I just thought I would clarify what the performance expectations were before I spent any time trying to fix those issues.
I will try building with those suggested fixes, and post updated timings. I doubt the CPU itself is underpowered.
FYI I have been working on some VR stuff with the Kinect myself, and this kind of positional tracking would enable some very cool avatar-related functionality.
This is my experimental Vulkan engine running with OpenVR, displaying kinect-captured pointclouds streamed over the network: https://twitter.com/beVR_nz/status/937465021960880130
Scroll back/forward through my twitter devlog to see some other features.
The work is not currently open source, it's just my personal experimentation, but it probably will be once I get the renderer running on AMD (it currently crashes RADV badly) and a core set of features implemented.
Just a quick comment on accuracy too. The models published are really our first set of trained decision trees that we felt worked 'well enough' for us to switch our focus over to the code for inference and segmentation, and to getting this going on Android + Unity.
There are some low-hanging improvements that should help the training side of things. For example, we should implement some form of boosting to train our second and third trees by considering the training samples that aren't handled well by previous trees (see the sketch below). We can likely improve our Blender rendering script so it doesn't bias quite so much toward certain poses like walking forward. For reference, the renderer for training data is mostly limited to front-ish facing poses, so it won't do well for side/back poses at all for now. The camera positioning for rendering our training data currently ensures that limbs are very rarely cut off by the edge of the frame, so the decisions will currently be terrible in those conditions. @cwiiis could comment more on the segmentation side of things, but I believe we probably don't do too well at the moment if we can't see some of the floor.
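To make the boosting idea a bit more concrete, here's a rough sketch of one possible sample-weighting scheme. All of the names here are hypothetical and this isn't how our training tool is actually structured; error_rate would be something like the fraction of mislabelled pixels under the trees trained so far:

#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Boosting-style sampling sketch (hypothetical): weight each training image by
// how badly the forest trained so far labels it, then draw the training set
// for the next tree from that weighted distribution.
template <typename Image>
std::vector<const Image *>
sample_for_next_tree(const std::vector<Image> &images,
                     const std::function<double(const Image &)> &error_rate,
                     size_t n_samples, std::mt19937 &rng)
{
    std::vector<double> weights;
    weights.reserve(images.size());
    for (const Image &img : images)
        weights.push_back(0.05 + error_rate(img)); // small floor so no image is dropped entirely

    std::discrete_distribution<size_t> pick(weights.begin(), weights.end());

    std::vector<const Image *> subset;
    subset.reserve(n_samples);
    for (size_t i = 0; i < n_samples; i++)
        subset.push_back(&images[pick(rng)]);
    return subset;
}

The idea is simply that images the current forest handles badly get drawn more often when building the next tree.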
So generally I'd warn against having high expectations for tracking accuracy at this stage, but we're optimistic we'll see improvements when we shift our focus back to training.
Of course if anyone else wants to take a stab at some of the training improvements we can think of, or try out other ideas we'd be very happy to help!
Thanks for the pointer to your OpenVR work @marchingcubes, looks cool.
I really got into VR some years back with similar use cases in mind too. At the time of the DK2 I was looking down and wishing I could see my hands and keyboard as re-projected point clouds. Ended up working a bit on RealSense drivers so I could use those cameras on Linux, but didn't stick to it long enough.
I'm currently spending some of my 'holiday' reverse engineering the Lenovo Explorer HMD to hopefully get depth data from the stereo cameras (essentially also providing a Kinect-like point cloud), which might end up being of some interest to you too :) Accessing the video stream for the IR cameras (which are stitched together into a single frame, split horizontally for the left/right eye) was surprisingly easy to start with, but there's a lot of image processing to implement before that's really usable for tracking, plus lots of USB protocol to unearth.
Cool, yes that is definitely of interest. In addition to the Vive/OpenVR I have been doing a few bits with the PSVR on Linux, as it is cheap and the ergonomics are very nice. It works surprisingly well with IMU-only tracking (no solution for optical tracking yet) using OpenHMD, and they are working on the Windows Mixed Reality stuff too, though I am not sure how fast the work is progressing. For example, see:
https://github.com/OpenHMD/OpenHMD/tree/acer-ah100/src/drv_acer
I believe only a tiny subset of the USB stuff is working, but it might be helpful for bootstrapping your own work on this?
Ah, I hadn't noticed anyone looking at WMR headsets yet, so thanks for the pointer. As you say, it doesn't look like they've got very far yet, but I can get in touch with them. I'm working with a different headset, but I wonder how standard the various WMR headsets will turn out to be.
I rebuilt the app with the suggested release-mode parameters and things seem much better now. It's still not tracking my body solidly, but I have spent very little time with it beyond just verifying that it runs and that the timings are better.
I'm going to make a corresponding rig for the skeletal animation setup in my engine and try getting some glimpse-captured poses into the engine as a first step.
ctx: Requesting new frame for skeletal tracking
ctx: Waiting for new frame to start tracking
ctx: Started resizing frame
ctx: Frame scaled to 1/2 size on CPU in 299.700us
ctx: Starting tracking iteration (1038075162)
ctx: Projection (307200 points) took (2.702ms)
ctx: Reprojection of colour information took (3.088ms)
ctx: Cloud has 162989 points after depth filter (5.651ms)
ctx: Cloud has 8518 points after voxel grid (8.860ms)
ctx: Cloud possible floor subset has 2513 points
ctx: Looking for ground plane took (365.431us)
ctx: Euclidean clustering took (14.372ms)
ctx: People detection took 13.644ms
ctx: Person reconstruction (8518 points) took (3.546ms)
ctx: depth intrinsics: w=640, h=480
ctx: People Detector: re-projected point cloud in 86.947us, w=172, h=224
ctx: People Detector: starting label inference: n_trees=3, w=172, h=224, data=0x7fb724cbe910
ctx: People Detector: ran label probability inference in 5.856ms
ctx: People Detector: calculated pixel weights in 779.696us
ctx: People Detector: inferred joints in 1.444ms
ctx: Created RGB label map in 1.418ms
ctx: Reprojected RGB label map in 90.049us
ctx: Finished skeletal tracking
Good to hear. Those timings look a lot better, though we can still see we're not yet where we want to be.
Btw, it'd be good for us if you wouldn't mind sharing a fakenect recording of the situation where the tracking doesn't seem solid yet. We don't currently have a very diverse set of real-world test data.
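If it helps, recording and replaying with the libfreenect tools should go roughly like this, if I'm remembering them correctly (worth double-checking against the libfreenect docs; the paths are just examples):

# capture a session from the Kinect into a directory of dumped frames
fakenect-record /path/to/recording

# replay that directory into the viewer via the fakenect wrapper
fakenect /path/to/recording ./glimpse_viewer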
Regarding the skeleton, I thought I'd mention that we use the makehuman 'Game Engine' rig for rendering our training data, so some of the joints should hopefully match up reasonably well with that one. Some of these notes on how we import makehuman models into Blender might be of some use - or maybe you could take a look at our glimpse-training.blend file, where you'll also find that rig. It might be better to roll your own rig and find a way to retarget; I'm really not sure at this point, but it's maybe worth considering.
I think I'll close this bug soon, considering that the most glaring 600ms performance issue was at least resolved; otherwise it'll be a bit too open-ended for us to track all perf/accuracy issues here. Please do open more issues as you find problems, though.
We need to set up a Google Group or some such general discussion forum soon, but in the meantime also feel free to DM me on Twitter (@robertbragg) or email me (robert@sixbynine.org) if an issue doesn't seem like a good fit.
Sure thing, please go ahead and close the bug; it's not the right place for general discussion. I have followed you on Twitter, and will create some recordings etc. when I'm next working on this stuff - the Christmas/New Year break has taken me away from my main workstation.
Great to see someone trying this out! I'm currently working on some speed-ups that I hope will enable real-time tracking on more modest hardware (hopefully at least 1.5x realtime on the Razer laptops @rib mentioned above), though proper optimisation is something we've not really concentrated on yet.
Currently, tracking happens on a single thread, all on the CPU. Once we've experimented enough to find the techniques and algorithms that scale, I expect we'll begin working on parallelisation and GPU utilisation. Any fakenect recordings you make would be very useful in helping us determine what does and doesn't work, and helping us establish the limitations of this code, thanks for taking a look!
As agreed, I'll close this bug for now, but don't hesitate to open any issues as you find them, or contact us directly. My e-mail is in the commit logs, and I'm also on Twitter @cwiiis.
I was able to get the glimpse_viewer application to compile and run with the provided training data; however, performance seems very slow, and the system usually cannot identify my body in the image.
Below is some debug output with timings (running on a 3.5GHz i7-4770). I guess the voxel grid step is by far the heaviest, taking 600ms per frame, which seems too slow to be usable for realtime tracking.
Am I doing something wrong, do my parameters need tuning, or is this the level of performance to be expected at this stage of the project?
ctx: Requesting new frame for skeletal tracking
ctx: Waiting for new frame to start tracking
ctx: Started resizing frame
ctx: Frame scaled to 1/2 size on CPU in 4.514ms
ctx: Starting tracking iteration (3802357927)
ctx: Projection (307200 points) took (12.405ms)
ctx: Reprojection of colour information took (12.873ms)
ctx: Cloud has 242371 points after depth filter (32.858ms)
ctx: Cloud has 10348 points after voxel grid (600.392ms)
ctx: Cloud possible floor subset has 4755 points
ctx: Looking for ground plane took (30.296ms)
ctx: Ground plane has 1198 points (1.233ms)
ctx: Euclidean clustering took (69.812ms)
ctx: People detection took 64.398ms
ctx: Person reconstruction (9150 points) took (85.734ms)
ctx: depth intrinsics: w=640, h=480
ctx: People Detector: re-projected point cloud in 1.337ms, w=172, h=224
ctx: People Detector: starting label inference: n_trees=3, w=172, h=224, data=0x7f4f7527a670
ctx: People Detector: ran label probability inference in 17.161ms
ctx: People Detector: calculated pixel weights in 3.262ms
ctx: People Detector: inferred joints in 3.913ms
ctx: Created RGB label map in 3.654ms
ctx: Reprojected RGB label map in 1.077ms
ctx: Finished skeletal tracking