RGB Pointcloud slows down the publishing rate a lot

RoseFlunder commented 5 years ago

Describe the bug When enabling the rgb_point_cloud the performance of the node degrades a lot. Without the colorized pointcloud the node can maintain a steady 30 hz publishing rate when using 30 FPS with the kinect. With the rgb_point_cloud enabled the publishing rate decreases to like 10-12 Hz. When visualizing the point cloud in RViz the slow publishing rate leds to a laggy view. RViz itself maintains high enough FPS displaying the pointcloud but the data comes in too slow. When using the k4aviewer there is no slowdown in colorized point cloud view.

To Reproduce Steps to reproduce the behavior:

Start the node via the driver.launch and use fps:=30, depth_mode:=NFOV_UNBINNED, rgb_point_cloud:=false
Call "rostopic hz points2" in another terminal
Restart the node with rgb_point_cloud:=true
Call "rostopic hz points2" again and compare the rate with step 2

Expected behavior Maintain the 30 FPS rate for publishing the rgb pointcloud on machines that can easily view the colorized point cloud at 30 FPS in the k4aviewer.

Desktop (please complete the following information):

OS: Ubuntu 18.04

skalldri commented 5 years ago

Confirmed this is occurring on my ROS Melodic, Ubuntu 18.04 machine. I suspect the main image processing loop is running too slowly on the CPU, since the K4A Viewer does the RGB point cloud math on the GPU.

I'll investigate ways I could do this faster, potentially using SIMD instructions.

skalldri commented 5 years ago

Confirmed, the image processing loop take between 40-70ms when the RGB point cloud is enabled. I'll look for ways to potentially speed it up.

RoseFlunder commented 5 years ago

@skalldri Is a GPU accelerated version planned? In the usage.md this is stated but there isn't any GPU acceleration in the currently used SDK methods I think?

Using the point cloud functions in this node will provide GPU-accelerated results, but the same quality point clouds can be produced using standard ROS tools.

Today I saw this example in the sensor sdk repository: Fast Pointcloud Example Would this version already be faster even on CPU or is it comparable? I guess this algorithm is implemented on the GPU for the k4aviewer isn't it?

Tomorrow I am planning to use the CPU version of this algorithm in the node to check its performance compared to the current version. If I get it working I will create branch in my fork for it.

skalldri commented 5 years ago

The fast point-cloud example does not produce RGB point clouds, just regular point clouds. It's just a way to produce point clouds quickly without using SSE3.

My documentation in usage.md is incorrect: we use SSE3-accelerated point cloud math, not GPU-accelerated. See:

https://github.com/microsoft/Azure-Kinect-Sensor-SDK/blob/c0800a43fb42200935aa9a8da58b8bd5db3d55c9/src/transformation/rgbz.c#L1210

Yes: K4A Viewer does this all on the GPU. Since K4A viewer is going to be drawing everything to the screen as the final destination (rather than writing it to a ROS message), it makes sense for it to do as much work as possible in a shader. They wrote a shader that accepts a depth image and color image as input, as well as the X-Y tables, and then naively draws the RGB point cloud. That's a very fast operation so it runs super quickly.

Copying the point cloud and RGB frame to the GPU, doing some work, then copying it back off the GPU is going to take time. I'd like to take a stab at improving the speed of the CPU-based version before resorting to

RoseFlunder commented 5 years ago

Thanks for the detailed answer :) I am looking forward to it because thats one of our main blocking points at the moment in our project.

RoseFlunder commented 5 years ago

@skalldri So like I already mentioned in another issue I did some time measuring. In this gist I just pasted the modified part of the rgb point cloud method together with the output. I don't know if ROS Time is very accurate but we can clearly see that the SDKs methods are fine performance wise. Whats really dragging down is the loop that creates the message: https://gist.github.com/RoseFlunder/0cca13211022049ae7abc98ea600a7de

EDIT: Building the package in release mode drastically speeds this loop up. Use: catkin_make -DCMAKE_BUILD_TYPE=Release The loop is about 3 times faster with this (~14ms instead of ~45ms in my example). I added the output for this built also to the gist.

EDIT2: Just did another test what happens when replacing local variables, like BgraPixel and depth_point_3d, with only expressions from the buffer and putting the divide by 1000 part also in the expression. That gives about 1ms on average. So the local variables for better readability aren't a big deal.

skalldri commented 5 years ago

Thanks for digging into this issue RoseFlunder. I suspect the iterator used to fill up the PointCloud2 message is very expensive. There is almost certainly a faster way to copy the data from the SDK structures into the PointCloud2 message.

Unfortunately, I'm on vacation until the 25th so I won't be able to make much progress on this until I get back.

skalldri commented 5 years ago

Going to start having a look at this.

skalldri commented 5 years ago

EDIT:

Never mind, the original code has a simple mistake: not calling pcl_modifier.resize() before trying to modify the point cloud! Using the sped-up branch helps, but still doesn't get us over the finish line. Digging further into Valgrind on the latest code.

ORIGINAL:

Well, Valgrind seems to think that using PointCloud2Iterator is very expensive:

valgrind-iterators

Combined, PointCloud2Iterator::operator++ and PointCloud2Iterator::operator* account for ~20% of the time spent in getRgbPointCloud()! While not confirmed, this makes me think that we could see significant speedups if we do this the old fashioned way.

I'll see if I can build a version of the point cloud functions that doesn't use the iterator.

skalldri commented 5 years ago

It's good to see that this is using memset_avx2_erms: that's probably the compiler optimizing the PointCloud2 alloc.

emmanuel-senft commented 4 years ago

Hi, I am also affected by this bug, and I was wondering if there had been any progress on that side?

anastasiabolotnikova commented 4 years ago

Hello, glad to know that this issue is getting attention and already started to be investigated. Thanks @RoseFlunder @skalldri for sharing your findings!

This is indeed a limitation of the current driver implementation that should be addressed to speed up the loop cycle. Both fillColorPointCloud and fillPointCloud could be much faster if going over the entire point cloud sequentially point by point is somehow avoided.

Three questions that I would like to ask regarding this issue:

Is there no way of converting 3D points coordinates data from K4A SDK structure (e.g. buffer) into sensor_msgs::PointCloud2.data binary blob using memcpy? I suppose that would be the fastest way to deal with it on machines without GPU?
Why new PointCloud2 is being created and resized on every cycle? I suppose it only needs to be created and set to correct size once before the while loop of framePublisherThread?
A parallel implementation of fillColorPointCloud and fillPointCloud for machines with GPU would be great! Especially if there is no way to copy points from SDK structure into ROS message using memcpy. Could you share with us any information regarding the plans of this implementation?

Thank you in advance for the reply!

ooeygui commented 4 years ago

Hi All, We Azure Kinect to address a few performance related issues, as well as Migrating to ROS2 on our backlog, but likely won't be able to start working on it until the summer.

YoshuaNava commented 4 years ago

@skalldri I'm also having issues with the PointCloud2Iterator, although I'm not a ser of Kinect Azure. Do you still plan to optimize it? I would be happy to collaborate and find out better ways to use it, or come up with improvements.

ooeygui commented 4 years ago

Hi @YoshuaNava , This work is still on the backlog.

The code currently does this work linearly in C++ https://github.com/microsoft/Azure_Kinect_ROS_Driver/blob/c8d4cb7f1d43326879f926294fb1aa4f299c159e/src/k4a_ros_device.cpp#L586.

We are looking at Cuda for this codepath when available.

YoshuaNava commented 4 years ago

@ooeygui Thank you for your prompt answer.

I see. I have continued using the iterator. Until now it allowed me to speed up an old raw deserialization method by 3-5x, which is a good gain with considerable low effort.

I'll share any insights found while I work on this.

ibrachahine commented 2 years ago

Hi, I am prototyping for an application requiring the depth topic on ROS and it seems to be very slow at 1Hz. Is there any recent updates on this issue?

Thanks

ooeygui commented 2 years ago

@ibrachahine Thank you for your interest and ask here. We have not invested in fixing this bug, but would accept a pull request. Have you tried workign with parameters on the node to optimize for your hardware and scenario?

microsoft / Azure_Kinect_ROS_Driver

RGB Pointcloud slows down the publishing rate a lot #56