norlab-ulaval / libpointmatcher_ros

A bridge between libpointmatcher and ROS

Hotspots in function rosMsgToPointMatcherCloud(PointCloud2, bool) #2

Open YoshuaNava opened 4 years ago

YoshuaNava commented 4 years ago

Hi, as part of my efforts to benchmark libpointmatcher, I ran a ROS node that employs libpointmatcher_ros/point_cloud (the old version from ethzasl_icp_mapping) to serialize and deserialize point cloud data. The node receives a point cloud message, deserializes it, applies a few filters, and finally publishes the resulting point cloud. I let it run for 100+ seconds.

I found right away that the most expensive method called in my program (even more expensive than a surface-normal data points filter run every iteration) was rosMsgToPointMatcherCloud(sensor_msgs::PointCloud2, bool) from point_cloud.cpp.

I used Intel VTune community edition to find hotspots and Intel Advisor for vectorization advice. In the following lines I describe my search for hotspots and give a short analysis.

Hotspots

CPU

Screenshot_2020-07-20_20-12-38

Memory access

Screenshot_2020-07-20_20-14-37

Memory writing

Screenshot_2020-07-20_20-18-39

Vectorization advice

Screenshot_2020-07-20_20-27-49

Analysis

(The code in this repo might not be the same as the one from ethzasl_icp_mapping. I'll try to update my analysis accordingly.)

I found the following main CPU-time hotspots:

  1. Casting of contiguous values from an array to fill up features/descriptors is done in a cuatri-loop (quadruple nested loop: ROS msg fields -> point cloud height -> point cloud width -> data length); see the sketch below. This takes ~12% of CPU time (line 194).
  2. Pre-filling of empty point cloud containers with "padding" values. This takes ~3% of CPU time (line 92).
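
For reference, the nesting I mean looks roughly like this (a simplified sketch, not the actual code from point_cloud.cpp; only the FLOAT32 branch of the datatype switch is shown):

```cpp
#include <cstdint>
#include <cstring>

#include <sensor_msgs/PointCloud2.h>

// Simplified sketch of the deserialization hot loop: for every message field,
// walk the whole image-ordered cloud and cast each raw byte span into the
// matching entry of the features/descriptors matrix.
void sketchDeserializationLoop(const sensor_msgs::PointCloud2& rosMsg)
{
    for (const auto& field : rosMsg.fields)                       // 1) msg fields
    {
        for (uint32_t row = 0; row < rosMsg.height; ++row)        // 2) cloud height
        {
            for (uint32_t col = 0; col < rosMsg.width; ++col)     // 3) cloud width
            {
                const uint8_t* dataPtr = rosMsg.data.data()
                    + row * rosMsg.row_step + col * rosMsg.point_step + field.offset;
                for (uint32_t k = 0; k < field.count; ++k)        // 4) data length of the field
                {
                    // The real code switches on field.datatype; only FLOAT32 shown here.
                    if (field.datatype == sensor_msgs::PointField::FLOAT32)
                    {
                        float value;
                        std::memcpy(&value, dataPtr + k * sizeof(float), sizeof(float));
                        (void)value; // would be written into the features/descriptors matrix
                    }
                }
            }
        }
    }
}
```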

In terms of memory access, number 1 from the above list is also a strong hotspot. When it comes to memory writing, all paged memory is cleared by the function, and the allocations are neither large nor too numerous (compared to other methods, e.g. ROS TCP).

Intel Advisor recommends optimizing [the "RGB loop"](https://github.com/norlab-ulaval/libpointmatcher_ros/blob/master/src/PointMatcher_ROS.cpp#L113) first, followed by the cuatri-loop described in point 1 of the CPU hotspots and a loop in libnabo.

YoshuaNava commented 4 years ago

Proposals

I will try to optimize 3 of the loops in the deserialization function. In order of resource consumption:

Data reading cuatri-loop

  1. As the datatype of every field is known in advance (thanks to the iterator it), remove the switch case from the inner loop. The data type of the pointers would be chosen based on that a priori knowledge (see the sketch after this list).
  2. Pre-declare the data field length based on the datatype.
  3. Fill the view of the matrices by the stride of the data (row_step).
  4. Move the declaration of fPtr out of the width-loop so that the variable (hopefully) stays in cache.
  5. Change or enforce using the x iterator. It is currently unused.
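
A rough sketch of what points 1-3 could look like using the typed iterators from sensor_msgs/point_cloud2_iterator.h, assuming the field is known to be FLOAT32. fillFeatureRow and featureRow are illustrative names, and features is assumed pre-sized to width*height columns; this still writes along a row, the write order is discussed next:

```cpp
#include <string>

#include <Eigen/Core>
#include <sensor_msgs/PointCloud2.h>
#include <sensor_msgs/point_cloud2_iterator.h>

// Sketch: fill one row of the features matrix from a field whose datatype is
// already known to be FLOAT32, so no switch is needed inside the loop.
void fillFeatureRow(const sensor_msgs::PointCloud2& rosMsg,
                    const std::string& fieldName,
                    Eigen::MatrixXf& features, // assumed pre-sized to (dim, width * height)
                    int featureRow)
{
    sensor_msgs::PointCloud2ConstIterator<float> it(rosMsg, fieldName);
    for (int col = 0; it != it.end(); ++it, ++col)
    {
        features(featureRow, col) = *it;
    }
}
```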

Based on this article, Eigen matrices are stored in column-major order. It could be very productive to write the data in column order, as columns are stored (and loaded) contiguously. A loop of this type would also be easier for the compiler to unroll or vectorize.
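
A minimal illustration of that access pattern (the interleaved xyz buffer is only an assumption for the example):

```cpp
#include <Eigen/Core>

// Eigen's default storage is column-major, so writing one point (= one column)
// at a time touches contiguous memory, whereas writing one field (= one row)
// across all points strides through memory.
void fillColumnMajor(Eigen::MatrixXf& features, // dim x numPoints
                     const float* xyz,          // interleaved x, y, z per point
                     int numPoints)
{
    for (int col = 0; col < numPoints; ++col)        // outer loop over points
    {
        for (int row = 0; row < 3; ++row)            // inner loop over x, y, z
        {
            features(row, col) = xyz[3 * col + row]; // consecutive writes within a column
        }
    }
}
```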

Padding after output container creation

I'm unsure about the current use of padding, so I'm going through the code atm.

RGB filling loop

  1. Currently the x iterator is unused. I would propose to either change it (iterate at point_step) or start using it actively.
  2. Set the value of the vector in column major order.
  3. Remove the conditional from the loop, as it could be creating a branch that prevents vectorization; see the sketch below.
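
A rough sketch of points 2-3, assuming the rgb field follows the usual PCL convention of an "rgb" FLOAT32 field carrying packed 8-bit channels, and that rgbView is a pre-sized 3 x numPoints descriptor view (names are illustrative):

```cpp
#include <cstdint>
#include <cstring>

#include <Eigen/Core>
#include <sensor_msgs/PointCloud2.h>
#include <sensor_msgs/point_cloud2_iterator.h>

// Sketch of a branch-free RGB loop: unpack the packed rgb value with shifts
// and write the normalized channels column by column.
void fillRgbDescriptor(const sensor_msgs::PointCloud2& rosMsg,
                       Eigen::MatrixXf& rgbView) // assumed pre-sized to 3 x numPoints
{
    sensor_msgs::PointCloud2ConstIterator<float> itRgb(rosMsg, "rgb");
    for (int col = 0; itRgb != itRgb.end(); ++itRgb, ++col)
    {
        uint32_t packed;
        std::memcpy(&packed, &(*itRgb), sizeof(packed));                      // reinterpret the FLOAT32 bits
        rgbView(0, col) = static_cast<float>((packed >> 16) & 0xFF) / 255.0f; // r
        rgbView(1, col) = static_cast<float>((packed >> 8) & 0xFF) / 255.0f;  // g
        rgbView(2, col) = static_cast<float>(packed & 0xFF) / 255.0f;         // b
    }
}
```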
YoshuaNava commented 4 years ago

I'm going to tackle this in the next 2 weeks. I'll give updates on my progress.

pomerlef commented 4 years ago

Great! Something that is sorely missing on the ROS side of libpm is unit tests. If you're doing some for this work, let us know so we can integrate them properly.

For padding, libpm uses homogeneous coordinates to apply rigid transformations to the whole point cloud with a matrix multiplication instead of a loop.
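
For illustration, outside of libpointmatcher's own types, the idea looks like this in plain Eigen (a minimal sketch):

```cpp
#include <Eigen/Core>
#include <Eigen/Geometry>

int main()
{
    const int numPoints = 5;

    // 4 x N feature matrix: rows x, y, z plus a padding row of ones
    // (the homogeneous coordinate).
    Eigen::MatrixXf features(4, numPoints);
    features.topRows(3) = Eigen::MatrixXf::Random(3, numPoints);
    features.row(3).setOnes();

    // Some rigid transformation: rotation about z plus a translation.
    const Eigen::Affine3f transform =
        Eigen::Translation3f(1.0f, 2.0f, 0.0f) *
        Eigen::AngleAxisf(0.5f, Eigen::Vector3f::UnitZ());

    // One 4x4 * 4xN product moves every point; no per-point loop needed.
    const Eigen::MatrixXf transformed = transform.matrix() * features;
    (void)transformed;

    return 0;
}
```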