Open YoshuaNava opened 4 years ago
I will try to optimize 3 of the loops in the deserialization function. In order of resource consumption:
Based on this article, Eigen matrices are stored in column-major order by default. It could be very productive to write the data column by column, since each column is stored (and loaded) contiguously. A loop of this type would also be easier for the compiler to unroll or vectorize.
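To make the column-order idea concrete, here is a minimal sketch in plain C++. It does not use the actual libpointmatcher or Eigen types: `ColMajorMatrix` and `fillByColumn` are hypothetical names, and the input is assumed to be a flat buffer of packed x, y, z floats, which may not match the real message layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column-major matrix (Eigen's default layout):
// element (r, c) lives at data[c * rows + r].
struct ColMajorMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;

    ColMajorMatrix(std::size_t r, std::size_t c)
        : rows(r), cols(c), data(r * c, 0.0f) {}

    float& operator()(std::size_t r, std::size_t c) { return data[c * rows + r]; }
    float operator()(std::size_t r, std::size_t c) const { return data[c * rows + r]; }
};

// Assumption: xyz holds nPoints packed points as x0 y0 z0 x1 y1 z1 ...
// The outer loop walks columns (points) and the inner loop walks rows, so
// consecutive writes hit consecutive addresses in the destination buffer.
ColMajorMatrix fillByColumn(const std::vector<float>& xyz, std::size_t nPoints) {
    const std::size_t dim = 3;
    assert(xyz.size() == nPoints * dim);

    ColMajorMatrix m(dim, nPoints);
    for (std::size_t col = 0; col < nPoints; ++col) {
        for (std::size_t row = 0; row < dim; ++row) {
            m(row, col) = xyz[col * dim + row];
        }
    }
    return m;
}
```

Swapping the two loops (rows outside, points inside) would write with a stride of `rows` floats, which is the access pattern the column-order suggestion is trying to avoid.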
I'm unsure about the current use of padding, so I'm going through the code atm.
I'm going to tackle this in the next 2 weeks. I'll give updates on my progress.
Great! Something that is badly missing on the ROS side of libpm is unit tests. If you're writing some for this work, let us know so we can integrate them properly.
For padding, libpm uses homogeneous coordinates so that a rigid transformation can be applied to the whole point cloud with a single matrix multiplication instead of a per-point loop.
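As a rough illustration of why the padding row exists, the sketch below stores each point as a column (x, y, z, 1) and multiplies the whole cloud by a 4x4 transform. It is plain C++ rather than the Eigen code libpm actually uses. `Mat4` and `transformCloud` are hypothetical names, and the loops here only spell out what a matrix library would compute as one product.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 4x4 transform, column-major: element (i, k) is T[k * 4 + i].
using Mat4 = std::array<float, 16>;

// cloud is a 4 x N column-major buffer; each column is (x, y, z, 1).
// The trailing 1 is the padding: it lets the translation stored in the last
// column of T be added by the same product that applies the rotation.
std::vector<float> transformCloud(const Mat4& T, const std::vector<float>& cloud) {
    assert(cloud.size() % 4 == 0);
    const std::size_t n = cloud.size() / 4;

    std::vector<float> out(cloud.size(), 0.0f);
    for (std::size_t j = 0; j < n; ++j) {          // each point (column)
        for (std::size_t k = 0; k < 4; ++k) {      // each input coordinate
            const float v = cloud[j * 4 + k];
            for (std::size_t i = 0; i < 4; ++i) {  // each output coordinate
                out[j * 4 + i] += T[k * 4 + i] * v;
            }
        }
    }
    return out;
}
```

For example, an identity rotation with translation (1, 2, 3) maps the padded point (1, 0, 0, 1) to (2, 2, 3, 1). Without the padding row, the translation would need a separate addition per point.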
Hi, as part of my efforts to benchmark libpointmatcher, I wrote a ROS node that uses libpointmatcher_ros/point_cloud (the old version from ethzasl_icp_mapping) to serialize and deserialize point cloud data. The node receives a point cloud message, deserializes it, applies a few filters, and finally publishes the resulting point cloud. I ran it for 100+ seconds.
I quickly found that the most expensive method called in my program (even more expensive than a surface normal data points filter run on every iteration) was
rosMsgToPointMatcherCloud(sensor_msgs::PointCloud2, bool)
from point_cloud.cpp. I used Intel VTune community edition to find hotspots and Intel Advisor for vectorization advice. In the following lines I describe my search for hotspots and give a short analysis.
Hotspots
CPU
Memory access
Memory writing
Vectorization advice
Analysis
(The code in this repo might not be the same as the one from ethzasl_icp_mapping. I'll try to update my analysis)
I found 3 main CPU-time hotspots:
In terms of memory access, number 1 from the above list is also a strong hotspot. When it comes to memory writing, all paged memory is cleared by the function, and the allocations are neither large nor numerous (compared to other methods, e.g. ROS TCP).
Intel Advisor recommends optimizing [the "RGB loop"](https://github.com/norlab-ulaval/libpointmatcher_ros/blob/master/src/PointMatcher_ROS.cpp#L113) first, then the quadruple loop described in point 1 of the CPU hotspots, as well as a loop in libnabo.