norlab-ulaval / libpointmatcher

An Iterative Closest Point (ICP) library for 2D and 3D mapping in Robotics
BSD 3-Clause "New" or "Revised" License

Question about potential for more optimizations #342

Open YoshuaNava opened 5 years ago

YoshuaNava commented 5 years ago

Hi developers, First of all, thank you very much for implementing and maintaining libpointmatcher. It's an excellent library for point cloud registration and filtering. Its code quality, number of features and performance are really great.

I'm a power user of libpointmatcher, and for the platform we're running it on we want to explore some performance optimizations to reduce the load from ICP and point cloud filters. I was wondering about the possibility of evaluating performance and applying optimizations where needed (e.g. via intrinsics, compiler flags, or OpenCL), to accelerate the rate at which ICP and filters can be run.

In this regard, could you give me any pointers to what has been tried before? I would also like your opinion on adding OpenCL/intrinsics code.

Furthermore, I have read arguments from multiple authors about the possibility of optimizing ICP by means of parallelism. Some are in favor, while others argue that it doesn't really make a big difference. What are your thoughts on this?

Thank you very much in advance.

pomerlef commented 5 years ago

I've seen a couple of people implement ICP on a GPU, and loading the point clouds onto the GPU takes a lot of time, which eats up much of the performance gain.

You could start with an analysis of which modules cost the most in terms of computation time. My expectation is that the matching module takes about 80% of the computation time. We already rely on libnabo, which uses OpenMP to deploy multiple threads. I'm not aware of an approximate kd-tree solution that runs on a GPU.

That being said, adding some OpenMP preprocessor directives could help for the DataPointFilters too. If you find out something, we will welcome any gain in performance!

YoshuaNava commented 4 years ago

@pomerlef Thank you very much for your input.

We are quite interested in taking this forward, and have two ideas on feasible optimizations to libpointmatcher:

pomerlef commented 4 years ago
  1. Filters related to downsampling will change the order of points; they actually compact the cloud by removing deleted points. The ones doing descriptor augmentation should not change the order. If one does, then it's a bug.
  2. We did the optimisation manually, bit by bit, without a specific tool. We would be happy to have you look through it more systematically. Any gain in computation time will help everybody.
YoshuaNava commented 4 years ago

@pomerlef I understand. Thank you for your reply.

Regarding point 1, I will take a closer look at the filters we are using very soon.

About point 2, we are planning to do this, but in small pieces and over this year. I will get back to you with results as soon as we have them.

Something else I have been considering, and wanted to mention as item 3: have you considered the possibility of using organized point clouds? Filters like SurfaceNormal could save significant computation time when point clouds are organized, since you would not have to run K-NN to find neighboring points.

I've noticed that many packages implicitly organize point clouds to extract environmental features (e.g. LOAM) or to extract planes (Example), that some lidar drivers nowadays allow you to get organized point clouds or at least projections (e.g. Velodyne, Ouster), and that a flat representation of point clouds has further benefits: it can be traversed in any order and easily reprojected into other sensors.

pomerlef commented 4 years ago

You're turning over every rock for optimization ;)

So, for organized point clouds, they tend to do a projection onto a 2D grid to improve speed. Doing a projection implies losing information. This works only for a single point cloud from a sensor, not for the map, as the map has no focal point. When using those solutions, we trade generalization for computation speed (a typical engineering challenge).

I see the use of kd-trees and octrees as a way to organize point clouds, but they are indeed costly.

We could add those projections, but we would then also need to make it explicit to the user that those filters can only work on sensor inputs, not on generic maps.

YoshuaNava commented 4 years ago

@pomerlef Thank you for your reply.

Indeed, I'm looking at all the options before starting with any implementation. Regarding the topic of organized point clouds, I will start a separate issue to discuss more in depth how we could represent them, if that's ok for you.

(The following point diverges from the optimizations discussion, but I think it makes sense to mention it.) About map representations, I agree that organized point clouds in the form of projections are not usable there. But I would also say that keeping unorganized point clouds as map representations is not ideal: you need to discard a lot of data to keep an unorganized point cloud map manageable for an online SLAM system.

I think a better map representation would be something that

pomerlef commented 4 years ago

Some of those were implemented at the application level in ethzasl_icp_mapper.

YoshuaNava commented 3 years ago

@pomerlef Thank you for the reference.

An additional question: do you think there could be some way to save and re-use a KD-tree? KNN is used in several places (ICP matching, overlap estimation for extending the map, dynamic point detection, from your work with P. Krüsi). Even though libnabo and libpointmatcher are generally quite fast, the KNN step has a fairly large resource footprint.

After a quick search I found a blog post from a VCG lib developer who tried this. He uses the KD-tree to match points to a mesh, and sees around a 3x speed-up. If this were possible for SLAM, one of the main questions to solve would be how to update the KD-tree when the map is extended.

pomerlef commented 3 years ago

From what I recall, libnabo only uses pointers, so the memory footprint should be small.

I'm not sure what you have in mind for saving the kd-tree. Do you want to 1) put it on disk, or 2) make the data structure available to other modules? The second one has crossed my mind a couple of times, but I couldn't find a solution that doesn't lose the modularity of the library. There could be some method to feed a tree built from outside, but we should stay general enough to allow the use of something other than libnabo or a kd-tree.