Open Alevs2R opened 3 years ago
Thanks for letting me know this issue and providing nice benchmark results. When I wrote the code, I noticed that pcl::Registration
creates kdtree inside, but I thought its construction cost would be negligible. I'll disable the kdtree creation with force_no_recompute
.
Problem
I discovered that KDtree creation takes significantly more time when aligning with a big target cloud (with a map).
KDtree creation takes only 3ms for clouds with 17k points, but it takes 90ms for 370k points (my small 100x100 meters filtered map).
PCL creates kdtree inside "align" method, so it means that
ndt_cuda (P2D)
runs for 98ms instead of 8ms during first run.Moreover, PCL spends 20-30ms to calculate fitness score, regardless of the size of the target map. It is also not very good 😁.
Detailed benchmark results
**Aligning big target cloud with kdtree creation** ``` target:364444[pts] source:18564[pts] --- fgicp_st --- single:2293.17[msec] --- fgicp_mt --- single:460.209[msec] --- vgicp_st --- single:2132.44[msec] --- vgicp_mt --- single:381.7[msec] --- ndt_cuda (P2D) --- single:98.3882[msec] --- ndt_cuda (D2D) --- single:93.0577[msec] --- vgicp_cuda (parallel_kdtree) --- single:340.319[msec] --- vgicp_cuda (gpu_bruteforce) --- single:9404.37[msec] --- vgicp_cuda (gpu_rbf_kernel) --- terminate called after throwing an instance of 'thrust::system::detail::bad_alloc' what(): std::bad_alloc: cudaErrorMemoryAllocation: out of memory Aborted (core dumped) ``` **Aligning big target cloud without kdtree creation** ``` target:364444[pts] source:18564[pts] --- fgicp_st --- single:2224.49[msec] --- fgicp_mt --- single:312.474[msec] --- vgicp_st --- single:2044.39[msec] --- vgicp_mt --- single:293.93[msec] --- ndt_cuda (P2D) --- single:8.7521[msec] --- ndt_cuda (D2D) --- single:7.81053[msec] --- vgicp_cuda (parallel_kdtree) --- single:246.274[msec] --- vgicp_cuda (gpu_bruteforce) --- single:9536.63[msec] --- vgicp_cuda (gpu_rbf_kernel) --- terminate called after throwing an instance of 'thrust::system::detail::bad_alloc' what(): std::bad_alloc: cudaErrorMemoryAllocation: out of memory Aborted (core dumped) ``` **Aligning default clouds with kdtree creation** ``` --- fgicp_st --- single:263.107[msec] 100times:26208.3[msec] 100times_reuse:17901.3[msec] fitness_score:0.204374 --- fgicp_mt --- single:53.8672[msec] 100times:3846.4[msec] 100times_reuse:2469.65[msec] fitness_score:0.204384 --- vgicp_st --- single:185.224[msec] 100times:18208.6[msec] 100times_reuse:9911.83[msec] fitness_score:0.205022 --- vgicp_mt --- single:54.0197[msec] 100times:2933.43[msec] 100times_reuse:1687.04[msec] fitness_score:0.205022 --- ndt_cuda (P2D) --- single:8.74302[msec] 100times:861.372[msec] 100times_reuse:499.426[msec] fitness_score:0.197207 --- ndt_cuda (D2D) --- single:6.07843[msec] 100times:587.673[msec] 100times_reuse:315.464[msec] fitness_score:0.198976 --- vgicp_cuda (parallel_kdtree) --- single:31.5053[msec] 100times:2363.01[msec] 100times_reuse:992.915[msec] fitness_score:0.204697 --- vgicp_cuda (gpu_bruteforce) --- single:45.3529[msec] 100times:3512.59[msec] 100times_reuse:1685.24[msec] fitness_score:0.204693 --- vgicp_cuda (gpu_rbf_kernel) --- single:7.57619[msec] 100times:750.764[msec] 100times_reuse:309.824[msec] fitness_score:0.204584 ``` **Aligning default clouds without kdtree creation** ``` --- fgicp_st --- single:260.423[msec] 100times:25678.1[msec] 100times_reuse:17731.4[msec] --- fgicp_mt --- single:82.7624[msec] 100times:3546.36[msec] 100times_reuse:2466.41[msec] --- vgicp_st --- single:182.533[msec] 100times:17876.4[msec] 100times_reuse:9981.22[msec] --- vgicp_mt --- single:36.4721[msec] 100times:2744.13[msec] 100times_reuse:1607.6[msec] --- ndt_cuda (P2D) --- single:6.20965[msec] 100times:558.324[msec] 100times_reuse:472.443[msec] --- ndt_cuda (D2D) --- single:3.04992[msec] 100times:306.246[msec] 100times_reuse:303.85[msec] --- vgicp_cuda (parallel_kdtree) --- single:37.2568[msec] 100times:1809.56[msec] 100times_reuse:1083.63[msec] --- vgicp_cuda (gpu_bruteforce) --- single:42.7751[msec] 100times:3270.32[msec] 100times_reuse:1685.56[msec] --- vgicp_cuda (gpu_rbf_kernel) --- single:4.78256[msec] 100times:472.186[msec] 100times_reuse:306.657[msec] ```Simple solution
My solution is to set empty kdtree as search method and enable
force_no_recompute
flag before calling "align". \ Like this:You can easily repeat my experiment in fork, branch
test/disable-kdtree
. Command:Better solution
Better solution would be to replace PCL default kdtree with something more efficient 😊