norlab-ulaval / libnabo

A fast K Nearest Neighbor library for low-dimensional spaces
http://norlab-ulaval.github.io/libnabo/
BSD 3-Clause "New" or "Revised" License

One of a few experiments I did with moving libnabo to CUDA #33

Open LouisCastricato opened 9 years ago

LouisCastricato commented 9 years ago

As stated before, all attempts were unsuccessful, as I was not able to beat or even match libnabo's CPU performance.

I'm passing this project on to a colleague of mine who has been working with KD trees and KNN for roughly ten years. I hope he can do a much better job than I did.

The code isn't hooked up to any C or C++ interface; the wrapper I wrote for that was very messy and kept changing as I tried different approaches. I'm only putting this here as a proof of concept in case anyone wants to tinker with it.

Dustin, my colleague, will be writing a new C++ wrapper in the coming month when he does his attempt.

I don't know what his approach will be, but his fork is here: https://github.com/dustprog/libnabo

ethzasl-jenkins commented 9 years ago

Can one of the admins verify this patch?

LouisCastricato commented 9 years ago

Oh... I didn't mean to push the change to the README... Um... should I create a cudareadme.md?

LouisCastricato commented 9 years ago

https://github.com/ethz-asl/libnabo/issues/29 Linking the thread I had started about this.

My experience with implementing a KD-tree search in CUDA is that, no matter what I did, thread divergence was always the biggest overhead. In the best case the threads were idle about half the time; in the worst case they were idle 31/32 of the time.
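To make that concrete, here is a minimal sketch (not the code attached to this issue) of the naive one-query-per-thread traversal; the node layout and names are made up for illustration. Each thread's branch decision depends on its own query point, so the 32 threads of a warp rarely agree on which child to visit and the warp ends up serializing both paths:

```cuda
// Illustrative sketch of warp divergence in per-thread KD-tree descent.
// Node layout, names, and the 3D point type are assumptions, not libnabo's.
#include <cuda_runtime.h>

struct KdNode {
    float split;   // split coordinate
    int   axis;    // splitting axis (0..2), or -1 for a leaf
    int   left;    // index of left child in the node array
    int   right;   // index of right child
    int   point;   // index of the stored point (valid for leaves)
};

__global__ void nearestLeaf(const KdNode* nodes,
                            const float3* queries,
                            int*          result,
                            int           numQueries)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= numQueries) return;

    float3 query = queries[q];
    int node = 0;  // root

    // Descend to a leaf. The branch taken below is data-dependent, so
    // neighbouring threads in a warp generally want different children:
    // this is the divergence that dominated the runtime in my experiments.
    while (nodes[node].axis >= 0) {
        float coord = (nodes[node].axis == 0) ? query.x
                    : (nodes[node].axis == 1) ? query.y
                                              : query.z;
        node = (coord < nodes[node].split) ? nodes[node].left
                                           : nodes[node].right;
    }

    // Candidate neighbour only; a full k-NN search would now backtrack
    // with a per-thread stack, which diverges even more.
    result[q] = nodes[node].point;
}
```

A complete search has to revisit the other subtree whenever the splitting plane is closer than the current best distance, and those backtracking decisions diverge across the warp just as badly as the descent does.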

My general prediction is that KD-tree search algorithms like the one libnabo uses will not be as effective in a GPGPU environment until something like NVIDIA's Volta is released, which would allow an on-board ARM chip to queue new work for the GPU and therefore potentially reduce warp divergence.

FLANN's approach to CUDA seemed interesting, but the approximation error was far too large for it to be practical for anything more than very basic applications.

Unless Dustin sees something that I didn't, I'm going to say that doing this on the GPU may not be viable for a while longer (two years, perhaps?). And even by then, the monstrous improvements in cache performance that Intel's Skylake promises may further shrink any advantage of a GPGPU KD search over a CPU one.

simonlynen commented 9 years ago

@MatrixCompSci thanks a lot for sharing your insight, thoughts and code. Great!

LouisCastricato commented 9 years ago

The one thing I still really want to try is doing this on a Xeon Phi. I'm genuinely curious haha