[x] Use uint32_t instead of uint64_t, since my data size always stays below 1<<32.
~[ ] Refactoring the code could help: get rid of the device function; remove as many temp variables as possible.~
[x] Port the GPU code onto CPU.
The grid implementation on the CPU is faster than the one on the GPU.
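For the uint32_t item above, a minimal sketch of how the narrowing could be guarded. The helper name to_index32 is hypothetical (it is not from my code); it only illustrates checking the "fits in 32 bits" assumption once at the boundary, so the rest of the code can use 32-bit indices freely.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper (an assumption, not part of the original code):
// narrow a 64-bit size to a 32-bit index and fail loudly if the
// "data size fits in 32 bits" assumption is ever violated.
inline std::uint32_t to_index32(std::uint64_t n) {
    if (n > std::numeric_limits<std::uint32_t>::max()) {
        throw std::overflow_error("value does not fit in uint32_t");
    }
    return static_cast<std::uint32_t>(n);
}
```

Note that the largest value that fits is (1<<32) - 1, so a size of exactly 1<<32 is rejected.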
Another example of something you could try is moving
const float ux = x[u], uy = y[u];
to after the __syncthreads(); call. Then you might be able to reuse a register that you don't need anymore, rather than keeping ux and uy in registers for 12 lines of code. (This is generally advised in C++ anyway: don't declare variables until the point where you actually need them, as it's easier for the reader to see all the relevant code at once.)
grid is stupid. i am stupid.
sorting by L1 is king.
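A rough sketch of what sorting by L1 can look like, under several assumptions that are not stated in these notes: that "L1" means the key s = x + y, that the task is finding pairs of points within an L1 (Manhattan) distance r, and that points are stored as float pairs. The names Point and count_pairs_sorted_l1 are made up for illustration. The idea is that |s_j - s_i| never exceeds the L1 distance between two points, so after sorting by s the inner scan can stop as soon as the key gap is larger than r.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical point type; the real code keeps x[] and y[] separately.
struct Point { float x, y; };

// Counts unordered pairs whose L1 (Manhattan) distance is <= r.
// Takes pts by value because it sorts its own copy.
std::uint64_t count_pairs_sorted_l1(std::vector<Point> pts, float r) {
    // Sort by the key s = x + y.
    std::sort(pts.begin(), pts.end(), [](const Point& a, const Point& b) {
        return a.x + a.y < b.x + b.y;
    });

    std::uint64_t count = 0;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        const float si = pts[i].x + pts[i].y;
        for (std::size_t j = i + 1; j < pts.size(); ++j) {
            // Key gap is a lower bound on the L1 distance, so once it
            // exceeds r no later point can be within r of point i.
            // (Floating-point rounding right at the boundary is ignored here.)
            if ((pts[j].x + pts[j].y) - si > r) break;

            // The key is only a filter; the real distance is still checked.
            const float d = std::fabs(pts[j].x - pts[i].x) +
                            std::fabs(pts[j].y - pts[i].y);
            if (d <= r) ++count;
        }
    }
    return count;
}
```

For example, with the points (0,0), (1,0), (0,1), (5,5), (5,6), (-3,3) and r = 1.5, the point (-3,3) has the same key as (0,0) but is far away, which is why the exact distance check stays in the inner loop.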