moments.c contains the function momEvalMomr(), which has not yet been ported to CUDA. If double-precision performance is reasonable on Kepler GPUs, try writing a CUDA version of this function.
On the other hand, if CUDA double-precision performance is really poor, we should try switching to a single-precision-friendly version of the multipole moment code. This lives on the trq/scaledmoments branch and involves the function momEvalFmomrcm() in moments.c.