I have gotten the CUDA interface to work in Python now, and I am seeing a lot of memory allocations when I call the DLL. It's still way faster than what I was doing before.
I was messing around with the NVIDIA Performance Primitives (NPP), and many of its functions require a pre-allocated scratch buffer: I can pass an already-allocated array to the NPP DLL, and then all of the memory is set up before the call.
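To make the convention concrete, here is a minimal pure-Python sketch of the two-phase pattern NPP uses (the function names here are stand-ins, not the real NPP API): first query how much scratch space a call needs, then allocate once and reuse that buffer across calls.

```python
def get_scratch_size(n: int) -> int:
    # Stand-in for an NPP-style *GetBufferSize query: report the number
    # of scratch bytes the compute call will need for an input of length n.
    return 4 * n

def reduce_sum(data, scratch: bytearray) -> float:
    # Stand-in for the actual compute call: it uses the caller-owned
    # scratch buffer and performs no allocation of its own.
    assert len(scratch) >= get_scratch_size(len(data))
    return float(sum(data))

data = [1.0, 2.0, 3.0]
scratch = bytearray(get_scratch_size(len(data)))  # allocated once, up front
total = reduce_sum(data, scratch)                 # no allocation inside the call
```

The point is that the expensive allocation happens once, outside the hot loop, instead of on every call.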
So what I am asking is the following:
Am I correct in assuming that the CUDA interface does not take pointers for all of the buffers it needs to do the calculation, and instead allocates some of them internally?
Is there theoretically a way to change the interface so that the "scratch buffer" size Gpufit needs can be precomputed, letting the caller allocate it once and eliminating some of the per-call memory allocations?
Looking at the traces, about 50% of a call to Gpufit's CUDA interface is spent in memory allocations, while with the NPP routines I never see any allocations, because I give them a preallocated scratch buffer.
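The shape of the interface change I have in mind, sketched from the ctypes side (the Gpufit entry points named in the comments are hypothetical; nothing like them exists today, this is just the proposal):

```python
import ctypes

# Hypothetical two-step interface for Gpufit:
#   step 1: int gpufit_get_workspace_size(...) -> scratch bytes needed
#   step 2: pass the same caller-owned buffer to every fitting call,
#           e.g. lib.gpufit_cuda_interface(..., ctypes.byref(workspace))

def make_workspace(size_in_bytes: int) -> ctypes.Array:
    # One up-front allocation, owned by the caller and reused across calls.
    return (ctypes.c_char * size_in_bytes)()

workspace = make_workspace(1024)
```

With something like this, the allocations I'm seeing in the traces would move out of the per-call path entirely.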
I'm willing to look into this myself and contribute; it's just helpful to hear your thoughts on it first.
EDIT: oh, and pybind11: are there any performance advantages to using pybind11 instead of ctypes? I just used ctypes, but I was intrigued that pybind11 requires you to modify the C++ source, so I would assume the integration is a bit tighter.