I have gotten the CUDA interface to work in Python now, and I am seeing a lot of memory allocations when I call the DLL. It's still way faster than what I was doing before.
I was messing around with the NVIDIA Performance Primitives (NPP), and many of its functions require a pre-allocated scratch buffer: I can pass an already-allocated array to the NPP DLL, and then all of the memory is set up before the call.
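To make the convention concrete, here is a minimal pure-Python sketch of the two-phase pattern NPP uses (the function names here are stand-ins, not the real NPP API): first query how much scratch space a call needs, then allocate once and reuse that buffer across calls.

```python
def get_scratch_size(n: int) -> int:
    # Stand-in for an NPP-style *GetBufferSize query: report the number
    # of scratch bytes the compute call will need for an input of length n.
    return 4 * n

def reduce_sum(data, scratch: bytearray) -> float:
    # Stand-in for the actual compute call: it uses the caller-owned
    # scratch buffer and performs no allocation of its own.
    assert len(scratch) >= get_scratch_size(len(data))
    return float(sum(data))

data = [1.0, 2.0, 3.0]
scratch = bytearray(get_scratch_size(len(data)))  # allocated once, up front
total = reduce_sum(data, scratch)                 # no allocation inside the call
```

The point is that the expensive allocation happens once, outside the hot loop, instead of on every call.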
So what I am asking is the following:
Am I correct in assuming that the CUDA interface does not take pointers for all of the buffers it needs to do the calculation, and instead allocates some of them internally?
Is there theoretically a way to change the interface so that the "scratch buffer" size Gpufit needs can be precomputed, letting the caller allocate it once and eliminating some of the per-call memory allocations?
Looking at the traces, about 50% of a call to Gpufit's CUDA interface is spent in memory allocations, while with the NPP routines I never see any allocations, because I give them a preallocated scratch buffer.
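The shape of the interface change I have in mind, sketched from the ctypes side (the Gpufit entry points named in the comments are hypothetical; nothing like them exists today, this is just the proposal):

```python
import ctypes

# Hypothetical two-step interface for Gpufit:
#   step 1: int gpufit_get_workspace_size(...) -> scratch bytes needed
#   step 2: pass the same caller-owned buffer to every fitting call,
#           e.g. lib.gpufit_cuda_interface(..., ctypes.byref(workspace))

def make_workspace(size_in_bytes: int) -> ctypes.Array:
    # One up-front allocation, owned by the caller and reused across calls.
    return (ctypes.c_char * size_in_bytes)()

workspace = make_workspace(1024)
```

With something like this, the allocations I'm seeing in the traces would move out of the per-call path entirely.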
I'm willing to look into this myself and contribute; it's just helpful to hear your thoughts on it first.
EDIT: oh, and pybind11: are there any performance advantages to using pybind11 instead of ctypes? I just used ctypes, but I was intrigued that pybind11 requires you to modify the C++ source, so I would assume the integration is a bit tighter.