Closed MMMHT closed 5 years ago
Here is what my code looks like:

```lua
require "mylib"
local vol = torch.CudaTensor(1000, 500)
local out = torch.CudaTensor(1000, 500)
mylib.foo(vol, out)
-- do something with out
```
mylib.cu:

```c
static int foo(lua_State *L) {
    THCState *state = getCutorchState(L);
    THCudaTensor *input  = (THCudaTensor*)luaT_checkudata(L, 1, "torch.CudaTensor");
    THCudaTensor *output = (THCudaTensor*)luaT_checkudata(L, 2, "torch.CudaTensor");
    // launch the kernel (grid/block configuration omitted here)
    foo<<<grid, block>>>(input, output);
    return 1;
}
```
Hi! For work I need to write a custom CUDA function for Torch. The problem is that calling the custom function takes noticeably longer than the kernel itself runs. In more detail: I use sys.clock() on the Torch side and measure 330 ms for the call, while CUDA events around the kernel measure about 260 ms. I suspect the difference is data transfer, because when I comment out the kernel launch, calling the custom function still takes about 70 ms, which is exactly the extra 70 ms. So I wonder: is there some way to pass the data as a raw device pointer rather than a torch.CudaTensor, so that I can copy data to the GPU manually with cudaMemcpyAsync or something and save that time?
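To make the question concrete, here is a minimal sketch of what I mean, assuming the standard cutorch THC API (`THCudaTensor_data`, `THCState_getCurrentStream`, `THCudaTensor_nElement`); `foo_kernel` and the launch configuration are placeholders, not my real code:

```cuda
// Sketch: get raw device pointers from the tensors and launch on the
// current cutorch stream, instead of handing the tensor structs to the kernel.
static int foo(lua_State *L) {
    THCState *state = getCutorchState(L);
    THCudaTensor *input  = (THCudaTensor*)luaT_checkudata(L, 1, "torch.CudaTensor");
    THCudaTensor *output = (THCudaTensor*)luaT_checkudata(L, 2, "torch.CudaTensor");

    // Raw device pointers to the tensor storage (data already lives on the GPU)
    float *d_in  = THCudaTensor_data(state, input);
    float *d_out = THCudaTensor_data(state, output);
    long n = THCudaTensor_nElement(state, input);

    // Launch asynchronously on cutorch's current stream so it can overlap
    // with copies issued via cudaMemcpyAsync on the same stream
    cudaStream_t stream = THCState_getCurrentStream(state);
    foo_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
    return 1;
}
```

Is something along these lines the right way to avoid the extra 70 ms, or is there a better approach?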