torch / torch7

http://torch.ch

Custom C/CUDA implementation runs slow when called in Torch #1178

Closed MMMHT closed 5 years ago

MMMHT commented 5 years ago

Hi! For my work I need to write a custom CUDA function for Torch. The problem is that calling the custom function takes more time than the kernel itself uses. In more detail: using sys.clock() to time the call from Torch I get 330 ms, while measuring the kernel with CUDA events gives about 260 ms. Maybe the difference is data transfer, because when I comment out the kernel launch, running the custom CUDA function still takes about 70 ms, which is exactly the extra 70 ms. So I wonder whether there is some way to transfer data through a raw data pointer rather than a torch.CudaTensor, so that I can copy the data to the GPU manually with cudaMemcpyAsync or something similar and save that time?
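A minimal sketch of the Torch-side timing described above (the tensor sizes and the mylib.foo binding are taken from the snippet in the next comment; the cutorch.synchronize() calls are an assumption added here because kernel launches are asynchronous, otherwise sys.clock() would mostly measure launch overhead):

require "sys"
require "cutorch"
require "mylib"

local vol = torch.CudaTensor(1000, 500)
local out = torch.CudaTensor(1000, 500)

cutorch.synchronize()                 -- make sure earlier GPU work has finished
local t0 = sys.clock()
mylib.foo(vol, out)                   -- call the custom binding
cutorch.synchronize()                 -- wait for the kernel before stopping the clock
print(string.format("foo: %.1f ms", (sys.clock() - t0) * 1000))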

MMMHT commented 5 years ago

Here is what my code looks like (Lua):

required "mylib"

local vol = torch.CudaTensor(1000, 500)
local out = torch.CudaTensor(1000, 500)
mylib.foo(vol,out)
--  do something to out --

mylib.cu

static int foo(lua_State *L) {
    THCState *state = getCutorchState(L);
    // fetch the two tensors pushed from the Lua side
    THCudaTensor *input  = (THCudaTensor *)luaT_checkudata(L, 1, "torch.CudaTensor");
    THCudaTensor *output = (THCudaTensor *)luaT_checkudata(L, 2, "torch.CudaTensor");

    // launch the kernel on the raw device pointers (grid/block configuration omitted here)
    foo<<< /* grid */, /* block */ >>>(THCudaTensor_data(state, input),
                                       THCudaTensor_data(state, output));

    return 1;  // the output tensor (top of the Lua stack) is returned to Lua
}
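On the raw-pointer question, a minimal sketch of a manual host-to-device copy, assuming cutorch's C API: THCudaTensor_data() returns the device pointer behind a torch.CudaTensor, and THCState_getCurrentStream() gives the stream cutorch is currently using. The foo_copy binding, the pinned host_buf buffer, and the inline getCutorchState helper below are illustrative assumptions, not part of the original code:

extern "C" {
#include <lua.h>
#include <luaT.h>
}
#include <THC/THC.h>
#include <cuda_runtime.h>

/* the usual helper for fetching cutorch's THCState from the Lua side */
static THCState *getCutorchState(lua_State *L) {
    lua_getglobal(L, "cutorch");
    lua_getfield(L, -1, "_state");
    THCState *state = (THCState *)lua_touserdata(L, -1);
    lua_pop(L, 2);
    return state;
}

/* hypothetical pinned host buffer, allocated elsewhere with cudaMallocHost */
static float *host_buf;

static int foo_copy(lua_State *L) {
    THCState *state = getCutorchState(L);
    THCudaTensor *output = (THCudaTensor *)luaT_checkudata(L, 1, "torch.CudaTensor");

    /* raw device pointer and byte size of the tensor's data */
    float *d_out = THCudaTensor_data(state, output);
    size_t bytes = (size_t)THCudaTensor_nElement(state, output) * sizeof(float);

    /* asynchronous host-to-device copy on cutorch's current stream */
    cudaMemcpyAsync(d_out, host_buf, bytes, cudaMemcpyHostToDevice,
                    THCState_getCurrentStream(state));

    return 0;  /* copy is only queued; synchronize (or stay on the same stream) before reading */
}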