Thanks for sharing the code, which is a good example of how to implement and index 4D tensors (e.g., of shape [N, C, H, W]) in a low-level CUDA kernel (e.g., idx = blockIdx.x * blockDim.x + threadIdx.x).
I would like to share the following:
Uncomment the macro definitions CHECK_CUDA, CHECK_CONTIGUOUS, and CHECK_INPUT. Otherwise you will probably get incorrect results when you run test/test.py, because tensors produced by, e.g., im0 = torch.FloatTensor(im0).permute(0, 3, 1, 2) are not contiguous. Use im0 = torch.FloatTensor(im0).permute(0, 3, 1, 2).contiguous() instead.
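To see why the uncommented checks matter: permute only changes the tensor's strides, not its memory layout, so the permuted tensor is a non-contiguous view that the kernel's flat index math cannot handle. A minimal sketch (using a random NHWC tensor in place of an actual image):

```python
import torch

im0 = torch.rand(1, 4, 4, 3)        # NHWC, standing in for a loaded image
nchw = im0.permute(0, 3, 1, 2)      # a view: strides change, memory does not
print(nchw.is_contiguous())         # False: CHECK_CONTIGUOUS would reject this

fixed = nchw.contiguous()           # copies the data into NCHW memory order
print(fixed.is_contiguous())        # True: safe to pass to the CUDA kernel
```

With the checks enabled, passing the non-contiguous view raises an error instead of silently producing wrong results.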
To fix the compilation deprecation warnings:
a) In the cuda/forward_warp_cuda.cpp file, change "#define CHECK_CUDA(x) TORCH_CHECK(x.type().is_cuda(), ..." to "#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), ...".
b) In the cuda/forward_warp_cuda_kernel.cu file, change "AT_DISPATCH_FLOATING_TYPES(im0.type(), ..." to "AT_DISPATCH_FLOATING_TYPES(im0.scalar_type(), ...", and im0.data<scalar_t>() to im0.data_ptr<scalar_t>().
Now I can compile the CUDA code and get the correct results when running test/test.py.
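As an aside, the NCHW index arithmetic such a kernel relies on (and which only holds for contiguous tensors) can be checked from Python. This is a sketch, not the repository's code; nchw_flat_index is a hypothetical helper name:

```python
import torch

def nchw_flat_index(n, c, h, w, C, H, W):
    # Flat offset into a contiguous [N, C, H, W] tensor: the same
    # arithmetic a CUDA kernel performs on its computed thread index.
    return ((n * C + c) * H + h) * W + w

N, C, H, W = 2, 3, 4, 5
x = torch.arange(N * C * H * W).reshape(N, C, H, W)
# The flat offset and the 4D index address the same element.
assert x.flatten()[nchw_flat_index(1, 2, 3, 4, C, H, W)] == x[1, 2, 3, 4]
```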