Closed by sarrvesh 8 years ago
For optimal use of the SMs, make sure that the number of threads per block is a multiple of the warp size.
Based on this article (https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/#more-1443), it looks like my current implementation works. One thing to note, though, is to make sure that the number of blocks is an integer multiple of the number of SMs on the device.
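For reference, here is a minimal sketch of the grid-stride loop pattern that article describes. The kernel name `processPhi`, the arrays, and the use of `nPhi` as the element count are placeholders for this code base, not the actual implementation:

```cuda
// Grid-stride loop: each thread handles multiple elements, so the
// kernel works correctly for any grid/block configuration.
__global__ void processPhi(const float *in, float *out, int nPhi)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < nPhi;
         i += blockDim.x * gridDim.x) {
        out[i] = in[i];  // placeholder for the real per-element work
    }
}
```

Because the loop strides by the total thread count, the same kernel remains correct whether you launch one block or many, which is what makes the launch configuration tunable.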
See this article (https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/#more-3366) on computing GPU occupancy. Good code should achieve a high level of occupancy.
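The occupancy API from that article can pick the block size automatically. A sketch, assuming a grid-stride kernel named `processPhi` and device pointers `d_in`/`d_out` (all hypothetical names):

```cuda
// Let the runtime suggest a block size that maximizes occupancy
// for this specific kernel on the current device.
int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                   processPhi, 0, 0);

// Round up so every element is covered.
int gridSize = (nPhi + blockSize - 1) / blockSize;
processPhi<<<gridSize, blockSize>>>(d_in, d_out, nPhi);
```

This accounts for the kernel's register and shared-memory usage without requiring us to hard-code hardware details.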
In the current version, a single block with nPhi threads is launched by default. This is not necessarily the best option; ideally, one should decide based on the number of registers available per SM. A good understanding of the GPU hardware is needed to solve this problem.
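As an alternative to hard-coding one block of nPhi threads, the launch configuration could be derived from the device properties at runtime. A hypothetical sketch (kernel and pointer names are placeholders; the factor of 4 blocks per SM is an assumption to tune, not a recommendation from the articles):

```cuda
// Query the device instead of hard-coding the launch configuration.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

int blockSize = 256;                          // a multiple of the warp size (32)
int gridSize  = 4 * prop.multiProcessorCount; // a few blocks per SM, tunable

// prop.regsPerBlock is also available here if register pressure
// needs to be taken into account when choosing blockSize.
processPhi<<<gridSize, blockSize>>>(d_in, d_out, nPhi);
```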