sarrvesh / cuFFS

A GPU-accelerated Rotation Measure Synthesis code
GNU General Public License v2.0

Optimize thread and block size #14

Closed sarrvesh closed 8 years ago

sarrvesh commented 8 years ago

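In the current version, a single block with nPhi threads is launched by default. This is not necessarily the best option: a single block runs on only one multiprocessor (MP), so the rest of the device is left idle. Ideally, the launch configuration should be chosen based on the resources available per MP, such as the number of registers. A good understanding of the GPU hardware is needed to solve this problem.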
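As a rough sketch of the register argument, the CUDA runtime can report both the registers available per MP and the registers a compiled kernel uses per thread. `placeholderKernel`, `threadsLimitedByRegisters` and `registerLimit` are hypothetical names used for illustration, not cuFFS code. The result is only an upper bound, since the hardware allocates registers in chunks and other resources can limit residency as well.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel; substitute the kernel of interest.
__global__ void placeholderKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Upper bound on resident threads per MP imposed by registers alone.
static int threadsLimitedByRegisters(int regsPerMP, int regsPerThread) {
    return regsPerThread > 0 ? regsPerMP / regsPerThread : 0;
}

// Query the device and the kernel, then apply the bound above.
static cudaError_t registerLimit(int dev, int *maxThreads) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) return err;

    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, placeholderKernel);
    if (err != cudaSuccess) return err;

    // attr.numRegs is the number of registers used by each thread of the kernel.
    *maxThreads = threadsLimitedByRegisters(prop.regsPerMultiprocessor, attr.numRegs);
    return cudaSuccess;
}
```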

sarrvesh commented 8 years ago

For optimal use of the SMs, make sure that the number of threads per block is a multiple of the warp size.
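A minimal sketch of that rule: round the requested thread count up to the next multiple of the warp size reported by the device, and cap it at the device's block-size limit. `roundUpToWarp` and `chooseBlockSize` are hypothetical helpers, not part of cuFFS. Because the count is rounded up, the kernel would need a bounds check such as `if (idx < nPhi)` so the extra threads do no work.

```cuda
#include <cuda_runtime.h>

// Round a requested thread count up to the next multiple of the warp size.
static int roundUpToWarp(int nThreads, int warpSize) {
    return ((nThreads + warpSize - 1) / warpSize) * warpSize;
}

// Pick threads per block for nPhi work items on device `dev`.
// Returns 0 on success and writes the result to *threadsPerBlock.
static int chooseBlockSize(int nPhi, int dev, int *threadsPerBlock) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;

    int n = roundUpToWarp(nPhi, prop.warpSize);
    // Never exceed the largest block the device supports.
    if (n > prop.maxThreadsPerBlock) n = prop.maxThreadsPerBlock;

    *threadsPerBlock = n;
    return 0;
}
```

If nPhi exceeds the block-size limit, more than one block (or a loop inside the kernel) is needed to cover all work items.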

sarrvesh commented 8 years ago

Based on this article (https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/#more-1443), it looks like my current implementation works. One thing to note, though, is to make sure that the number of blocks in the grid is an integer multiple of the number of SMs on the device.
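For reference, a grid-stride loop with a grid sized as a multiple of the SM count could look roughly like this. `scaleKernel`, `gridSizeFor` and `launchScale` are hypothetical names, and the values of 32 blocks per SM and 256 threads per block are assumptions chosen for illustration rather than tuned settings.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scales each of the n elements of `data` by `factor`.
// The grid-stride loop lets a grid of any size cover any n.
__global__ void scaleKernel(float *data, int n, float factor) {
    int stride = blockDim.x * gridDim.x;  // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Number of blocks: an integer multiple of the SM count.
static int gridSizeFor(int numSMs, int blocksPerSM) {
    return numSMs * blocksPerSM;
}

// Launch on device `dev`; dData must point to n floats in device memory.
static cudaError_t launchScale(float *dData, int n, float factor, int dev) {
    int numSMs = 0;
    cudaError_t err = cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
    if (err != cudaSuccess) return err;

    const int blocksPerSM = 32;       // assumed value
    const int threadsPerBlock = 256;  // assumed value, a multiple of the warp size
    scaleKernel<<<gridSizeFor(numSMs, blocksPerSM), threadsPerBlock>>>(dData, n, factor);
    return cudaGetLastError();        // reports launch-configuration errors
}
```

Because the loop strides over the whole array, the grid size no longer has to match the problem size, which is what makes it free to follow the hardware instead.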

sarrvesh commented 8 years ago

See this article (https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/#more-3366) to compute the GPU occupancy. Good code should achieve a high level of occupancy.
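A sketch of how the occupancy API might be used here, assuming a hypothetical kernel `scaleKernel` in place of the real one. `cudaOccupancyMaxPotentialBlockSize` suggests a block size, and `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports how many such blocks can be resident on one SM. The theoretical occupancy is then the ratio of active warps to the maximum number of warps per SM. `occupancyFrom` and `reportOccupancy` are illustrative helpers, not cuFFS functions.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void scaleKernel(float *data, int n, float factor) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Theoretical occupancy = active warps per SM / maximum warps per SM.
static float occupancyFrom(int activeBlocks, int blockSize,
                           int warpSize, int maxThreadsPerSM) {
    int activeWarps = activeBlocks * blockSize / warpSize;
    int maxWarps = maxThreadsPerSM / warpSize;
    return (float)activeWarps / (float)maxWarps;
}

// Ask the runtime for a block size, then compute the resulting occupancy.
static cudaError_t reportOccupancy(int dev, int *blockSize, float *occupancy) {
    int minGridSize = 0;  // smallest grid that can reach full occupancy
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, blockSize, scaleKernel, 0, 0);
    if (err != cudaSuccess) return err;

    int activeBlocks = 0;  // resident blocks per SM at this block size
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &activeBlocks, scaleKernel, *blockSize, 0);
    if (err != cudaSuccess) return err;

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) return err;

    *occupancy = occupancyFrom(activeBlocks, *blockSize,
                               prop.warpSize, prop.maxThreadsPerMultiProcessor);
    return cudaSuccess;
}
```

High theoretical occupancy does not guarantee the best runtime, so the suggested block size is best treated as a starting point for timing runs.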