mtazzari / galario

Gpu Accelerated Library for Analysing Radio Interferometer Observations
https://mtazzari.github.io/galario/
GNU Lesser General Public License v3.0

threadblock #52

Closed fredRos closed 7 years ago

fredRos commented 7 years ago

From the GTC presentation

"GPU Performance Analysis and Optimization", Paulius Micikevicius, Developer Technology, NVIDIA

Threadblock size choice:
– Start with 128-256 threads per block
  • Adjust up/down by what best matches your function
  • Example: stencil codes prefer larger blocks to minimize halos
– Multiple of warp size (32 threads)
– If occupancy is critical to performance:
  • Check that block size isn’t precluding occupancy allowed by register and SMEM resources

Grid size:
– 1,000 or more threadblocks
  • 10s of waves of threadblocks: no need to think about tail effect
  • Makes your code ready for several generations of future GPUs

So the minimum is 32 (one warp), and our default should be larger. Let's measure its effect once we have a stable configuration.
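As a rough illustration of the rules of thumb quoted above, here is a minimal host-side sketch. The names `kWarpSize`, `round_up_to_warp` and `follows_guidelines` are hypothetical and not part of galario; the thresholds (warp size 32, at least 1,000 blocks) are taken directly from the slide.

```cpp
#include <cassert>

// Warp size as quoted in the slide (32 threads).
constexpr int kWarpSize = 32;

// Round a requested block size up to the next multiple of the warp size.
constexpr int round_up_to_warp(int threads) {
    return ((threads + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

// Hypothetical check: does a launch follow the slide's rules of thumb,
// i.e. block size is a multiple of the warp size and the grid has
// at least 1,000 blocks?
bool follows_guidelines(int num_blocks, int threads_per_block) {
    assert(num_blocks > 0 && threads_per_block > 0);
    return threads_per_block % kWarpSize == 0 && num_blocks >= 1000;
}
```

For example, `round_up_to_warp(100)` gives 128, which is also the lower end of the suggested 128-256 starting range.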

mtazzari commented 7 years ago

I have done a few performance tests on my desktop, with a GTX 1060. We should definitely increase nthreads, at least to 64. Some numbers:

Assumed: double precision.

mtazzari commented 7 years ago

I found a nice compendium of the maximum threads per block for different cards here: https://en.wikipedia.org/wiki/CUDA

For GTX 1060 compute capability is 6.1 -> maximum threads per block is 1024.

fredRos commented 7 years ago

With P100 and double precision, I get

mtazzari commented 7 years ago

Unfortunately, I have found that for nthreads larger than 32 many tests fail: the code returns NaN in many places. To be investigated.

mtazzari commented 7 years ago

We have investigated this and found that 1024 is the maximum number of threads per block for compute capability 2.x and higher. Kernels operating on 2d objects, e.g.:

Kernel<<<dim3(size/tpb +1, size/tpb +1), dim3(tpb, tpb)>>>(arg1, arg2, ...);

have a block size of tpb*tpb threads, therefore 32 is the maximum tpb that can be used (32*32 = 1024 threads per block).
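The limit can be checked on the host before launching. The sketch below is only an illustration: `max_square_tpb` and `is_valid_2d_block` are hypothetical helpers, not galario functions, and the per-block limit is passed in as a plain integer (1024 here, per the discussion above). A launch that exceeds the limit is rejected by the CUDA runtime and the kernel does not run, which would be consistent with the NaNs seen for nthreads larger than 32, although that link is an assumption.

```cpp
#include <cassert>

// Largest tpb such that a tpb x tpb block stays within the per-block limit.
int max_square_tpb(int max_threads_per_block) {
    assert(max_threads_per_block > 0);
    int tpb = 1;
    while ((tpb + 1) * (tpb + 1) <= max_threads_per_block) {
        ++tpb;
    }
    return tpb;
}

// True if a dim3(tpb, tpb) block fits within the per-block thread limit.
bool is_valid_2d_block(int tpb, int max_threads_per_block) {
    return tpb > 0 && tpb * tpb <= max_threads_per_block;
}
```

With a limit of 1024, `max_square_tpb` returns 32, while `is_valid_2d_block(64, 1024)` is false because 64*64 = 4096 threads.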

Kernels operating on 1d objects, in turn, can be launched with:

Kernel<<<size/(tpb*tpb) +1, tpb*tpb>>>(arg1, arg2, ...);

to maximize occupancy. We implemented this in the r2c branch in commit 039b42f3c0b1f1a72c80939564248dad41403797.
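For reference, the 1d launch arithmetic can be written out on the host as below. This is a sketch with hypothetical names (`Launch1D`, `launch_as_in_thread`, `launch_ceil`), not the code from the commit. It compares the `size/(tpb*tpb) + 1` form used above with a ceiling division, which launches one block fewer when `size` is an exact multiple of the block size. Either way, the kernel is assumed to guard against out-of-range indices.

```cpp
#include <cassert>

// Grid and block sizes for a 1d launch.
struct Launch1D {
    int blocks;
    int threads;
};

// Same arithmetic as Kernel<<<size/(tpb*tpb) + 1, tpb*tpb>>>.
Launch1D launch_as_in_thread(int size, int tpb) {
    assert(size > 0 && tpb > 0);
    const int threads = tpb * tpb;
    return {size / threads + 1, threads};
}

// Alternative using ceiling division: no extra block when
// size is an exact multiple of the block size.
Launch1D launch_ceil(int size, int tpb) {
    assert(size > 0 && tpb > 0);
    const int threads = tpb * tpb;
    return {(size + threads - 1) / threads, threads};
}
```

For `size = 4096` and `tpb = 32`, the first form gives 5 blocks of 1024 threads and the second gives 4.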