
mtazzari / galario

Gpu Accelerated Library for Analysing Radio Interferometer Observations

https://mtazzari.github.io/galario/

GNU Lesser General Public License v3.0


threadblock #52

Closed fredRos closed 7 years ago

fredRos commented 7 years ago

From the GTC presentation "GPU Performance Analysis and Optimization" (Paulius Micikevicius, Developer Technology, NVIDIA):

Threadblock size choice:
– Start with 128-256 threads per block
  • Adjust up/down by what best matches your function
  • Example: stencil codes prefer larger blocks to minimize halos
– Multiple of warp size (32 threads)
– If occupancy is critical to performance:
  • Check that block size isn’t precluding occupancy allowed by register and SMEM resources

Grid size:
– 1,000 or more threadblocks
  • 10s of waves of threadblocks: no need to think about tail effect
  • Makes your code ready for several generations of future GPUs

So the minimum is 32, and our default should be larger. Let's measure its effect once we have a stable configuration.
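The sizing guidance above can be sketched as a small host-side helper. This is only an illustration, not galario code: the names `round_block_size` and `num_blocks` are hypothetical, and the warp size of 32 and the cap of 1024 threads per block are assumptions about the target GPU.

```cpp
#include <cstdint>

// Assumed hardware parameters (not queried from the device).
const int kWarpSize = 32;
const int kMaxThreadsPerBlock = 1024;

// Round a requested block size up to the next multiple of the warp size,
// then clamp it to the range [kWarpSize, kMaxThreadsPerBlock].
inline int round_block_size(int requested) {
    int rounded = ((requested + kWarpSize - 1) / kWarpSize) * kWarpSize;
    if (rounded < kWarpSize) rounded = kWarpSize;
    if (rounded > kMaxThreadsPerBlock) rounded = kMaxThreadsPerBlock;
    return rounded;
}

// Number of blocks needed to cover n elements with one thread per element.
inline std::int64_t num_blocks(std::int64_t n, int block_size) {
    return (n + block_size - 1) / block_size;
}
```

For instance, a request for 100 threads becomes 128, and covering 16777216 elements (a 4096 x 4096 grid, chosen here purely as an illustration) with 128 threads per block takes 131072 blocks, comfortably above the "1,000 or more threadblocks" guideline.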

mtazzari commented 7 years ago

I have done a few performance tests on my desktop, with a GTX 1060. We should definitely increase nthreads, at least to 64. Some numbers:

matrix size 4096, nthreads 32, time(chi2) 100ms
matrix size 4096, nthreads 64, time(chi2) 62ms
matrix size 8192, nthreads 32, time(chi2) 400ms
matrix size 8192, nthreads 64, time(chi2) 210ms
Larger nthreads (tried 128, 256, 512, 1024) do not seem to change the time.

Assumed: double precision.

mtazzari commented 7 years ago

I found a nice compendium here that lists the maximum threads per block for different cards: https://en.wikipedia.org/wiki/CUDA

For GTX 1060 compute capability is 6.1 -> maximum threads per block is 1024.

fredRos commented 7 years ago

With P100 and double precision, I get

4k, 32 threads, 57 ms
4k, 64 threads, 39 ms
4k, 128 threads, 42 ms
8k, 32 threads, 307 ms
8k, 64 threads, 133 ms
8k, 128 threads, 96 ms
8k, 256 threads, 131 ms
16k, 32 threads, ms
16k, 64 threads, ms
16k, 128 threads, 480 ms
16k, 256 threads, 487 ms

mtazzari commented 7 years ago

Unfortunately, I have found that for nthreads larger than 32 many tests fail: the code returns NaN in many places. To be investigated.

mtazzari commented 7 years ago

We have investigated this and found that 1024 is the maximum number of threads per block for compute capability 2.x and higher. Kernels operating on 2D objects, e.g.:

Kernel<<<dim3(size/tpb +1, size/tpb +1), dim3(tpb, tpb)>>>(arg1, arg2, ...);

have a block size of tpb*tpb threads, so 32 is the largest tpb that can be used (it gives 1024 threads per block). A tpb of 64, for example, would request 4096 threads per block, which exceeds the limit.
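The constraint on square 2D blocks can be checked on the host before launching. The sketch below is a hedged illustration: `is_valid_2d_block` and `max_square_tpb` are hypothetical names, and the default limit of 1024 is an assumption (in real code it could be read from the device properties instead).

```cpp
// True if a tpb x tpb block fits within the per-block thread limit.
inline bool is_valid_2d_block(int tpb, int max_threads = 1024) {
    return tpb > 0 && static_cast<long long>(tpb) * tpb <= max_threads;
}

// Largest tpb such that tpb * tpb does not exceed max_threads.
inline int max_square_tpb(int max_threads = 1024) {
    int tpb = 0;
    while (static_cast<long long>(tpb + 1) * (tpb + 1) <= max_threads) {
        ++tpb;
    }
    return tpb;
}
```

With the assumed limit of 1024, `max_square_tpb()` gives 32, `is_valid_2d_block(32)` holds, and `is_valid_2d_block(64)` does not, since 64 * 64 = 4096.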

Kernels operating on 1D objects can then be launched with:

Kernel<<<size/(tpb*tpb) +1, tpb*tpb>>>(arg1, arg2, ...);

to maximize occupancy. In the r2c branch we implemented this in commit 039b42f3c0b1f1a72c80939564248dad41403797.
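As a side note on the grid-size arithmetic, the `size/threads + 1` form used in the launches above always covers the data, but it adds one spare block when `size` is an exact multiple of the block size. A common alternative is a ceiling division. The helper names below are hypothetical, and either form assumes the kernel guards against indices beyond `size`.

```cpp
// Grid size as written in the launches above: integer division plus one.
inline int blocks_plus_one(int size, int threads) {
    return size / threads + 1;
}

// Ceiling division: the smallest number of blocks that still covers size.
inline int blocks_ceil(int size, int threads) {
    return (size + threads - 1) / threads;
}
```

For a 1D launch with tpb = 32 (so 1024 threads per block) and size = 4096, the first form gives 5 blocks and the second gives 4. For size = 4097 both give 5.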