Closed fredRos closed 7 years ago
I have done a few performance tests on my desktop, with a GTX 1060. We should definitely increase nthreads, at least to 64. Some numbers:
Assumed: double precision.
I found a nice compendium of maximum threads per block for different cards here: https://en.wikipedia.org/wiki/CUDA
For the GTX 1060 the compute capability is 6.1 -> the maximum number of threads per block is 1024.
With P100 and double precision, I get
Unfortunately, I have found that for nthreads larger than 32 many tests fail: the code returns NaN in many places. To be investigated.
We have investigated this and found that 1024 is the maximum number of threads per block for compute capability >= 2.x. Kernels operating on 2D objects, e.g.:
Kernel<<<dim3(size/tpb +1, size/tpb +1), dim3(tpb, tpb)>>>(arg1, arg2, ...);
have a block size of tpb*tpb, so 32 is the maximum tpb that can be used (it gives 1024 threads per block).
Therefore, kernels operating on 1d objects can be launched with:
Kernel<<<size/(tpb*tpb) +1, tpb*tpb>>>(arg1, arg2, ...);
to maximize occupancy. In the r2c branch we implemented this in 039b42f3c0b1f1a72c80939564248dad41403797.
From the GTC presentation "GPU Performance Analysis and Optimization" by Paulius Micikevicius (Developer Technology, NVIDIA):
So the minimum is 32, and our default should be larger. Let's see its effect once we have a stable configuration.