Open · XiaotaoChen opened this issue 4 years ago
This is a very old issue; I hope you figured it out. I also ran into the same problem recently. The author notes this issue in the README as well:
> For some unknown reason (for now), incorporating loop unrolling in the implementation makes the Debug compiled code output the correct values, while the Release compiled code outputs incorrect values.
What I think is happening: without marking the shared memory as volatile, the compiler optimizes away the repeated memory reads. In other words, it assumes that `sA[tid + 16]`, `sA[tid + 8]`, ... have not been changed by other threads, so it loads all of them at the start and accumulates the results in a register. This is why the problem only occurs in Release mode, where these optimizations are enabled.
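To make the failure mode concrete, here is a sketch of an unrolled last-warp reduction without `volatile` (the function name is hypothetical; `sA` and `tid` follow the notation above). Nothing tells the compiler that other threads update `sA` between these statements:

```cuda
// BROKEN in Release builds: the compiler sees no writer of sA[tid + 32],
// sA[tid + 16], ... in this thread, so it may load them all up front and
// keep the running sum in a register, missing the other threads' updates.
__device__ void warpReduceBroken(float *sA, int tid) {
    sA[tid] += sA[tid + 32];
    sA[tid] += sA[tid + 16];
    sA[tid] += sA[tid + 8];
    sA[tid] += sA[tid + 4];
    sA[tid] += sA[tid + 2];
    sA[tid] += sA[tid + 1];
}
```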
Apart from calling `warpFunc()`, which takes the pointer as volatile, I found that simply re-casting the pointer to volatile also works:

```cuda
// The volatile qualifier forces every access to go back to shared memory
// instead of being cached in a register.
volatile float *_shmem = shmem;
for (int stride = warpSize; stride > 0; stride /= 2)
    _shmem[tid] += _shmem[tid + stride];
```
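In kernel context, the cast slots into the usual `if (tid < 32)` guard. A minimal sketch, assuming `blockDim.x` is a power of two and at least 64, so that threads 0–31 fold 64 partial sums:

```cuda
// Last-warp unroll with an in-place volatile cast. No barrier is needed
// inside the loop on pre-Volta GPUs, where the warp runs in lockstep.
if (tid < 32) {
    volatile float *_shmem = shmem;
    for (int stride = 32; stride > 0; stride /= 2)
        _shmem[tid] += _shmem[tid + stride];
}
```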
Calling `__syncthreads()` also works, as you did in your version 3, since it is a barrier that forces shared memory to be re-read afterwards. `__syncwarp()` also worked in my experiments.
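For completeness, here is a sketch of a `__syncwarp()` variant (my naming). On Volta and newer GPUs, warp lockstep is not guaranteed (independent thread scheduling), so separating each step's loads from its stores with `__syncwarp()` is the robust form, even without `volatile`:

```cuda
// __syncwarp() is a barrier plus memory fence for the warp: the first call
// ensures all loads of a step finish before any store, and the second
// ensures all stores are visible before the next step's loads. Assumes all
// 32 threads of the warp enter this function (e.g. under `if (tid < 32)`).
__device__ void warpReduceSync(float *sA, int tid) {
    float v = sA[tid];
    for (int stride = 32; stride > 0; stride /= 2) {
        v += sA[tid + stride];
        __syncwarp();
        sA[tid] = v;
        __syncwarp();
    }
}
```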
Hope this also helps other people!
Hi, I'm learning CUDA reduction from the NVIDIA doc. When I unroll the last warp with the device function `warpReduce`, the result is correct. However, the result is wrong when I replace `warpReduce` with the equivalent inline code block. According to your code in `reduce.cu`, you used the inline code block instead of the `warpReduce` function, and when I tested your code the result was correct. I don't know what's wrong with my code; could you review it and find out? Thanks. The test code and its output are attached below as screenshots.