Closed cmh1027 closed 1 week ago
The workload is launched with a block dimension of 32x32. Since the convolution in SSIM uses an 11x11 kernel, the window loaded into shared memory is 42x42 (32 plus a halo of 5 on each side). The separable convolution in the X dimension takes a 42x42 window as input and outputs a 42x32 window. (Convolution in the X direction reduces the X dimension.) This means that 42x32 1D convolutions have to be computed. Hence, first all 32x32 threads do one convolution each. Then only the first 10x32 threads do the remaining 10x32. The statement local_y < SY ensures that only the threads in the first 10 rows take part in the second phase.
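The two phases can be sketched on the CPU as follows. This is a plain Python simulation of one thread block, not the actual CUDA kernel: the names BLOCK_X, BLOCK_Y, HALO, TILE and horizontal_pass are hypothetical, SY = 10 is inferred from the explanation above, and the uniform weights are a placeholder for the real Gaussian window.

```python
BLOCK_X, BLOCK_Y = 32, 32   # thread block dimensions (from the thread)
KERNEL = 11                 # SSIM window width (from the thread)
HALO = KERNEL // 2          # 5 pixels of padding on each side (hypothetical name)
TILE = BLOCK_X + 2 * HALO   # 42: side of the shared-memory window (hypothetical name)
SY = 2 * HALO               # 10: rows left over after phase one (assumed meaning of SY)


def conv_row(tile, row, x, weights):
    """1D horizontal convolution for output column x of one tile row."""
    return sum(weights[k] * tile[row][x + k] for k in range(KERNEL))


def horizontal_pass(tile, weights):
    """Simulate the X pass: 42x42 input, 42x32 output (rows x columns)."""
    scratch = [[None] * BLOCK_X for _ in range(TILE)]
    work = {"phase1": 0, "phase2": 0}
    for local_y in range(BLOCK_Y):          # each (local_y, local_x) is one thread
        for local_x in range(BLOCK_X):
            # Phase one: every thread computes one output in rows 0..31.
            scratch[local_y][local_x] = conv_row(tile, local_y, local_x, weights)
            work["phase1"] += 1
            # Phase two: only threads with local_y < SY compute rows 32..41.
            if local_y < SY:
                row = local_y + BLOCK_Y
                scratch[row][local_x] = conv_row(tile, row, local_x, weights)
                work["phase2"] += 1
    return scratch, work


# Placeholder input and weights, chosen only to exercise the indexing.
tile = [[float(y * TILE + x) for x in range(TILE)] for y in range(TILE)]
weights = [1.0 / KERNEL] * KERNEL
scratch, work = horizontal_pass(tile, weights)
print(work)  # {'phase1': 1024, 'phase2': 320}
```

Phase one covers 32x32 = 1024 outputs and phase two the remaining 10x32 = 320, which together fill all 42x32 = 1344 entries of the intermediate window.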
@rahul-goel Thanks for the reply. One more question, please: why is flush_conv_scratch necessary? Aren't the values overwritten anyway without it?
I didn't explain this correctly in my previous comment. I've updated it. Please have a look.
I was flushing in an earlier implementation where I think it was necessary, and it got carried over from there. I haven't checked whether removing it changes the results or what performance benefit removing it gives.
I checked. Flushing isn't necessary, although removing it doesn't make a noticeable difference.
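One way to see why the flush can be dropped is to run the same pass on a zeroed scratch buffer and on one pre-filled with stale values, then compare. The sketch below is a CPU simulation in Python, not the kernel itself. It assumes that flush_conv_scratch simply zeroes the scratch buffer and that every scratch entry is written by the two-phase pass before it is read. The names horizontal_pass and make_scratch are hypothetical, SY = 10 is inferred from the discussion, and a uniform average stands in for the Gaussian weights.

```python
import math

BLOCK_X, BLOCK_Y, KERNEL = 32, 32, 11
HALO = KERNEL // 2          # 5 (hypothetical name)
TILE = BLOCK_X + 2 * HALO   # 42 (hypothetical name)
SY = 2 * HALO               # 10 (assumed meaning of SY)


def make_scratch(fill):
    """Scratch buffer of 42 rows x 32 columns, pre-filled with one value."""
    return [[fill] * BLOCK_X for _ in range(TILE)]


def horizontal_pass(tile, scratch):
    """Two-phase X pass that writes into an existing scratch buffer."""
    for local_y in range(BLOCK_Y):
        for local_x in range(BLOCK_X):
            # Phase one: rows 0..31.
            window = tile[local_y][local_x:local_x + KERNEL]
            scratch[local_y][local_x] = sum(window) / KERNEL
            # Phase two: rows 32..41, handled by threads with local_y < SY.
            if local_y < SY:
                row = local_y + BLOCK_Y
                window = tile[row][local_x:local_x + KERNEL]
                scratch[row][local_x] = sum(window) / KERNEL
    return scratch


# Placeholder input, chosen only to give non-trivial values.
tile = [[float((7 * y + 3 * x) % 13) for x in range(TILE)] for y in range(TILE)]

flushed = horizontal_pass(tile, make_scratch(0.0))           # buffer zeroed first
stale = horizontal_pass(tile, make_scratch(float("nan")))    # buffer left "dirty"

# Any entry that was never overwritten would still be NaN and break equality.
results_match = flushed == stale
print(results_match)
```

Because the two phases together write all 42x32 entries, no stale value survives, so under these assumptions the initial contents of the buffer do not affect the result.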
Closing this for now. Please re-open if necessary.
What's the purpose of recomputing the convolution value when local_y < SY?