Closed Chr1s603 closed 1 year ago
Hey there, without this change I measure only 5 ms for size 3 on my RTX 2080. With the synchronization, it's around 500 ms, so this is definitely missing here.
Thanks! I think it would be better to use CUDA events to measure latency in this case. I will go ahead and update them.
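For reference, here is a minimal sketch of what event-based timing could look like. Kernel launches return to the host before the GPU finishes, which is why a host-side timer without synchronization reports a number that is far too small. The kernel `busy_kernel`, the problem size, and the launch configuration are hypothetical stand-ins, not the kernels from this PR, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the kernels touched by this PR are not shown here.
__global__ void busy_kernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) {
            v = v * 1.0001f + 0.5f;
        }
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 20;  // assumed problem size, for illustration only
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Both events are enqueued on the same (default) stream as the kernel,
    // so they bracket the kernel's execution on the GPU.
    cudaEventRecord(start);
    busy_kernel<<<(n + 255) / 256, 256>>>(d_data, n, 1000);
    cudaEventRecord(stop);

    // Block the host until the stop event has completed on the GPU.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Compared with calling `cudaDeviceSynchronize()` around a host timer, the events are timestamped on the GPU, so the measurement excludes most host-side launch overhead.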
Allows measuring the actual computation time of the kernels