This PR adds a specialized 2D version of the current transpose operator. The current operator handles a general N-dimensional transpose, while this PR implements a dedicated 2D version to speed up 2D transposes.
The 2D kernel uses thread coarsening and (static) shared memory.
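For readers unfamiliar with the two techniques, here is a minimal sketch of how a tiled 2D transpose can combine them. It is not the kernel from this PR: the names `transpose2d_sketch`, `launch_transpose2d_sketch`, and `transpose2d_on_device`, the tile size of 32, and the 8 rows of threads per block are illustrative assumptions, and the one-column padding of the tile is a common extra that this description does not mention. Error checking of the CUDA calls is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical parameters, chosen for illustration only.
constexpr int TILE_DIM = 32;   // edge length of the square tile staged in shared memory
constexpr int BLOCK_ROWS = 8;  // threads per block in y; each thread handles TILE_DIM / BLOCK_ROWS elements

// Transposes a row-major (rows x cols) matrix `in` into a row-major (cols x rows) matrix `out`.
template <typename T>
__global__ void transpose2d_sketch(const T* __restrict__ in, T* __restrict__ out,
                                   int rows, int cols) {
  // Static shared memory: size is fixed at compile time.
  // The extra column is padding (an assumption) to avoid shared-memory bank conflicts.
  __shared__ T tile[TILE_DIM][TILE_DIM + 1];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;  // input column
  int y = blockIdx.y * TILE_DIM + threadIdx.y;  // input row

  // Thread coarsening: each thread loads several rows of the tile.
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
    if (x < cols && (y + j) < rows) {
      tile[threadIdx.y + j][threadIdx.x] = in[static_cast<size_t>(y + j) * cols + x];
    }
  }
  __syncthreads();  // the whole tile must be loaded before any thread reads it back

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // output column (an input row index)
  y = blockIdx.x * TILE_DIM + threadIdx.y;  // output row (an input column index)

  // Reads from the tile are transposed; writes to global memory stay coalesced.
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
    if (x < rows && (y + j) < cols) {
      out[static_cast<size_t>(y + j) * rows + x] = tile[threadIdx.x][threadIdx.y + j];
    }
  }
}

// Launches one block per tile, rounding the grid up to cover partial tiles.
template <typename T>
void launch_transpose2d_sketch(const T* d_in, T* d_out, int rows, int cols,
                               cudaStream_t stream = 0) {
  dim3 block(TILE_DIM, BLOCK_ROWS);
  dim3 grid((cols + TILE_DIM - 1) / TILE_DIM, (rows + TILE_DIM - 1) / TILE_DIM);
  transpose2d_sketch<T><<<grid, block, 0, stream>>>(d_in, d_out, rows, cols);
}

// Host-side convenience wrapper used only to demonstrate the kernel.
std::vector<float> transpose2d_on_device(const std::vector<float>& h_in, int rows, int cols) {
  const size_t bytes = h_in.size() * sizeof(float);
  float* d_in = nullptr;
  float* d_out = nullptr;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
  launch_transpose2d_sketch(d_in, d_out, rows, cols);
  std::vector<float> h_out(h_in.size());
  // This blocking copy on the default stream waits for the kernel to finish.
  cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return h_out;
}
```

In this sketch, staging a tile in shared memory lets both the global reads and the global writes run along contiguous addresses, and coarsening lets a 32x8 block cover a 32x32 tile so each thread amortizes its index arithmetic over four elements. The bounds checks make it usable for shapes that are not multiples of the tile size.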
Benchmark result: