Transpose does not scale well with multithreading

Benchmarked on a dual Intel Xeon Gold 6154 machine at commit 990e59f.

With OpenMP, compiled and run with: nim cpp --passC:"-D_GNU_SOURCE" --passL:"-lpthread" -r -d:release -d:openmp -o:build/bench_transpose benchmarks/transpose/transpose_bench.nim

Multithreaded results:

Hint: ./build/bench_transpose  [Exec]
Warmup: 0.9945 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 4000, N: 2000)
Output shape: (M: 2000, N: 4000)
Required number of operations:     8.000 millions
Required bytes:                   32.000 MB
Arithmetic intensity:              0.250 FLOP/byte
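
The header figures can be re-derived from the matrix shape. The sketch below assumes one memory operation per element and 4 bytes counted per element, which is what 32 MB over 8 million operations implies. The log does not say how the benchmark itself counts bytes, so `bytes_per_element` is an inferred value, not one taken from the source.

```python
# Re-derive the benchmark header from the matrix shape.
M, N = 4000, 2000

ops = M * N                       # one memory operation per element
bytes_per_element = 4             # assumption: inferred from 32 MB / 8 M ops
required_bytes = ops * bytes_per_element
intensity = ops / required_bytes  # operations per byte

print(f"Required number of operations: {ops / 1e6:.3f} millions")
print(f"Required bytes:                {required_bytes / 1e6:.3f} MB")
print(f"Arithmetic intensity:          {intensity:.3f} FLOP/byte")
```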

Laser ForEachStrided
Collected 250 samples in 0.500 seconds
Average time: 1.518 ms
Stddev  time: 2.158 ms
Min     time: 1.153 ms
Max     time: 25.356 ms
Perf:         5.271 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318
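
For reading the Perf lines: they look like the operation count divided by the average time. A quick check, assuming that formula (the helper name `gmemops_per_s` is mine, not from the benchmark), reproduces the logged value to within rounding of the printed average.

```python
def gmemops_per_s(ops: int, avg_ms: float) -> float:
    """Throughput in giga memory operations per second."""
    seconds = avg_ms * 1e-3
    return ops / seconds / 1e9

# Laser ForEachStrided, multithreaded: 8 M ops, average time 1.518 ms
perf = gmemops_per_s(8_000_000, 1.518)
print(f"Perf: {perf:.3f} GMEMOPs/s (log reports 5.271)")
```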

Naive transpose
Collected 250 samples in 0.266 seconds
Average time: 1.062 ms
Stddev  time: 0.418 ms
Min     time: 0.936 ms
Max     time: 3.818 ms
Perf:         7.530 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Naive transpose - input row iteration
Collected 250 samples in 0.400 seconds
Average time: 1.598 ms
Stddev  time: 2.117 ms
Min     time: 0.969 ms
Max     time: 23.107 ms
Perf:         5.006 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Collapsed OpenMP
Collected 250 samples in 0.411 seconds
Average time: 1.642 ms
Stddev  time: 2.530 ms
Min     time: 0.924 ms
Max     time: 31.653 ms
Perf:         4.871 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Collapsed OpenMP - input row iteration
Collected 250 samples in 0.445 seconds
Average time: 1.781 ms
Stddev  time: 2.011 ms
Min     time: 1.162 ms
Max     time: 24.661 ms
Perf:         4.492 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking
Collected 250 samples in 0.068 seconds
Average time: 0.270 ms
Stddev  time: 0.222 ms
Min     time: 0.239 ms
Max     time: 2.669 ms
Perf:         29.637 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318
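
For readers unfamiliar with the labels, here is a minimal sketch of what "naive" versus "cache blocking" usually means for a transpose. It is an illustration only: the benchmark's Nim source is not shown in this issue, so the function names, the block size of 32 and the loop order are assumptions, and Python will not show the cache effect itself.

```python
def transpose_naive(a, m, n):
    """Transpose a row-major m x n matrix stored as a flat list."""
    out = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            # Reads are contiguous, writes jump by m elements each step.
            out[j * m + i] = a[i * n + j]
    return out


def transpose_blocked(a, m, n, block=32):
    """Same result, but visits the matrix one block x block tile at a time
    so the strided writes stay within a small working set."""
    out = [0.0] * (m * n)
    for ii in range(0, m, block):
        for jj in range(0, n, block):
            for i in range(ii, min(ii + block, m)):
                for j in range(jj, min(jj + block, n)):
                    out[j * m + i] = a[i * n + j]
    return out


# Small check that both variants agree on a non-square input.
a = [float(x) for x in range(5 * 7)]
assert transpose_blocked(a, 5, 7, block=3) == transpose_naive(a, 5, 7)
```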

Cache blocking - input row iteration
Collected 250 samples in 0.179 seconds
Average time: 0.715 ms
Stddev  time: 0.279 ms
Min     time: 0.657 ms
Max     time: 3.240 ms
Perf:         11.184 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling
Collected 250 samples in 0.066 seconds
Average time: 0.265 ms
Stddev  time: 0.159 ms
Min     time: 0.241 ms
Max     time: 2.447 ms
Perf:         30.189 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling - input row iteration
Collected 250 samples in 0.056 seconds
Average time: 0.223 ms
Stddev  time: 0.095 ms
Min     time: 0.203 ms
Max     time: 1.459 ms
Perf:         35.896 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking with Prefetch
Collected 250 samples in 0.069 seconds
Average time: 0.277 ms
Stddev  time: 0.160 ms
Min     time: 0.252 ms
Max     time: 2.446 ms
Perf:         28.844 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling + Prefetch - input row iteration
Collected 250 samples in 0.175 seconds
Average time: 0.698 ms
Stddev  time: 1.759 ms
Min     time: 0.371 ms
Max     time: 18.627 ms
Perf:         11.455 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Production implementation
Collected 250 samples in 0.144 seconds
Average time: 0.574 ms
Stddev  time: 0.975 ms
Min     time: 0.382 ms
Max     time: 12.650 ms
Perf:         13.933 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Without OpenMP: nim cpp --passC:"-D_GNU_SOURCE" --passL:"-lpthread" -r -d:release -o:build/bench_transpose benchmarks/transpose/transpose_bench.nim

Singlethreaded results:

Hint: ./build/bench_transpose  [Exec]
Warmup: 0.9940 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 4000, N: 2000)
Output shape: (M: 2000, N: 4000)
Required number of operations:     8.000 millions
Required bytes:                   32.000 MB
Arithmetic intensity:              0.250 FLOP/byte

Laser ForEachStrided
Collected 250 samples in 9.080 seconds
Average time: 35.957 ms
Stddev  time: 0.289 ms
Min     time: 35.666 ms
Max     time: 37.249 ms
Perf:         0.222 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Naive transpose
Collected 250 samples in 8.580 seconds
Average time: 34.320 ms
Stddev  time: 0.320 ms
Min     time: 32.876 ms
Max     time: 35.604 ms
Perf:         0.233 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Naive transpose - input row iteration
Collected 250 samples in 8.637 seconds
Average time: 34.549 ms
Stddev  time: 0.243 ms
Min     time: 34.378 ms
Max     time: 35.767 ms
Perf:         0.232 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Collapsed OpenMP
Collected 250 samples in 8.674 seconds
Average time: 34.695 ms
Stddev  time: 0.361 ms
Min     time: 33.291 ms
Max     time: 36.134 ms
Perf:         0.231 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Collapsed OpenMP - input row iteration
Collected 250 samples in 8.694 seconds
Average time: 34.775 ms
Stddev  time: 0.339 ms
Min     time: 34.471 ms
Max     time: 36.496 ms
Perf:         0.230 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking
Collected 250 samples in 2.383 seconds
Average time: 9.533 ms
Stddev  time: 0.172 ms
Min     time: 9.345 ms
Max     time: 10.990 ms
Perf:         0.839 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking - input row iteration
Collected 250 samples in 4.512 seconds
Average time: 18.047 ms
Stddev  time: 0.232 ms
Min     time: 17.833 ms
Max     time: 19.423 ms
Perf:         0.443 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling
Collected 250 samples in 3.625 seconds
Average time: 14.498 ms
Stddev  time: 0.236 ms
Min     time: 14.244 ms
Max     time: 15.882 ms
Perf:         0.552 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling - input row iteration
Collected 250 samples in 2.491 seconds
Average time: 9.964 ms
Stddev  time: 0.222 ms
Min     time: 9.820 ms
Max     time: 11.652 ms
Perf:         0.803 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking with Prefetch
Collected 250 samples in 2.583 seconds
Average time: 10.331 ms
Stddev  time: 0.169 ms
Min     time: 9.836 ms
Max     time: 11.829 ms
Perf:         0.774 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling + Prefetch - input row iteration
Collected 250 samples in 2.699 seconds
Average time: 10.796 ms
Stddev  time: 0.216 ms
Min     time: 10.669 ms
Max     time: 12.463 ms
Perf:         0.741 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Production implementation
Collected 250 samples in 2.712 seconds
Average time: 10.849 ms
Stddev  time: 0.181 ms
Min     time: 10.708 ms
Max     time: 12.350 ms
Perf:         0.737 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318
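
To compare the two runs, the script below computes the multithreaded speedup of each kernel as single-threaded average time over multithreaded average time. The times are copied from the logs above. Using the averages is my choice; the multithreaded runs have large standard deviations, so the Min times may give a different picture.

```python
# (multithreaded avg ms, single-threaded avg ms), copied from the logs above
avg_ms = {
    "Laser ForEachStrided": (1.518, 35.957),
    "Naive transpose": (1.062, 34.320),
    "Naive transpose - input row iteration": (1.598, 34.549),
    "Collapsed OpenMP": (1.642, 34.695),
    "Collapsed OpenMP - input row iteration": (1.781, 34.775),
    "Cache blocking": (0.270, 9.533),
    "Cache blocking - input row iteration": (0.715, 18.047),
    "2D Tiling": (0.265, 14.498),
    "2D Tiling - input row iteration": (0.223, 9.964),
    "Cache blocking with Prefetch": (0.277, 10.331),
    "2D Tiling + Prefetch - input row iteration": (0.698, 10.796),
    "Production implementation": (0.574, 10.849),
}

# Speedup = single-threaded time / multithreaded time
speedups = {name: single / multi for name, (multi, single) in avg_ms.items()}

for name, s in sorted(speedups.items(), key=lambda kv: kv[1]):
    print(f"{name:45s} {s:6.1f}x")
```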

mratsim/laser issue #13