mratsim / weave

A state-of-the-art multithreading runtime: message-passing based, fast, scalable, ultra-low overhead

BLAS matmul vs Laser, Intel MKL, MKL-DNN, OpenBLAS and co #68

Closed mratsim closed 4 years ago

mratsim commented 4 years ago

Somewhat related to #31 and very probably related to #35

~By giving Laser (OpenMP-based) time to get the CPU to full speed, it can reach 2.23 TFlops on my CPU:~ see below, nimsuggest was running at 100%

Backend:                        Laser (Pure Nim) + OpenMP
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Laser production implementation
Collected 300 samples in 2245 ms
Average time: 6.327 ms
Stddev  time: 2.929 ms
Min     time: 5.000 ms
Max     time: 48.000 ms
Perf:         2237.478 GFLOP/s
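As a sanity check, the header figures can be re-derived from the matrix shapes (assuming fp32 and counting only the A and B inputs for memory traffic, which matches the 29.491 MB figure):

```python
# Re-derive the benchmark header figures for a 1920x1920x1920 fp32 GEMM.
M = N = K = 1920

flops = 2 * M * N * K              # one multiply + one add per inner-loop step
in_bytes = (M * K + K * N) * 4     # A and B inputs only, fp32 = 4 bytes
intensity = flops / in_bytes       # FLOP per byte of input read

print(flops / 1e6)      # 14155.776 million operations
print(in_bytes / 1e6)   # 29.4912 MB (header rounds to 29.491)
print(intensity)        # 480.0 FLOP/byte

# Perf is simply total FLOPs over the average run time:
avg_time_s = 6.327e-3              # Laser's average time from the stats above
print(flops / avg_time_s / 1e9)    # ~2237 GFLOP/s
```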

Intel MKL is at 2.85 TFlops

Backend:                        Intel MKL
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Intel MKL benchmark
Collected 300 samples in 1842 ms
Average time: 4.970 ms
Stddev  time: 3.956 ms
Min     time: 4.000 ms
Max     time: 70.000 ms
Perf:         2848.245 GFLOP/s

OpenBLAS is at 1.52 TFlops

Backend:                        OpenBLAS
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

OpenBLAS benchmark
Collected 300 samples in 3124 ms
Average time: 9.300 ms
Stddev  time: 2.637 ms
Min     time: 9.000 ms
Max     time: 40.000 ms
Perf:         1522.126 GFLOP/s

Unfortunately Weave is stuck at 0.6 TFlops.

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 300 samples in 7686 ms
Average time: 24.553 ms
Stddev  time: 4.426 ms
Min     time: 17.000 ms
Max     time: 60.000 ms
Perf:         576.532 GFLOP/s

First look

image

On the right, more than half of the instructions for Weave are not floating-point instructions. Also, the CPU frequency seems to sit at 2.7 GHz for Weave while Laser reaches 3.6 GHz. My BIOS settings are 4.1 GHz all-core turbo for normal code, 4.0 GHz for AVX2 and 3.5 GHz for AVX-512 code.

Call stack

image

Looking into the call stack, it seems we are paying the price of dynamic scheduling :/ as the time spent inside the GEMM kernel itself is similar.

Conclusion

We have very high overhead for coarse-grained parallelism where OpenMP shines. I suspect this is also the reason behind #35.

Suspicions

Additionally, I suspect the runtime is polluting the L1 cache with runtime data, while GEMM has been carefully tuned to take full advantage of the L1 and L2 caches. Using smaller tiles might help (similarly, avoiding hyperthreading, which causes the same kind of pollution, would probably help as well).
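For intuition on why cache pollution hurts, here is the footprint math for BLIS-style GEMM blocking, with illustrative tile sizes (not Laser's actual parameters): the micro-kernel's working set is sized to nearly fill L1/L2, so any runtime data evicted into those caches forces the kernel to reload its panels.

```python
# Illustrative BLIS-style blocking footprints (hypothetical tile sizes).
fp32 = 4
mr, nr = 6, 16      # micro-kernel register tile (hypothetical)
kc, mc = 256, 128   # blocking factors along K and M (hypothetical)

l1_panel = kc * nr * fp32   # B micro-panel streamed through L1
l2_block = mc * kc * fp32   # A block kept resident in L2

print(l1_panel)   # 16384 bytes: half of a 32 KiB L1d, little slack left
print(l2_block)   # 131072 bytes of L2 that the runtime must not evict
```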

mratsim commented 4 years ago

See #69

Laser can now reach 2.7 TFlops with GCC under the static, guided and dynamic schedules. Clang is at 2.1~2.2 TFlops.

A nimsuggest process was probably hogging my CPU

Backend:                        Laser (Pure Nim) + OpenMP
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Laser production implementation
Collected 300 samples in 1961 ms
Average time: 5.280 ms
Stddev  time: 2.303 ms
Min     time: 5.000 ms
Max     time: 36.000 ms
Perf:         2681.018 GFLOP/s
mratsim commented 4 years ago

Some consolation: Intel MKL built with TBB misses the 1 TFlops mark as well:

Backend:                        Intel MKL
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Intel MKL benchmark
Collected 300 samples in 4576 ms
Average time: 14.217 ms
Stddev  time: 3.396 ms
Min     time: 8.000 ms
Max     time: 46.000 ms
Perf:         995.717 GFLOP/s

Change mkl_gemm.nim.cfg to:

clibdir:"/opt/intel/mkl/lib/intel64"
passl:"/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.a"
passl:"-lmkl_core"
passl:"-lmkl_tbb_thread"
passl:"-ltbb"
dynlibOverride:"mkl_intel_lp64"

but it reaches 3.2 TFlops with Intel OpenMP

Backend:                        Intel MKL
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Intel MKL benchmark
Collected 300 samples in 1698 ms
Average time: 4.420 ms
Stddev  time: 2.937 ms
Min     time: 4.000 ms
Max     time: 53.000 ms
Perf:         3202.664 GFLOP/s
mratsim commented 4 years ago

Main issue fixed in 36507ec by exposing more parallelism: parallelizing 3 nested loops instead of just 2. Perf is now 1.8 to 2 TFlops on my machine, which is much more reasonable:

Important: Just adding loadBalance(Weave) instead of parallelizing the inner loop is not enough.
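The intuition behind the fix can be sketched with a quick task-count estimate (illustrative tile size, not the actual kernel parameters): with coarse tiles, one or two parallelized loops may simply not generate enough tasks to keep 36 threads busy.

```python
# Task counts when parallelizing 1, 2 or 3 nested loops over a 1920^3 GEMM
# with an illustrative tile size of 256.
import math

M = N = K = 1920
tile = 256
workers = 36

chunks_1d = math.ceil(M / tile)                 # 8 tasks: most workers starve
chunks_2d = chunks_1d * math.ceil(N / tile)     # 64 tasks
chunks_3d = chunks_2d * math.ceil(K / tile)     # 512 tasks: plenty to steal

print(chunks_1d, chunks_2d, chunks_3d, "for", workers, "workers")
```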

Stats

Before

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 300 samples in 9341 ms
Average time: 30.083 ms
Stddev  time: 7.052 ms
Min     time: 20.000 ms
Max     time: 75.000 ms
Perf:         470.552 GFLOP/s
Worker | Exec task | Send/recv steal requests | Send/recv tasks | Enqueue/dequeue tasks | Idle | Total
Timer 31 67162.607 584945.912 5808.580 378.542 31300.682 689596.323
Timer 6 151208.192 384442.585 25177.179 551.339 132284.510 693663.805
Timer 3 118192.217 462848.264 18416.713 694.853 96261.720 696413.767
Timer 33 84428.756 568342.303 5946.973 516.956 31725.726 690960.713
Timer 29 194235.677 463822.996 6026.209 891.891 32493.478 697470.251
Timer 12 147331.527 426970.998 18147.269 590.404 94001.973 687042.171
Timer 26 206079.226 451966.946 6387.025 1081.511 34710.491 700225.199
Timer 13 119520.727 442308.723 19510.054 450.167 103730.795 685520.465
Timer 25 197890.720 455604.535 6888.388 979.174 36978.200 698341.016
Timer 27 195427.110 462994.971 6229.360 978.685 34077.716 699707.842
Timer 4 100525.443 457112.698 21240.117 567.968 110874.055 690320.281
Timer 30 195968.538 460742.005 6234.480 827.435 33412.842 697185.300
Timer 7 84923.680 506766.125 15309.847 280.326 81109.847 688389.825
Timer 24 182246.102 468302.946 6579.388 570.952 35535.514 693234.901
Timer 5 159109.065 387777.447 23667.322 599.713 124177.060 695330.607
Timer 32 60194.238 589304.610 6176.448 275.104 32228.432 688178.832
Timer 19 122937.480 518342.104 7731.387 408.553 41672.433 691091.958
Timer 14 117765.172 442969.256 19849.724 443.428 104462.701 685490.281
Timer 20 108203.168 529161.627 8009.311 281.924 43320.310 688976.340
Timer 10 81186.363 499063.640 16709.611 279.777 86853.296 684092.688
Timer 22 126609.674 516353.114 7581.768 571.655 41083.725 692199.936
Timer 34 80312.832 576296.029 5252.064 415.060 27558.386 689834.370
Timer 35 88949.289 563356.530 5890.700 545.931 33161.699 691904.150
Timer 23 189875.043 465690.975 6054.358 622.547 32874.200 695117.124
Timer 8 78706.643 510926.490 15427.789 291.401 79937.033 685289.356
Timer 11 141243.889 443054.503 16606.129 564.815 85750.634 687219.970
Timer 9 84888.088 500973.331 15900.617 313.543 82549.751 684625.330
Timer 18 95157.213 541357.110 7966.062 165.602 42892.018 687538.005
Timer 1 177358.475 402236.309 19697.230 1334.880 103265.045 703891.938
Timer 15 62995.317 558259.037 10802.165 515.058 54284.070 686855.647
Timer 2 228236.442 313500.180 24887.463 933.521 131643.930 699201.537
Timer 0 112908.708 371233.642 41056.940 207.885 219580.684 744987.859
Timer 21 122821.680 520134.103 7515.099 527.000 40609.517 691607.399
Timer 16 67990.787 549372.248 11306.038 418.100 57625.057 686712.230
Timer 28 190540.991 465020.506 6446.193 940.650 34632.296 697580.635
Timer 17 87072.270 546874.811 9151.320 517.453 46577.786 690193.639

After

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 300 samples in 2606 ms
Average time: 7.730 ms
Stddev  time: 4.088 ms
Min     time: 6.000 ms
Max     time: 49.000 ms
Perf:         1831.278 GFLOP/s
Worker | Exec task | Send/recv steal requests | Send/recv tasks | Enqueue/dequeue tasks | Idle | Total
Timer 14 109369.554 54669.369 4734.948 504.536 24037.765 193316.173
Timer 1 114960.762 48428.423 5055.483 908.356 24857.134 194210.158
Timer 11 110646.820 54330.393 4764.520 477.942 23372.801 193592.475
Timer 12 109773.613 54253.606 4813.124 524.362 23993.648 193358.353
Timer 28 105961.124 62151.888 4049.457 385.197 20934.560 193482.226
Timer 27 106384.220 62214.171 4098.170 430.601 20985.437 194112.600
Timer 29 106537.063 59922.506 4316.233 383.995 22539.094 193698.891
Timer 9 103471.652 61155.561 4743.777 394.963 23428.874 193194.827
Timer 24 107585.847 59358.275 4322.620 395.578 22084.356 193746.676
Timer 3 109124.665 54916.677 4961.589 563.696 24541.015 194107.642
Timer 15 98598.742 68661.446 4415.255 354.507 21791.132 193821.082
Timer 6 117904.793 46353.822 4907.058 699.744 24008.840 193874.256
Timer 10 102731.402 61727.425 4765.494 393.112 23503.577 193121.010
Timer 21 101140.485 66311.005 4125.495 307.755 21271.194 193155.934
Timer 30 106786.435 59814.345 4322.485 416.065 22394.376 193733.705
Timer 18 100103.064 69202.506 3913.947 309.698 20024.074 193553.289
Timer 34 95681.374 74938.894 3720.593 244.018 19519.236 194104.115
Timer 33 95485.940 74077.942 3730.041 258.876 19686.630 193239.429
Timer 0 106869.264 33199.361 6972.569 499.693 35513.367 183054.254
Timer 4 109051.911 54881.134 4907.741 558.008 24225.744 193624.538
Timer 19 101141.742 66311.351 4248.495 307.546 21939.490 193948.624
Timer 26 106838.971 60474.659 4231.349 438.965 21670.470 193654.415
Timer 17 100023.696 68273.456 4243.542 368.127 20953.964 193862.785
Timer 25 106151.732 59627.085 4462.027 402.793 22925.111 193568.747
Timer 16 98058.453 69144.517 4392.969 375.358 21656.227 193627.525
Timer 35 97473.807 73575.877 3740.237 257.905 19450.109 194497.935
Timer 2 128860.410 38191.806 4546.105 1424.634 21788.005 194810.960
Timer 31 95783.863 74022.193 3570.980 245.846 20479.436 194102.319
Timer 7 102732.876 61364.221 4931.404 418.762 24432.209 193879.473
Timer 5 118251.584 46295.838 4809.459 783.144 23747.035 193887.060
Timer 20 100742.140 66050.523 4239.576 299.596 22101.823 193433.658
Timer 8 102344.588 62243.333 4782.607 423.864 23509.571 193303.964
Timer 23 107622.646 59494.391 4287.997 401.905 22084.048 193890.986
Timer 13 109745.142 56445.965 4528.521 536.654 22222.765 193479.046
Timer 32 96283.180 74841.103 3635.348 260.648 18624.449 193644.728
Timer 22 101762.061 67560.103 3919.838 309.118 20105.694 193656.813
mratsim commented 4 years ago

Very interesting and thorough analysis of TBB overhead vs static scheduling: http://www.cs.yale.edu/homes/abhishek/abhishek-taco11.pdf

mratsim commented 4 years ago

After #74 and a change to work-stealing victim selection, we can now reach 2.3 TFlops. Note: the change benefits coarse-grained parallelism but hurts fine-grained parallelism performance, though Weave is already excellent on that front.

Interesting readings:

image

mratsim commented 4 years ago

~Experiments with WV_MaxConcurrentStealPerWorker and WV_StealEarly show that the current defaults offer the best performance. Changing both at the same time can even lead to a livelock with task ping-pong.~ Edit: StealEarly was not implemented; fixed in 17885dd

The idea was that workers would get a huge chunk of work upfront (similar to eager binary splitting or OpenMP scheduling) and wouldn't have to find more later.

With it implemented, the following configuration is stable at 2.28~2.30 TFlops

mratsim commented 4 years ago

Some ideas:

Rejected

  1. Using private and public deques, similar to Wool and Lace (https://research.utwente.nl/files/5346297/lace.pdf / https://github.com/trolando/lace), as described in Dinan's paper Scalable Work-Stealing: https://www.researchgate.net/publication/220781636_Scalable_work_stealing

This would give up the main benefit of Weave: being purely channel-based, which makes it easier to extend to distributed computing, easier to formally verify, and more future-hardware-proof, assuming we will see more cluster-on-chip designs where hardware message-passing primitives are available but memory coherency is too expensive to maintain, i.e. very costly atomics (for example Mellanox/Tilera 100+ ARM cores, Intel SCC, Adapteva's 1024-core chips).

  2. Interrupts.

Interrupts might be tricky when embedding in other programs, say Python. They would also flush the cache at an inopportune time (when a core is running carefully tuned AVX-512 instructions to use 100% of the CPU cycles).

Ideas

Similar to latency-hiding work-stealing (http://reports-archive.adm.cs.cmu.edu/anon/2016/CMU-CS-16-112R.pdf), we might want another private queue for statically scheduled tasks created by a static parallelFor.

Tasks sent to that queue could not be stolen and there would be no load balancing. A "schedule: static" field could be added to the parallelFor construct.

This would allow:

This can be extended to:

Both would help with IO tasks (#22): long-running event loops would live on the static queue and could spawn multithreaded computation on the work-stealing queue. Properly designing this for composition would avoid hardcoding an event loop in Weave, meaning it could be extended to anything that requires a long-running polling thread.
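A minimal sketch of the two-queue idea (hypothetical names and API, not Weave's): each worker owns a stealable deque for normal tasks plus a private queue for pinned/statically scheduled tasks that thieves never see.

```python
from collections import deque

class Worker:
    def __init__(self):
        self.stealable = deque()   # thieves may take from the far end
        self.private = deque()     # pinned tasks: no load balancing

    def schedule(self, task, pinned=False):
        (self.private if pinned else self.stealable).append(task)

    def next_task(self):
        # Pinned work first keeps long-running tasks (e.g. an event loop
        # servicing IO) on their designated thread.
        if self.private:
            return self.private.popleft()
        if self.stealable:
            return self.stealable.pop()
        return None   # would send a steal request in the real runtime

    def steal_from(self, victim):
        # Thieves only ever see the stealable deque.
        if victim.stealable:
            self.stealable.append(victim.stealable.popleft())

w, thief = Worker(), Worker()
w.schedule("event-loop", pinned=True)
w.schedule("compute")
thief.steal_from(w)
print(w.next_task())      # event-loop
print(thief.next_task())  # compute
```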

Interesting logs

When scheduling the following nested loops, trying to see whether Worker 0 distributes the i loop first (outer loop, more work sent per i iteration) or the j loop first (inner loop, less work sent per j iteration):

import
  ../weave,
  ../weave/instrumentation/loggers,
  ../weave/contexts

init(Weave)

parallelForStrided i in 0 ..< 2000, stride = 256:
  parallelForStrided j in 0 ..< 2000, stride = 128:
    captures: {i}
    log("Matrix[%d, %d] (thread %d)\n", i, j, myID())

exit(Weave)

Logs are:

Worker 14: sending own steal request to  5 (Channel 0xf4a70f90)
Worker 35: sending own steal request to 31 (Channel 0xf4a717b0)
Worker  1: sending own steal request to 35 (Channel 0xf4a718f0)
Worker  0: scheduling task.fn 0xf3aff4d4
Worker 29: sending own steal request to 20 (Channel 0xf4a71440)
Worker 16: sending own steal request to  4 (Channel 0xf4a70f40)
Worker 23: sending own steal request to 16 (Channel 0xf4a71300)
Worker  2: sending own steal request to  7 (Channel 0xf4a71030)
Worker 20: sending own steal request to 12 (Channel 0xf4a711c0)
Worker 31: sending own steal request to  2 (Channel 0xf4a70ea0)
Worker 30: sending own steal request to  9 (Channel 0xf4a710d0)
Worker 34: sending own steal request to 35 (Channel 0xf4a718f0)
Worker 32: sending own steal request to 25 (Channel 0xf4a715d0)
Worker 17: sending own steal request to 12 (Channel 0xf4a711c0)
Worker 26: sending own steal request to  9 (Channel 0xf4a710d0)
Worker  8: sending own steal request to 14 (Channel 0xf4a71260)
Worker 15: sending own steal request to  8 (Channel 0xf4a71080)
Worker 33: sending own steal request to 10 (Channel 0xf4a71120)
Worker 28: sending own steal request to 22 (Channel 0xf4a714e0)
Worker 10: sending own steal request to 23 (Channel 0xf4a71530)
Worker  7: sending own steal request to 25 (Channel 0xf4a715d0)
Worker 13: sending own steal request to 35 (Channel 0xf4a718f0)
Worker 25: sending own steal request to 29 (Channel 0xf4a71710)
Worker  9: sending own steal request to  7 (Channel 0xf4a71030)
Worker 27: sending own steal request to 35 (Channel 0xf4a718f0)
Worker  6: sending own steal request to 25 (Channel 0xf4a715d0)
Worker 24: sending own steal request to 28 (Channel 0xf4a716c0)
Worker 19: sending own steal request to 24 (Channel 0xf4a71580)
Worker  4: sending own steal request to 11 (Channel 0xf4a71170)
Worker 22: sending own steal request to  6 (Channel 0xf4a70fe0)
Worker  3: sending own steal request to 15 (Channel 0xf4a712b0)
Worker 21: sending own steal request to 16 (Channel 0xf4a71300)
Worker 11: sending own steal request to 20 (Channel 0xf4a71440)
Worker  0: preparing 1 task(s) for worker 24 with function address 0xf3aff4d4
Worker  0: sending 1 tasks (task.fn 0xf3aff4d4) to Worker 24
Worker 24: received a task with function address 0xf3aff4d4 (Channel 0x90000b60)
>>> Worker  0 enters barrier <<<
Worker  0: globalsync 1 - task from local deque
Worker 24: running task.fn 0xf3aff4d4
Worker  0: globalsync 2 - becoming a thief
Worker  0: sending own steal request to  2 (Channel 0xf4a70ea0)
Worker 12: sending own steal request to  0 (Channel 0xf4a70e00)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 22 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 22
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 23 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 23
Worker 23: received a task with function address 0xf3afede7 (Channel 0x98000b60)
Worker 23: running task.fn 0xf3afede7
Matrix[256, 0] (thread 23)
Matrix[256, 128] (thread 23)
Matrix[256, 256] (thread 23)
Matrix[256, 384] (thread 23)
Matrix[256, 512] (thread 23)
Matrix[256, 640] (thread 23)
Matrix[256, 768] (thread 23)
Matrix[256, 896] (thread 23)
Matrix[256, 1024] (thread 23)
Matrix[256, 1152] (thread 23)
Matrix[256, 1280] (thread 23)
Matrix[256, 1408] (thread 23)
Matrix[256, 1536] (thread 23)
Matrix[256, 1664] (thread 23)
Matrix[256, 1792] (thread 23)
Matrix[256, 1920] (thread 23)
Worker 24: preparing 1 task(s) for worker  0 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker  0
Worker  0: received a task with function address 0xf3afede7 (Channel 0xf4a74dd0)
Worker  0: globalsync 3 - stoled tasks
Worker  0: globalsync 4 - sharing work
Worker  0: globalsync 5 - working on leftover
Worker  0: running task.fn 0xf3afede7
Matrix[512, 0] (thread 0)
Matrix[512, 128] (thread 0)
Matrix[512, 256] (thread 0)
Matrix[512, 384] (thread 0)
Matrix[512, 512] (thread 0)
Matrix[512, 640] (thread 0)
Matrix[512, 768] (thread 0)
Matrix[512, 896] (thread 0)
Matrix[512, 1024] (thread 0)
Matrix[512, 1152] (thread 0)
Matrix[512, 1280] (thread 0)
Matrix[512, 1408] (thread 0)
Matrix[512, 1536] (thread 0)
Matrix[512, 1664] (thread 0)
Matrix[512, 1792] (thread 0)
Matrix[512, 1920] (thread 0)
Worker  0: globalsync 2 - becoming a thief
Worker  0: sending own steal request to 24 (Channel 0xf4a71580)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker  0 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker  0
Worker  0: received a task with function address 0xf3afede7 (Channel 0xf4a74dd0)
Worker  0: globalsync 3 - stoled tasks
Worker  0: globalsync 4 - sharing work
Worker  0: globalsync 5 - working on leftover
Worker  0: running task.fn 0xf3afede7
Matrix[768, 0] (thread 0)
Matrix[768, 128] (thread 0)
Matrix[768, 256] (thread 0)
Matrix[768, 384] (thread 0)
Matrix[768, 512] (thread 0)
Matrix[768, 640] (thread 0)
Matrix[768, 768] (thread 0)
Matrix[768, 896] (thread 0)
Matrix[768, 1024] (thread 0)
Matrix[768, 1152] (thread 0)
Matrix[768, 1280] (thread 0)
Matrix[768, 1408] (thread 0)
Matrix[768, 1536] (thread 0)
Matrix[768, 1664] (thread 0)
Matrix[768, 1792] (thread 0)
Matrix[768, 1920] (thread 0)
Worker  0: globalsync 2 - becoming a thief
Worker  0: sending own steal request to 24 (Channel 0xf4a71580)
Worker 23: sending own steal request to 24 (Channel 0xf4a71580)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker  0 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker  0
Worker  0: received a task with function address 0xf3afede7 (Channel 0xf4a74dd0)
Worker  0: globalsync 3 - stoled tasks
Worker  0: globalsync 4 - sharing work
Worker  0: globalsync 5 - working on leftover
Worker  0: running task.fn 0xf3afede7
Matrix[1024, 0] (thread 0)
Matrix[1024, 128] (thread 0)
Matrix[1024, 256] (thread 0)
Matrix[1024, 384] (thread 0)
Worker 22: received a task with function address 0xf3afede7 (Channel 0xac000b60)
Worker 22: running task.fn 0xf3afede7
Matrix[1024, 512] (thread 0)
Matrix[1024, 640] (thread 0)
Matrix[1024, 768] (thread 0)
Matrix[1024, 896] (thread 0)
Matrix[1024, 1024] (thread 0)
Matrix[1024, 1152] (thread 0)
Matrix[1024, 1280] (thread 0)
Matrix[1024, 1408] (thread 0)
Matrix[1024, 1536] (thread 0)
Matrix[1024, 1664] (thread 0)
Matrix[1024, 1792] (thread 0)
Matrix[1024, 1920] (thread 0)
Worker  0: globalsync 2 - becoming a thief
Worker  0: sending own steal request to 24 (Channel 0xf4a71580)
Matrix[0, 0] (thread 22)
Matrix[0, 128] (thread 22)
Matrix[0, 256] (thread 22)
Matrix[0, 384] (thread 22)
Matrix[0, 512] (thread 22)
Matrix[0, 640] (thread 22)
Matrix[0, 768] (thread 22)
Matrix[0, 896] (thread 22)
Matrix[0, 1024] (thread 22)
Matrix[0, 1152] (thread 22)
Matrix[0, 1280] (thread 22)
Matrix[0, 1408] (thread 22)
Matrix[0, 1536] (thread 22)
Matrix[0, 1664] (thread 22)
Matrix[0, 1792] (thread 22)
Matrix[0, 1920] (thread 22)
Worker 24: preparing 1 task(s) for worker 23 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 23
Worker 23: received a task with function address 0xf3afede7 (Channel 0x98000b60)
Worker 23: running task.fn 0xf3afede7
Matrix[1280, 0] (thread 23)
Matrix[1280, 128] (thread 23)
Matrix[1280, 256] (thread 23)
Worker 22: sending own steal request to 24 (Channel 0xf4a71580)
Matrix[1280, 384] (thread 23)
Matrix[1280, 512] (thread 23)
Matrix[1280, 640] (thread 23)
Matrix[1280, 768] (thread 23)
Matrix[1280, 896] (thread 23)
Matrix[1280, 1024] (thread 23)
Matrix[1280, 1152] (thread 23)
Matrix[1280, 1280] (thread 23)
Matrix[1280, 1408] (thread 23)
Matrix[1280, 1536] (thread 23)
Matrix[1280, 1664] (thread 23)
Matrix[1280, 1792] (thread 23)
Matrix[1280, 1920] (thread 23)
Worker 23: sending own steal request to 24 (Channel 0xf4a71580)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 23 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 23
Worker 23: received a task with function address 0xf3afede7 (Channel 0x98000b60)
Worker 23: running task.fn 0xf3afede7
Matrix[1536, 0] (thread 23)
Matrix[1536, 128] (thread 23)
Matrix[1536, 256] (thread 23)
Matrix[1536, 384] (thread 23)
Matrix[1536, 512] (thread 23)
Matrix[1536, 640] (thread 23)
Matrix[1536, 768] (thread 23)
Matrix[1536, 896] (thread 23)
Matrix[1536, 1024] (thread 23)
Matrix[1536, 1152] (thread 23)
Matrix[1536, 1280] (thread 23)
Matrix[1536, 1408] (thread 23)
Matrix[1536, 1536] (thread 23)
Matrix[1536, 1664] (thread 23)
Matrix[1536, 1792] (thread 23)
Matrix[1536, 1920] (thread 23)
Worker 23: sending own steal request to 24 (Channel 0xf4a71580)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 23 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 23
Worker 23: received a task with function address 0xf3afede7 (Channel 0x98000b60)
Worker 23: running task.fn 0xf3afede7
Matrix[1792, 0] (thread 23)
Matrix[1792, 128] (thread 23)
Matrix[1792, 256] (thread 23)
Matrix[1792, 384] (thread 23)
Matrix[1792, 512] (thread 23)
Matrix[1792, 640] (thread 23)
Matrix[1792, 768] (thread 23)
Matrix[1792, 896] (thread 23)
Matrix[1792, 1024] (thread 23)
Matrix[1792, 1152] (thread 23)
Matrix[1792, 1280] (thread 23)
Matrix[1792, 1408] (thread 23)
Matrix[1792, 1536] (thread 23)
Matrix[1792, 1664] (thread 23)
Matrix[1792, 1792] (thread 23)
Matrix[1792, 1920] (thread 23)

The first thing we see in the logs is that, when scheduling, the whole task is sent even if it is splittable: https://github.com/mratsim/weave/blob/2cebb3d2a51c26c186672246d6e3da2a48bc4438/weave/scheduler.nim#L266-L289

The second thing we see is that somehow there was no task splitting at all, i.e. the following log entries are missing: https://github.com/mratsim/weave/blob/2cebb3d2a51c26c186672246d6e3da2a48bc4438/weave/victims.nim#L226-L245

mratsim commented 4 years ago

Further analysis of the logs shows a promising way to close the gap with OpenMP on GEMM (and hopefully on coarse-grained non-nested parallel loops as well) with a minimal change, one that does not undermine the potential value of pinned tasks.

Worker 16: sending own steal request to  4 (Channel 0x1333d200)
Worker 14: sending own steal request to  5 (Channel 0x1333d300)
Worker 18: sending own steal request to 32 (Channel 0x1333ee00)
Worker 30: sending own steal request to  9 (Channel 0x1333d700)
Worker  9: sending own steal request to  7 (Channel 0x1333d500)
Worker 20: sending own steal request to 12 (Channel 0x1333da00)
Worker 11: sending own steal request to 20 (Channel 0x1333e200)
Worker 31: sending own steal request to  2 (Channel 0x1333d000)
Worker 32: sending own steal request to 25 (Channel 0x1333e700)
Worker 13: sending own steal request to 35 (Channel 0x1333f100)
Worker  0: scheduling task.fn 0x121e1159 (0 pending)
>>> Worker  0 enters barrier <<<
Worker  0: globalsync 1 - task from local deque
Worker  0: running task.fn 0x121e1159
Worker  0: scheduling task.fn 0x121e0a6c (0 pending)
Worker  0: scheduling task.fn 0x121e0a6c (1 pending)
Worker  0: scheduling task.fn 0x121e0a6c (2 pending)
Worker  1: sending own steal request to 35 (Channel 0x1333f100)
Worker 22: sending own steal request to  6 (Channel 0x1333d400)
Worker 19: sending own steal request to 24 (Channel 0x1333e600)
Worker  8: sending own steal request to 14 (Channel 0x1333dc00)
Worker 28: sending own steal request to 22 (Channel 0x1333e400)
Worker 10: sending own steal request to 23 (Channel 0x1333e500)
Worker 12: sending own steal request to  0 (Channel 0x1333ce00)
Worker 35: sending own steal request to 31 (Channel 0x1333ed00)
Worker 17: sending own steal request to 12 (Channel 0x1333da00)
Worker 27: sending own steal request to 35 (Channel 0x1333f100)
Worker 24: sending own steal request to 28 (Channel 0x1333ea00)
Worker  5: sending own steal request to 22 (Channel 0x1333e400)
Worker 23: sending own steal request to 16 (Channel 0x1333de00)
Worker  6: sending own steal request to 25 (Channel 0x1333e700)
Worker 25: sending own steal request to 29 (Channel 0x1333eb00)
Worker  7: sending own steal request to 25 (Channel 0x1333e700)
Worker  3: sending own steal request to 15 (Channel 0x1333dd00)
Worker 21: sending own steal request to 16 (Channel 0x1333de00)
Worker 15: sending own steal request to  8 (Channel 0x1333d600)
Worker 33: sending own steal request to 10 (Channel 0x1333d800)
Worker  2: sending own steal request to  7 (Channel 0x1333d500)
Worker  4: sending own steal request to 11 (Channel 0x1333d900)
Worker 29: sending own steal request to 20 (Channel 0x1333e200)
Worker  0: preparing 1 task(s) for worker 12 with function address 0x121e0a6c
Worker  0: sending 1 tasks (task.fn 0x121e0a6c) to Worker 12
Worker  0: preparing 1 task(s) for worker  5 with function address 0x121e0a6c
Worker 26: sending own steal request to  9 (Channel 0x1333d700)
Worker 12: received a task with function address 0x121e0a6c (Channel 0x34000b60)
Worker 12: running task.fn 0x121e0a6c
Matrix[0, 0] (thread 12)
Matrix[0, 128] (thread 12)
Matrix[0, 256] (thread 12)
Matrix[0, 384] (thread 12)
Matrix[0, 512] (thread 12)
Matrix[0, 640] (thread 12)
Matrix[0, 768] (thread 12)
Matrix[0, 896] (thread 12)
Matrix[0, 1024] (thread 12)
Matrix[0, 1152] (thread 12)
Matrix[0, 1280] (thread 12)
Matrix[0, 1408] (thread 12)
Matrix[0, 1536] (thread 12)
Matrix[0, 1664] (thread 12)
Matrix[0, 1792] (thread 12)
Matrix[0, 1920] (thread 12)
Worker 12: sending own steal request to  0 (Channel 0x1333ce00)
Worker 34: sending own steal request to 35 (Channel 0x1333f100)
Worker  0: sending 1 tasks (task.fn 0x121e0a6c) to Worker  5
Worker  0: preparing 1 task(s) for worker 20 with function address 0x121e0a6c
Worker  0: sending 1 tasks (task.fn 0x121e0a6c) to Worker 20
Worker  5: received a task with function address 0x121e0a6c (Channel 0x54000b60)
Worker  5: running task.fn 0x121e0a6c
Matrix[256, 0] (thread 5)
...

Worker 0 starts working on the outer loop (0x121e1159) and generates multiple inner-loop tasks (0x121e0a6c). When it receives a steal request, it distributes those first. Hence for nested data parallelism we first distribute work from the inner loop, similar to a Parallel Depth-First scheduler. Unfortunately, with a message-passing based approach, that means worker threads have to wait on the producer thread more often, increasing scheduler overhead. If we distributed the outer loop first, even with tasks pending, more threads would be able to produce significant amounts of work, and thieves would fail to find work less often.
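A toy model of this effect (not Weave's actual scheduler): if the root worker hands out inner-loop leaf tasks, it remains the only producer and thieves serialize on it; if it hands out outer-loop iterations, every thief that receives one becomes a producer itself, so the number of work sources can grow geometrically.

```python
# Toy producer-count model: leaf tasks spawn nothing, outer-loop tasks
# turn their recipient into a new producer (best case: one per round).
def producers_after(rounds, distribute_outer):
    producers = 1                 # worker 0
    for _ in range(rounds):
        if distribute_outer:
            producers *= 2        # each producer seeds one new producer
        # inner-first: stolen leaves spawn nothing, producers stays 1
    return producers

print(producers_after(5, distribute_outer=False))  # 1
print(producers_after(5, distribute_outer=True))   # 32
```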

However, for non-nested parallelism where work comes as a single loop, as in the pathological histogram and logsumexp benchmarks (#35), this changes nothing unless we also change the initial task enqueue for parallel loops, which is currently winner-takes-all (i.e. there is no splitting).

mratsim commented 4 years ago

Another interesting log. When a child backs off, the parent will wake it and send it work when it finds some (i.e. a lifeline, see Saraswat, Lifeline-based Global Load Balancing, http://www.cs.columbia.edu/~martha/courses/4130/au12/p201-saraswat.pdf)

This is what happens in this log, the worker even tries to split tasks for its grandchildren but ultimately only send work to its direct child.

Worker  9: has 1 steal requests
Worker  9: found 6 steal requests addressed to its child 19 and grandchildren
Worker  9: 14 steps left (start: 0, current: 256, stop: 2000, stride: 128, 7 thieves)
Worker  9: Sending [1792, 2000) to worker 19
Worker  9: sending 1 tasks (task.fn 0x121e0a6c) to Worker 19
Worker  9: Continuing with [256, 1792)
Worker  9: waking up child 19
Matrix[1024, 128] (thread 9)
Matrix[1024, 256] (thread 9)
Matrix[1024, 384] (thread 9)
Matrix[1024, 512] (thread 9)
Matrix[1024, 640] (thread 9)
Matrix[1024, 768] (thread 9)
Matrix[1024, 896] (thread 9)
Matrix[1024, 1024] (thread 9)
Matrix[1024, 1152] (thread 9)
Matrix[1024, 1280] (thread 9)
Matrix[1024, 1408] (thread 9)
Matrix[1024, 1536] (thread 9)
Matrix[1024, 1664] (thread 9)
Worker 19: received a task with function address 0x121e0a6c (Channel 0x1c000b60)
Worker 19: running task.fn 0x121e0a6c
Matrix[1024, 1792] (thread 19)
Worker  9: sending own steal request to  0 (Channel 0x1333ce00)
Worker 15: sends state passively WAITING to its parent worker 7
Matrix[1024, 1920] (thread 19)
Worker 19: sending own steal request to  9 (Channel 0x1333d700)
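The split decision in the log above can be reproduced with a small model. This is a simplified reconstruction from the log output, not Weave's actual implementation: the victim gives the thief an even share of the remaining steps, taken from the end of the range. With 14 steps left and 7 thieves, that is ceil(14 / (7+1)) = 2 steps, i.e. the range [1792, 2000), keeping [256, 1792).

```python
import math

def adaptive_split(current, stop, stride, thieves):
    """Simplified model of an adaptive loop split (a reconstruction
    from the debug log, not Weave's actual code): give the thief an
    even share of the remaining steps, taken from the end of the range."""
    steps_left = math.ceil((stop - current) / stride)
    give = math.ceil(steps_left / (thieves + 1))  # share per participant
    split_point = current + (steps_left - give) * stride
    return (current, split_point), (split_point, stop)  # (kept, sent)

kept, sent = adaptive_split(current=256, stop=2000, stride=128, thieves=7)
print(kept, sent)  # → (256, 1792) (1792, 2000), matching the log
```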
mratsim commented 4 years ago

Very in-depth analysis of MKL, MKL-DNN, Eigen and parallelism and threadpool in the context of deep-learning workload: https://arxiv.org/pdf/1908.04705.pdf

mratsim commented 4 years ago

After #83, Weave achieves 2.53 TFlops.

Some figures to keep a realistic performance target in mind:

Perfect multithreaded scaling on 18 cores would be 2.88 TFlops; reaching over 3 TFlops would require significantly improving the single-threaded GEMM kernel, which is out of scope for Weave and should be tackled in Laser. Perfect scaling is probably not achievable because hyperthreading thrashes the cache and TLB of sibling threads, but matching Laser's OpenMP performance of 2.7 TFlops is a nice target.

The theoretical maximum is computed by multiplying:

| Component | Multiplier | Cumulated | Comment |
| --- | --- | --- | --- |
| CPU frequency | 3.5 billion | 3.5 GFlops | Overclocked i9-9980XE with 3.5 GHz AVX512 all-core turbo |
| FMA (fused multiply-add) per cycle | 2 | 7.0 GFlops | i9 and Xeon Gold/Platinum have 2x AVX512 FMA units per core, while Xeon Bronze and Silver only have 1x |
| Instructions per FMA | 2 | 14.0 GFlops | 1 FMA is 1 add + 1 mul |
| Vector width | 16 | 224.0 GFlops | AVX512 SIMD processes 16 float32 values at a time |
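The cumulated column can be checked with a quick computation; multiplying the per-core peak by the i9-9980XE's 18 cores gives the 4032 GFLOP/s multi-core peak reported in the benchmark headers:

```python
freq_ghz = 3.5        # AVX512 all-core turbo frequency
fma_units = 2         # 2x AVX512 FMA units per core (i9 / Xeon Gold+)
flops_per_fma = 2     # 1 FMA = 1 add + 1 mul
vector_width = 16     # float32 lanes per AVX512 vector
cores = 18            # i9-9980XE core count

peak_single = freq_ghz * fma_units * flops_per_fma * vector_width  # GFLOP/s
peak_multi = peak_single * cores

print(peak_single, peak_multi)  # → 224.0 4032.0
```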
mratsim commented 4 years ago

It seems that without the Backoff mechanism, we actually do reach the 2.7 TFlops:

$  ./build/weave_master_no_backoff 

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 300 samples in 2002 ms
Average time: 5.277 ms
Stddev  time: 0.954 ms
Min     time: 5.000 ms
Max     time: 15.000 ms
Perf:         2682.712 GFLOP/s
mratsim commented 4 years ago

Closed by #94

We can reach over 2.8 TFlops from a 160~170 GFlops single-threaded baseline (17.5x speedup), while MKL reaches 3.4 TFlops from a 210 GFlops baseline (16.2x), OpenBLAS 2.8 TFlops from a 193 GFlops baseline (14.5x), and MKL-DNN 3.0 TFlops from a 200 GFlops baseline (15x speedup).
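The speedup figures are simply multithreaded throughput divided by the single-threaded baseline (using 160 GFlops for Weave's baseline):

```python
results = {
    # backend: (multithreaded GFLOP/s, single-threaded baseline GFLOP/s)
    "Weave":    (2800, 160),
    "MKL":      (3400, 210),
    "OpenBLAS": (2800, 193),
    "MKL-DNN":  (3000, 200),
}

for name, (multi, single) in results.items():
    print(f"{name}: {multi / single:.1f}x speedup")
```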

So the issue on the multithreading front and the performance gap versus Laser/OpenMP are solved, and even with the slower kernel, performance is very good on high-core-count machines.

Now there is still an issue with nesting. The new code still uses syncRoot, which prevents nesting in another parallel region (like a batched matmul). A simple nestable kernel would await/sync both the outer and inner loops, but it seems that this leads to load imbalance, as the performance reached is at most 1.8 TFlops.

Maybe the kernel could be redesigned to take full advantage of the dataflow approach, but that is out of scope for this issue. https://github.com/mratsim/weave/blob/85b90b3f95c75615629aeea34254c47c0de643f8/benchmarks/matmul_gemm_blas/gemm_pure_nim/gemm_weave.nim#L154-L188
