See #69
Laser can now reach 2.7 TFlops with GCC (static, guided, and dynamic schedules). Clang is at 2.1~2.2 TFlops.
A nimsuggest process was probably hogging my CPU
Backend: Laser (Pure Nim) + OpenMP
Type: float32
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 224.000 GFLOP/s
Theoretical peak multi: 4032.000 GFLOP/s
Laser production implementation
Collected 300 samples in 1961 ms
Average time: 5.280 ms
Stddev time: 2.303 ms
Min time: 5.000 ms
Max time: 36.000 ms
Perf: 2681.018 GFLOP/s
Some consolation: an Intel MKL build with TBB misses the 1 TFlops mark as well:
Backend: Intel MKL
Type: float32
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 224.000 GFLOP/s
Theoretical peak multi: 4032.000 GFLOP/s
Intel MKL benchmark
Collected 300 samples in 4576 ms
Average time: 14.217 ms
Stddev time: 3.396 ms
Min time: 8.000 ms
Max time: 46.000 ms
Perf: 995.717 GFLOP/s
Change mkl_gemm.nim.cfg to:
clibdir:"/opt/intel/mkl/lib/intel64"
passl:"/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.a"
passl:"-lmkl_core"
passl:"-lmkl_tbb_thread"
passl:"-ltbb"
dynlibOverride:"mkl_intel_lp64"
But it reaches 3.2 TFlops with Intel OpenMP:
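For the Intel OpenMP build, the TBB libraries are presumably swapped for Intel's OpenMP runtime in the link line. A plausible, untested variant following Intel's standard MKL link-line conventions (the mkl_intel_thread/iomp5 names are my assumption, not from the original config):

clibdir:"/opt/intel/mkl/lib/intel64"
passl:"/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.a"
passl:"-lmkl_core"
passl:"-lmkl_intel_thread"
passl:"-liomp5"
dynlibOverride:"mkl_intel_lp64"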
Backend: Intel MKL
Type: float32
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 224.000 GFLOP/s
Theoretical peak multi: 4032.000 GFLOP/s
Intel MKL benchmark
Collected 300 samples in 1698 ms
Average time: 4.420 ms
Stddev time: 2.937 ms
Min time: 4.000 ms
Max time: 53.000 ms
Perf: 3202.664 GFLOP/s
Main issue fixed in 36507ec by exposing more parallelism: parallelizing 3 nested loops instead of just 2. Perf is now 1.8 to 2 TFlops on my machine, which is much more reasonable:
Important: just adding loadBalance(Weave) instead of parallelizing the inner loop is not enough (see the sketch below).
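A toy sketch of the difference (my illustration, not the actual code from 36507ec; microKernel is a placeholder and the loop bounds are arbitrary):

import weave

proc microKernel(i, j: int) = discard # placeholder for the real AVX512 tile kernel

init(Weave)

# Not enough: a single parallel loop over i. loadBalance lets the worker
# answer pending steal requests, but only the outer iterations exist as tasks.
parallelFor i in 0 ..< 16:
  for j in 0 ..< 16:
    loadBalance(Weave)
    microKernel(i, j)

syncRoot(Weave)

# Better: nested parallel loops expose 16*16 splittable tasks.
parallelFor i in 0 ..< 16:
  parallelFor j in 0 ..< 16:
    captures: {i}
    microKernel(i, j)

exit(Weave)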
Stats
Backend: Weave (Pure Nim)
Type: float32
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 224.000 GFLOP/s
Theoretical peak multi: 4032.000 GFLOP/s
Weave implementation
Collected 300 samples in 9341 ms
Average time: 30.083 ms
Stddev time: 7.052 ms
Min time: 20.000 ms
Max time: 75.000 ms
Perf: 470.552 GFLOP/s
Timer | Worker | Exec task | Send/recv steal requests | Send/receive tasks | Enqueue/dequeue tasks | Idle | Total |
---|---|---|---|---|---|---|---|
Timer | 31 | 67162.607 | 584945.912 | 5808.580 | 378.542 | 31300.682 | 689596.323 |
Timer | 6 | 151208.192 | 384442.585 | 25177.179 | 551.339 | 132284.510 | 693663.805 |
Timer | 3 | 118192.217 | 462848.264 | 18416.713 | 694.853 | 96261.720 | 696413.767 |
Timer | 33 | 84428.756 | 568342.303 | 5946.973 | 516.956 | 31725.726 | 690960.713 |
Timer | 29 | 194235.677 | 463822.996 | 6026.209 | 891.891 | 32493.478 | 697470.251 |
Timer | 12 | 147331.527 | 426970.998 | 18147.269 | 590.404 | 94001.973 | 687042.171 |
Timer | 26 | 206079.226 | 451966.946 | 6387.025 | 1081.511 | 34710.491 | 700225.199 |
Timer | 13 | 119520.727 | 442308.723 | 19510.054 | 450.167 | 103730.795 | 685520.465 |
Timer | 25 | 197890.720 | 455604.535 | 6888.388 | 979.174 | 36978.200 | 698341.016 |
Timer | 27 | 195427.110 | 462994.971 | 6229.360 | 978.685 | 34077.716 | 699707.842 |
Timer | 4 | 100525.443 | 457112.698 | 21240.117 | 567.968 | 110874.055 | 690320.281 |
Timer | 30 | 195968.538 | 460742.005 | 6234.480 | 827.435 | 33412.842 | 697185.300 |
Timer | 7 | 84923.680 | 506766.125 | 15309.847 | 280.326 | 81109.847 | 688389.825 |
Timer | 24 | 182246.102 | 468302.946 | 6579.388 | 570.952 | 35535.514 | 693234.901 |
Timer | 5 | 159109.065 | 387777.447 | 23667.322 | 599.713 | 124177.060 | 695330.607 |
Timer | 32 | 60194.238 | 589304.610 | 6176.448 | 275.104 | 32228.432 | 688178.832 |
Timer | 19 | 122937.480 | 518342.104 | 7731.387 | 408.553 | 41672.433 | 691091.958 |
Timer | 14 | 117765.172 | 442969.256 | 19849.724 | 443.428 | 104462.701 | 685490.281 |
Timer | 20 | 108203.168 | 529161.627 | 8009.311 | 281.924 | 43320.310 | 688976.340 |
Timer | 10 | 81186.363 | 499063.640 | 16709.611 | 279.777 | 86853.296 | 684092.688 |
Timer | 22 | 126609.674 | 516353.114 | 7581.768 | 571.655 | 41083.725 | 692199.936 |
Timer | 34 | 80312.832 | 576296.029 | 5252.064 | 415.060 | 27558.386 | 689834.370 |
Timer | 35 | 88949.289 | 563356.530 | 5890.700 | 545.931 | 33161.699 | 691904.150 |
Timer | 23 | 189875.043 | 465690.975 | 6054.358 | 622.547 | 32874.200 | 695117.124 |
Timer | 8 | 78706.643 | 510926.490 | 15427.789 | 291.401 | 79937.033 | 685289.356 |
Timer | 11 | 141243.889 | 443054.503 | 16606.129 | 564.815 | 85750.634 | 687219.970 |
Timer | 9 | 84888.088 | 500973.331 | 15900.617 | 313.543 | 82549.751 | 684625.330 |
Timer | 18 | 95157.213 | 541357.110 | 7966.062 | 165.602 | 42892.018 | 687538.005 |
Timer | 1 | 177358.475 | 402236.309 | 19697.230 | 1334.880 | 103265.045 | 703891.938 |
Timer | 15 | 62995.317 | 558259.037 | 10802.165 | 515.058 | 54284.070 | 686855.647 |
Timer | 2 | 228236.442 | 313500.180 | 24887.463 | 933.521 | 131643.930 | 699201.537 |
Timer | 0 | 112908.708 | 371233.642 | 41056.940 | 207.885 | 219580.684 | 744987.859 |
Timer | 21 | 122821.680 | 520134.103 | 7515.099 | 527.000 | 40609.517 | 691607.399 |
Timer | 16 | 67990.787 | 549372.248 | 11306.038 | 418.100 | 57625.057 | 686712.230 |
Timer | 28 | 190540.991 | 465020.506 | 6446.193 | 940.650 | 34632.296 | 697580.635 |
Timer | 17 | 87072.270 | 546874.811 | 9151.320 | 517.453 | 46577.786 | 690193.639 |
Backend: Weave (Pure Nim)
Type: float32
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 224.000 GFLOP/s
Theoretical peak multi: 4032.000 GFLOP/s
Weave implementation
Collected 300 samples in 2606 ms
Average time: 7.730 ms
Stddev time: 4.088 ms
Min time: 6.000 ms
Max time: 49.000 ms
Perf: 1831.278 GFLOP/s
Timer | Worker | Exec task | Send/recv steal requests | Send/receive tasks | Enqueue/dequeue tasks | Idle | Total |
---|---|---|---|---|---|---|---|
Timer | 14 | 109369.554 | 54669.369 | 4734.948 | 504.536 | 24037.765 | 193316.173 |
Timer | 1 | 114960.762 | 48428.423 | 5055.483 | 908.356 | 24857.134 | 194210.158 |
Timer | 11 | 110646.820 | 54330.393 | 4764.520 | 477.942 | 23372.801 | 193592.475 |
Timer | 12 | 109773.613 | 54253.606 | 4813.124 | 524.362 | 23993.648 | 193358.353 |
Timer | 28 | 105961.124 | 62151.888 | 4049.457 | 385.197 | 20934.560 | 193482.226 |
Timer | 27 | 106384.220 | 62214.171 | 4098.170 | 430.601 | 20985.437 | 194112.600 |
Timer | 29 | 106537.063 | 59922.506 | 4316.233 | 383.995 | 22539.094 | 193698.891 |
Timer | 9 | 103471.652 | 61155.561 | 4743.777 | 394.963 | 23428.874 | 193194.827 |
Timer | 24 | 107585.847 | 59358.275 | 4322.620 | 395.578 | 22084.356 | 193746.676 |
Timer | 3 | 109124.665 | 54916.677 | 4961.589 | 563.696 | 24541.015 | 194107.642 |
Timer | 15 | 98598.742 | 68661.446 | 4415.255 | 354.507 | 21791.132 | 193821.082 |
Timer | 6 | 117904.793 | 46353.822 | 4907.058 | 699.744 | 24008.840 | 193874.256 |
Timer | 10 | 102731.402 | 61727.425 | 4765.494 | 393.112 | 23503.577 | 193121.010 |
Timer | 21 | 101140.485 | 66311.005 | 4125.495 | 307.755 | 21271.194 | 193155.934 |
Timer | 30 | 106786.435 | 59814.345 | 4322.485 | 416.065 | 22394.376 | 193733.705 |
Timer | 18 | 100103.064 | 69202.506 | 3913.947 | 309.698 | 20024.074 | 193553.289 |
Timer | 34 | 95681.374 | 74938.894 | 3720.593 | 244.018 | 19519.236 | 194104.115 |
Timer | 33 | 95485.940 | 74077.942 | 3730.041 | 258.876 | 19686.630 | 193239.429 |
Timer | 0 | 106869.264 | 33199.361 | 6972.569 | 499.693 | 35513.367 | 183054.254 |
Timer | 4 | 109051.911 | 54881.134 | 4907.741 | 558.008 | 24225.744 | 193624.538 |
Timer | 19 | 101141.742 | 66311.351 | 4248.495 | 307.546 | 21939.490 | 193948.624 |
Timer | 26 | 106838.971 | 60474.659 | 4231.349 | 438.965 | 21670.470 | 193654.415 |
Timer | 17 | 100023.696 | 68273.456 | 4243.542 | 368.127 | 20953.964 | 193862.785 |
Timer | 25 | 106151.732 | 59627.085 | 4462.027 | 402.793 | 22925.111 | 193568.747 |
Timer | 16 | 98058.453 | 69144.517 | 4392.969 | 375.358 | 21656.227 | 193627.525 |
Timer | 35 | 97473.807 | 73575.877 | 3740.237 | 257.905 | 19450.109 | 194497.935 |
Timer | 2 | 128860.410 | 38191.806 | 4546.105 | 1424.634 | 21788.005 | 194810.960 |
Timer | 31 | 95783.863 | 74022.193 | 3570.980 | 245.846 | 20479.436 | 194102.319 |
Timer | 7 | 102732.876 | 61364.221 | 4931.404 | 418.762 | 24432.209 | 193879.473 |
Timer | 5 | 118251.584 | 46295.838 | 4809.459 | 783.144 | 23747.035 | 193887.060 |
Timer | 20 | 100742.140 | 66050.523 | 4239.576 | 299.596 | 22101.823 | 193433.658 |
Timer | 8 | 102344.588 | 62243.333 | 4782.607 | 423.864 | 23509.571 | 193303.964 |
Timer | 23 | 107622.646 | 59494.391 | 4287.997 | 401.905 | 22084.048 | 193890.986 |
Timer | 13 | 109745.142 | 56445.965 | 4528.521 | 536.654 | 22222.765 | 193479.046 |
Timer | 32 | 96283.180 | 74841.103 | 3635.348 | 260.648 | 18624.449 | 193644.728 |
Timer | 22 | 101762.061 | 67560.103 | 3919.838 | 309.118 | 20105.694 | 193656.813 |
Very interesting and thorough analysis of TBB overhead vs static scheduling: http://www.cs.yale.edu/homes/abhishek/abhishek-taco11.pdf
After #74 and changing work-stealing victim selection, we can now reach 2.3 TFlops. Note: the change benefits coarse-grain parallelism but hurts fine-grain parallelism performance. Though Weave is already excellent on that front.
Interesting readings:
Acar, Charguéraud, Rainey, Scheduling Parallel Programs by Work Stealing with Private Deques, https://www.chargueraud.org/research/2013/ppopp/full.pdf
Li, Hu, Cheng, Zhao, Asynchronous Work Stealing on Distributed Memory Systems, www.escience.cn/system/file?fileId=79343
~Experiments with WV_MaxConcurrentStealPerWorker and WV_StealEarly show that the current defaults offer the best performance. Changing both at the same time can even lead to a livelock with task ping-pong.~ Edit: StealEarly was not implemented; fixed in 17885dd.
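For context, both knobs are compile-time defines in Weave, so an experiment presumably looks like the following (flag values purely illustrative, and the benchmark file name is a placeholder):

nim c -d:danger --threads:on -d:WV_MaxConcurrentStealPerWorker=2 -d:WV_StealEarly=16 my_gemm_bench.nim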
The idea was that workers would get a huge chunk of work upfront (similar to eager binary splitting or OpenMP scheduling) and wouldn't have to find more later.
After implementation, the following configuration is stable at 2.28~2.30 TFlops:
Some ideas:
This would give up the main benefit of Weave: being purely channel-based, which makes it easier to extend to distributed computing, easier to formally verify, and more future-hardware-proof, assuming we will see more cluster-on-chip designs where hardware message-passing primitives are available but memory coherency is too expensive to maintain, i.e. very costly atomics (for example Mellanox/Tilera 100+ ARM cores, Intel SCC, Adapteva 1024 ARM cores).
Interrupts might be tricky for embedding in other programs, say Python. They would also flush the cache at an inopportune time (when a core is running carefully tuned AVX512 instructions to use 100% of the CPU cycles).
Similar to latency-hiding work-stealing (http://reports-archive.adm.cs.cmu.edu/anon/2016/CMU-CS-16-112R.pdf), Weave might want another private queue for statically scheduled tasks created by a static parallelFor.
Tasks sent to that queue could not be stolen and there would be no load balancing. A "schedule: static" field could be added to the parallelFor construct (sketched below).
This would allow:
This can be extended to:
Both would help for IO tasks #22 with long-running event-loops on the static queue, they may spawn multithreaded computation on the work-stealing queue. Properly designing this for composition would allow not having to hardcode an event-loop in Weave, meaning it could be extended to anything that requires a long-running polling thread.
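A rough sketch of what the proposed "schedule: static" construct could look like (hypothetical: this field does not exist in Weave, and numPollers/pollers/runEventLoop are illustrative stand-ins):

parallelFor i in 0 ..< numPollers:
  schedule: static    # hypothetical field: tasks go to the private queue, never stolen
  captures: {pollers}
  runEventLoop(pollers[i])  # long-running task, exempt from load balancing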
When scheduling the following nested loops, to check whether Worker 0 distributes the i loop first (outer loop, more work sent per i iteration) or the j loop first (inner loop, less work sent per j iteration):
import
  ../weave,
  ../weave/instrumentation/loggers,
  ../weave/contexts

init(Weave)

parallelForStrided i in 0 ..< 2000, stride = 256:
  parallelForStrided j in 0 ..< 2000, stride = 128:
    captures: {i}
    log("Matrix[%d, %d] (thread %d)\n", i, j, myID())

exit(Weave)
Logs are:
Worker 14: sending own steal request to 5 (Channel 0xf4a70f90)
Worker 35: sending own steal request to 31 (Channel 0xf4a717b0)
Worker 1: sending own steal request to 35 (Channel 0xf4a718f0)
Worker 0: scheduling task.fn 0xf3aff4d4
Worker 29: sending own steal request to 20 (Channel 0xf4a71440)
Worker 16: sending own steal request to 4 (Channel 0xf4a70f40)
Worker 23: sending own steal request to 16 (Channel 0xf4a71300)
Worker 2: sending own steal request to 7 (Channel 0xf4a71030)
Worker 20: sending own steal request to 12 (Channel 0xf4a711c0)
Worker 31: sending own steal request to 2 (Channel 0xf4a70ea0)
Worker 30: sending own steal request to 9 (Channel 0xf4a710d0)
Worker 34: sending own steal request to 35 (Channel 0xf4a718f0)
Worker 32: sending own steal request to 25 (Channel 0xf4a715d0)
Worker 17: sending own steal request to 12 (Channel 0xf4a711c0)
Worker 26: sending own steal request to 9 (Channel 0xf4a710d0)
Worker 8: sending own steal request to 14 (Channel 0xf4a71260)
Worker 15: sending own steal request to 8 (Channel 0xf4a71080)
Worker 33: sending own steal request to 10 (Channel 0xf4a71120)
Worker 28: sending own steal request to 22 (Channel 0xf4a714e0)
Worker 10: sending own steal request to 23 (Channel 0xf4a71530)
Worker 7: sending own steal request to 25 (Channel 0xf4a715d0)
Worker 13: sending own steal request to 35 (Channel 0xf4a718f0)
Worker 25: sending own steal request to 29 (Channel 0xf4a71710)
Worker 9: sending own steal request to 7 (Channel 0xf4a71030)
Worker 27: sending own steal request to 35 (Channel 0xf4a718f0)
Worker 6: sending own steal request to 25 (Channel 0xf4a715d0)
Worker 24: sending own steal request to 28 (Channel 0xf4a716c0)
Worker 19: sending own steal request to 24 (Channel 0xf4a71580)
Worker 4: sending own steal request to 11 (Channel 0xf4a71170)
Worker 22: sending own steal request to 6 (Channel 0xf4a70fe0)
Worker 3: sending own steal request to 15 (Channel 0xf4a712b0)
Worker 21: sending own steal request to 16 (Channel 0xf4a71300)
Worker 11: sending own steal request to 20 (Channel 0xf4a71440)
Worker 0: preparing 1 task(s) for worker 24 with function address 0xf3aff4d4
Worker 0: sending 1 tasks (task.fn 0xf3aff4d4) to Worker 24
Worker 24: received a task with function address 0xf3aff4d4 (Channel 0x90000b60)
>>> Worker 0 enters barrier <<<
Worker 0: globalsync 1 - task from local deque
Worker 24: running task.fn 0xf3aff4d4
Worker 0: globalsync 2 - becoming a thief
Worker 0: sending own steal request to 2 (Channel 0xf4a70ea0)
Worker 12: sending own steal request to 0 (Channel 0xf4a70e00)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 22 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 22
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 23 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 23
Worker 23: received a task with function address 0xf3afede7 (Channel 0x98000b60)
Worker 23: running task.fn 0xf3afede7
Matrix[256, 0] (thread 23)
Matrix[256, 128] (thread 23)
Matrix[256, 256] (thread 23)
Matrix[256, 384] (thread 23)
Matrix[256, 512] (thread 23)
Matrix[256, 640] (thread 23)
Matrix[256, 768] (thread 23)
Matrix[256, 896] (thread 23)
Matrix[256, 1024] (thread 23)
Matrix[256, 1152] (thread 23)
Matrix[256, 1280] (thread 23)
Matrix[256, 1408] (thread 23)
Matrix[256, 1536] (thread 23)
Matrix[256, 1664] (thread 23)
Matrix[256, 1792] (thread 23)
Matrix[256, 1920] (thread 23)
Worker 24: preparing 1 task(s) for worker 0 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 0
Worker 0: received a task with function address 0xf3afede7 (Channel 0xf4a74dd0)
Worker 0: globalsync 3 - stoled tasks
Worker 0: globalsync 4 - sharing work
Worker 0: globalsync 5 - working on leftover
Worker 0: running task.fn 0xf3afede7
Matrix[512, 0] (thread 0)
Matrix[512, 128] (thread 0)
Matrix[512, 256] (thread 0)
Matrix[512, 384] (thread 0)
Matrix[512, 512] (thread 0)
Matrix[512, 640] (thread 0)
Matrix[512, 768] (thread 0)
Matrix[512, 896] (thread 0)
Matrix[512, 1024] (thread 0)
Matrix[512, 1152] (thread 0)
Matrix[512, 1280] (thread 0)
Matrix[512, 1408] (thread 0)
Matrix[512, 1536] (thread 0)
Matrix[512, 1664] (thread 0)
Matrix[512, 1792] (thread 0)
Matrix[512, 1920] (thread 0)
Worker 0: globalsync 2 - becoming a thief
Worker 0: sending own steal request to 24 (Channel 0xf4a71580)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 0 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 0
Worker 0: received a task with function address 0xf3afede7 (Channel 0xf4a74dd0)
Worker 0: globalsync 3 - stoled tasks
Worker 0: globalsync 4 - sharing work
Worker 0: globalsync 5 - working on leftover
Worker 0: running task.fn 0xf3afede7
Matrix[768, 0] (thread 0)
Matrix[768, 128] (thread 0)
Matrix[768, 256] (thread 0)
Matrix[768, 384] (thread 0)
Matrix[768, 512] (thread 0)
Matrix[768, 640] (thread 0)
Matrix[768, 768] (thread 0)
Matrix[768, 896] (thread 0)
Matrix[768, 1024] (thread 0)
Matrix[768, 1152] (thread 0)
Matrix[768, 1280] (thread 0)
Matrix[768, 1408] (thread 0)
Matrix[768, 1536] (thread 0)
Matrix[768, 1664] (thread 0)
Matrix[768, 1792] (thread 0)
Matrix[768, 1920] (thread 0)
Worker 0: globalsync 2 - becoming a thief
Worker 0: sending own steal request to 24 (Channel 0xf4a71580)
Worker 23: sending own steal request to 24 (Channel 0xf4a71580)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 0 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 0
Worker 0: received a task with function address 0xf3afede7 (Channel 0xf4a74dd0)
Worker 0: globalsync 3 - stoled tasks
Worker 0: globalsync 4 - sharing work
Worker 0: globalsync 5 - working on leftover
Worker 0: running task.fn 0xf3afede7
Matrix[1024, 0] (thread 0)
Matrix[1024, 128] (thread 0)
Matrix[1024, 256] (thread 0)
Matrix[1024, 384] (thread 0)
Worker 22: received a task with function address 0xf3afede7 (Channel 0xac000b60)
Worker 22: running task.fn 0xf3afede7
Matrix[1024, 512] (thread 0)
Matrix[1024, 640] (thread 0)
Matrix[1024, 768] (thread 0)
Matrix[1024, 896] (thread 0)
Matrix[1024, 1024] (thread 0)
Matrix[1024, 1152] (thread 0)
Matrix[1024, 1280] (thread 0)
Matrix[1024, 1408] (thread 0)
Matrix[1024, 1536] (thread 0)
Matrix[1024, 1664] (thread 0)
Matrix[1024, 1792] (thread 0)
Matrix[1024, 1920] (thread 0)
Worker 0: globalsync 2 - becoming a thief
Worker 0: sending own steal request to 24 (Channel 0xf4a71580)
Matrix[0, 0] (thread 22)
Matrix[0, 128] (thread 22)
Matrix[0, 256] (thread 22)
Matrix[0, 384] (thread 22)
Matrix[0, 512] (thread 22)
Matrix[0, 640] (thread 22)
Matrix[0, 768] (thread 22)
Matrix[0, 896] (thread 22)
Matrix[0, 1024] (thread 22)
Matrix[0, 1152] (thread 22)
Matrix[0, 1280] (thread 22)
Matrix[0, 1408] (thread 22)
Matrix[0, 1536] (thread 22)
Matrix[0, 1664] (thread 22)
Matrix[0, 1792] (thread 22)
Matrix[0, 1920] (thread 22)
Worker 24: preparing 1 task(s) for worker 23 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 23
Worker 23: received a task with function address 0xf3afede7 (Channel 0x98000b60)
Worker 23: running task.fn 0xf3afede7
Matrix[1280, 0] (thread 23)
Matrix[1280, 128] (thread 23)
Matrix[1280, 256] (thread 23)
Worker 22: sending own steal request to 24 (Channel 0xf4a71580)
Matrix[1280, 384] (thread 23)
Matrix[1280, 512] (thread 23)
Matrix[1280, 640] (thread 23)
Matrix[1280, 768] (thread 23)
Matrix[1280, 896] (thread 23)
Matrix[1280, 1024] (thread 23)
Matrix[1280, 1152] (thread 23)
Matrix[1280, 1280] (thread 23)
Matrix[1280, 1408] (thread 23)
Matrix[1280, 1536] (thread 23)
Matrix[1280, 1664] (thread 23)
Matrix[1280, 1792] (thread 23)
Matrix[1280, 1920] (thread 23)
Worker 23: sending own steal request to 24 (Channel 0xf4a71580)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 23 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 23
Worker 23: received a task with function address 0xf3afede7 (Channel 0x98000b60)
Worker 23: running task.fn 0xf3afede7
Matrix[1536, 0] (thread 23)
Matrix[1536, 128] (thread 23)
Matrix[1536, 256] (thread 23)
Matrix[1536, 384] (thread 23)
Matrix[1536, 512] (thread 23)
Matrix[1536, 640] (thread 23)
Matrix[1536, 768] (thread 23)
Matrix[1536, 896] (thread 23)
Matrix[1536, 1024] (thread 23)
Matrix[1536, 1152] (thread 23)
Matrix[1536, 1280] (thread 23)
Matrix[1536, 1408] (thread 23)
Matrix[1536, 1536] (thread 23)
Matrix[1536, 1664] (thread 23)
Matrix[1536, 1792] (thread 23)
Matrix[1536, 1920] (thread 23)
Worker 23: sending own steal request to 24 (Channel 0xf4a71580)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 23 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 23
Worker 23: received a task with function address 0xf3afede7 (Channel 0x98000b60)
Worker 23: running task.fn 0xf3afede7
Matrix[1792, 0] (thread 23)
Matrix[1792, 128] (thread 23)
Matrix[1792, 256] (thread 23)
Matrix[1792, 384] (thread 23)
Matrix[1792, 512] (thread 23)
Matrix[1792, 640] (thread 23)
Matrix[1792, 768] (thread 23)
Matrix[1792, 896] (thread 23)
Matrix[1792, 1024] (thread 23)
Matrix[1792, 1152] (thread 23)
Matrix[1792, 1280] (thread 23)
Matrix[1792, 1408] (thread 23)
Matrix[1792, 1536] (thread 23)
Matrix[1792, 1664] (thread 23)
Matrix[1792, 1792] (thread 23)
Matrix[1792, 1920] (thread 23)
The first thing we see in the logs is that, when scheduling, the whole task is sent even if it is splittable: https://github.com/mratsim/weave/blob/2cebb3d2a51c26c186672246d6e3da2a48bc4438/weave/scheduler.nim#L266-L289
The second thing we see is that there was somehow no task splitting at all, i.e. the following log entries are missing: https://github.com/mratsim/weave/blob/2cebb3d2a51c26c186672246d6e3da2a48bc4438/weave/victims.nim#L226-L245
Further analysis of the logs shows a promising way to close the gap with OpenMP on GEMM (and hopefully coarse-grain non-nested parallel loops as well) with minimal change (which does not undermine the potential value of pinned tasks).
Worker 16: sending own steal request to 4 (Channel 0x1333d200)
Worker 14: sending own steal request to 5 (Channel 0x1333d300)
Worker 18: sending own steal request to 32 (Channel 0x1333ee00)
Worker 30: sending own steal request to 9 (Channel 0x1333d700)
Worker 9: sending own steal request to 7 (Channel 0x1333d500)
Worker 20: sending own steal request to 12 (Channel 0x1333da00)
Worker 11: sending own steal request to 20 (Channel 0x1333e200)
Worker 31: sending own steal request to 2 (Channel 0x1333d000)
Worker 32: sending own steal request to 25 (Channel 0x1333e700)
Worker 13: sending own steal request to 35 (Channel 0x1333f100)
Worker 0: scheduling task.fn 0x121e1159 (0 pending)
>>> Worker 0 enters barrier <<<
Worker 0: globalsync 1 - task from local deque
Worker 0: running task.fn 0x121e1159
Worker 0: scheduling task.fn 0x121e0a6c (0 pending)
Worker 0: scheduling task.fn 0x121e0a6c (1 pending)
Worker 0: scheduling task.fn 0x121e0a6c (2 pending)
Worker 1: sending own steal request to 35 (Channel 0x1333f100)
Worker 22: sending own steal request to 6 (Channel 0x1333d400)
Worker 19: sending own steal request to 24 (Channel 0x1333e600)
Worker 8: sending own steal request to 14 (Channel 0x1333dc00)
Worker 28: sending own steal request to 22 (Channel 0x1333e400)
Worker 10: sending own steal request to 23 (Channel 0x1333e500)
Worker 12: sending own steal request to 0 (Channel 0x1333ce00)
Worker 35: sending own steal request to 31 (Channel 0x1333ed00)
Worker 17: sending own steal request to 12 (Channel 0x1333da00)
Worker 27: sending own steal request to 35 (Channel 0x1333f100)
Worker 24: sending own steal request to 28 (Channel 0x1333ea00)
Worker 5: sending own steal request to 22 (Channel 0x1333e400)
Worker 23: sending own steal request to 16 (Channel 0x1333de00)
Worker 6: sending own steal request to 25 (Channel 0x1333e700)
Worker 25: sending own steal request to 29 (Channel 0x1333eb00)
Worker 7: sending own steal request to 25 (Channel 0x1333e700)
Worker 3: sending own steal request to 15 (Channel 0x1333dd00)
Worker 21: sending own steal request to 16 (Channel 0x1333de00)
Worker 15: sending own steal request to 8 (Channel 0x1333d600)
Worker 33: sending own steal request to 10 (Channel 0x1333d800)
Worker 2: sending own steal request to 7 (Channel 0x1333d500)
Worker 4: sending own steal request to 11 (Channel 0x1333d900)
Worker 29: sending own steal request to 20 (Channel 0x1333e200)
Worker 0: preparing 1 task(s) for worker 12 with function address 0x121e0a6c
Worker 0: sending 1 tasks (task.fn 0x121e0a6c) to Worker 12
Worker 0: preparing 1 task(s) for worker 5 with function address 0x121e0a6c
Worker 26: sending own steal request to 9 (Channel 0x1333d700)
Worker 12: received a task with function address 0x121e0a6c (Channel 0x34000b60)
Worker 12: running task.fn 0x121e0a6c
Matrix[0, 0] (thread 12)
Matrix[0, 128] (thread 12)
Matrix[0, 256] (thread 12)
Matrix[0, 384] (thread 12)
Matrix[0, 512] (thread 12)
Matrix[0, 640] (thread 12)
Matrix[0, 768] (thread 12)
Matrix[0, 896] (thread 12)
Matrix[0, 1024] (thread 12)
Matrix[0, 1152] (thread 12)
Matrix[0, 1280] (thread 12)
Matrix[0, 1408] (thread 12)
Matrix[0, 1536] (thread 12)
Matrix[0, 1664] (thread 12)
Matrix[0, 1792] (thread 12)
Matrix[0, 1920] (thread 12)
Worker 12: sending own steal request to 0 (Channel 0x1333ce00)
Worker 34: sending own steal request to 35 (Channel 0x1333f100)
Worker 0: sending 1 tasks (task.fn 0x121e0a6c) to Worker 5
Worker 0: preparing 1 task(s) for worker 20 with function address 0x121e0a6c
Worker 0: sending 1 tasks (task.fn 0x121e0a6c) to Worker 20
Worker 5: received a task with function address 0x121e0a6c (Channel 0x54000b60)
Worker 5: running task.fn 0x121e0a6c
Matrix[256, 0] (thread 5)
...
Worker 0 starts working on the outer loop 0x121e1159 and generates multiple inner-loop tasks 0x121e0a6c. When it receives a steal request, it distributes those first. Hence, for nested data parallelism, we distribute work from the inner loop first, similar to a Parallel Depth-First scheduler. Unfortunately, with a message-passing based approach, that means worker threads have to wait on the producer thread more often, increasing scheduler overhead. If we distribute the outer loop first, even with tasks pending, more threads will be able to produce a significant amount of work, and thieves will fail to find work less often.
However, for non-nested parallelism, where work comes as a single loop (like the pathological histogram and logsumexp cases, #35), this changes nothing unless we also change the initial task enqueue for parallel loops, which is currently winner-takes-all (i.e. there is no splitting).
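One way to express the outer-loop-first policy (a sketch of the idea only, not Weave's actual victims.nim code; Task and sendTo are stand-ins):

import std/deques

type Task = object
  depth: int  # 0 = outer-loop task, 1 = inner-loop task, ...

proc sendTo(thiefId: int, t: Task) = discard # stand-in for the channel send

# Serve thieves from the old end of the deque, where the oldest and
# coarsest-grained tasks (outer-loop iterations) live; keep the freshly
# created inner-loop tasks for the local worker.
proc answerSteal(deque: var Deque[Task], thiefId: int) =
  if deque.len > 0:
    sendTo(thiefId, deque.popFirst())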
Another interesting log: when a child backs off, the parent will wake it and send it work when it finds some (i.e. a lifeline, see Saraswat et al., Lifeline-based Global Load Balancing, http://www.cs.columbia.edu/~martha/courses/4130/au12/p201-saraswat.pdf).
This is what happens in this log: the worker even tries to split tasks for its grandchildren, but ultimately only sends work to its direct child.
Worker 9: has 1 steal requests
Worker 9: found 6 steal requests addressed to its child 19 and grandchildren
Worker 9: 14 steps left (start: 0, current: 256, stop: 2000, stride: 128, 7 thieves)
Worker 9: Sending [1792, 2000) to worker 19
Worker 9: sending 1 tasks (task.fn 0x121e0a6c) to Worker 19
Worker 9: Continuing with [256, 1792)
Worker 9: waking up child 19
Matrix[1024, 128] (thread 9)
Matrix[1024, 256] (thread 9)
Matrix[1024, 384] (thread 9)
Matrix[1024, 512] (thread 9)
Matrix[1024, 640] (thread 9)
Matrix[1024, 768] (thread 9)
Matrix[1024, 896] (thread 9)
Matrix[1024, 1024] (thread 9)
Matrix[1024, 1152] (thread 9)
Matrix[1024, 1280] (thread 9)
Matrix[1024, 1408] (thread 9)
Matrix[1024, 1536] (thread 9)
Matrix[1024, 1664] (thread 9)
Worker 19: received a task with function address 0x121e0a6c (Channel 0x1c000b60)
Worker 19: running task.fn 0x121e0a6c
Matrix[1024, 1792] (thread 19)
Worker 9: sending own steal request to 0 (Channel 0x1333ce00)
Worker 15: sends state passively WAITING to its parent worker 7
Matrix[1024, 1920] (thread 19)
Worker 19: sending own steal request to 9 (Channel 0x1333d700)
Very in-depth analysis of MKL, MKL-DNN, and Eigen parallelism and threadpools in the context of deep-learning workloads: https://arxiv.org/pdf/1908.04705.pdf
After #83, Weave achieves 2.53TFlops.
Some figures to keep a realistic performance target:
Perfect multithreaded scaling on 18 cores would be 2.88 TFlops; reaching over 3 TFlops would require significantly improving the single-threaded GEMM kernel, which is out of scope for Weave and should be tackled in Laser. Perfect scaling is probably not possible due to hyperthreading thrashing the cache and TLB of sibling threads, but achieving Laser's OpenMP performance of 2.7 TFlops is a nice target.
The theoretical single-core peak of 224 GFLOP/s is computed by multiplying:
Component | Multiplier | Cumulated | Comment |
---|---|---|---|
CPU Frequency | 3.5 billion | 3.5 GFlops | Overclocked i9-9980XE with 3.5 GHz AVX512 all-core turbo |
FMA - Fused Multiply Add per cycle | 2 | 7.0 GFlops | (i9 and Xeon Gold/Platinum have 2x AVX512 per core while Xeon Bronze and Silver only have 1x) |
FLOP per FMA | 2 | 14.0 GFlops | 1 FMA = 1 add + 1 mul |
Vector width | 16 | 224.0 GFlops | AVX512 SIMD processes 16 float32 at a time |
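As a sanity check with the table's own figures: 3.5 GHz × 2 FMA units × 2 FLOP per FMA × 16 lanes = 224 GFLOP/s per core, and 224 × 18 cores = 4032 GFLOP/s, matching the theoretical peaks reported by the benchmark harness above.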
It seems that without the Backoff mechanism we actually do reach the 2.7 TFlops mark:
$ ./build/weave_master_no_backoff
Backend: Weave (Pure Nim)
Type: float32
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 224.000 GFLOP/s
Theoretical peak multi: 4032.000 GFLOP/s
Weave implementation
Collected 300 samples in 2002 ms
Average time: 5.277 ms
Stddev time: 0.954 ms
Min time: 5.000 ms
Max time: 15.000 ms
Perf: 2682.712 GFLOP/s
Closed by #94
We can reach over 2.8 TFlops from a 160~170 GFlops single-threaded baseline (a 17.5x speedup), while MKL reaches 3.4 TFlops from a 210 GFlops baseline (16.2x), OpenBLAS 2.8 TFlops from a 193 GFlops baseline (14.5x), and MKL-DNN 3.0 TFlops from a 200 GFlops baseline (15x speedup).
So the issue on the multithreading front, and perf vs Laser/OpenMP, is solved; even with the slower kernel, performance is very good on high-core-count machines.
Now there is still an issue with nesting. The new code still uses syncRoot, which prevents nesting in other parallel regions (like batched matmul). A simple nestable kernel would await/sync the outer and inner loops, but it seems this leads to load imbalance, as the performance reached is at most 1.8 TFlops.
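A sketch of that nestable shape (a simplified toy assuming Weave's awaitable loop handles; Tiles and macroKernel are placeholders, not the benchmark kernel):

import weave

type Tiles = object
  m, n: int

proc macroKernel(t: Tiles, ic, jc: int) = discard # placeholder for the tiled kernel

proc nestableGemm(t: Tiles) =
  parallelFor ic in 0 ..< t.m:
    captures: {t}
    awaitable: icLoop
    parallelFor jc in 0 ..< t.n:
      captures: {ic, t}
      awaitable: jcLoop
      macroKernel(t, ic, jc)
    discard sync(jcLoop)  # assumption: sync on a loop handle returns a bool
  discard sync(icLoop)    # no syncRoot, so the matmul can nest in other parallel regions

init(Weave)
nestableGemm(Tiles(m: 8, n: 8))
exit(Weave)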
Maybe the kernel could be redesigned to take full advantage of the dataflow approach, but that is out of scope for this issue. https://github.com/mratsim/weave/blob/85b90b3f95c75615629aeea34254c47c0de643f8/benchmarks/matmul_gemm_blas/gemm_pure_nim/gemm_weave.nim#L154-L188
Somewhat related to #31 and very probably related to #35
~By giving Laser (OpenMP-based) time to bring the CPU to full speed, it can reach 2.23 TFlops on my CPU:~ see below, nimsuggest was running at 100%
Intel MKL is at 2.85 TFlops
OpenBLAS is at 1.52 TFlops
Unfortunately Weave is stuck at 0.6 TFlops.
First look
On the right, more than half of the instructions for Weave are not floating-point instructions. It also seems the CPU frequency is at 2.7 GHz in the case of Weave while Laser reaches 3.6 GHz. My BIOS settings are 4.1 GHz all-core turbo for normal code, 4.0 GHz for AVX2 and 3.5 GHz for AVX512 code.
Call stack
Looking into the call stack, it seems we are paying the price of dynamic scheduling :/ as the time spent in the GEMM kernel itself is similar.
Conclusion
We have very high overhead for coarse-grained parallelism, which is where OpenMP shines. I suspect this is also the reason behind #35.
Suspicions
Additionally, I suspect the runtime is polluting the L1 cache with runtime data, while GEMM has been carefully tuned to take advantage of the L1 and L2 caches. Using smaller tiles might help (similarly, avoiding hyperthreading, which causes the same kind of pollution, would probably help as well).