mratsim / weave

A state-of-the-art multithreading runtime: message-passing based, fast, scalable, ultra-low overhead

BLAS matmul vs Laser, Intel MKL, MKL-DNN, OpenBLAS and co #68

Closed mratsim closed 4 years ago

mratsim commented 4 years ago

Somewhat related to #31 and very probably related to #35

~By giving Laser (OpenMP-based) time to get the CPU to full speed, it can reach 2.23 TFlops on my CPU:~ see below, nimsuggest was running at 100%

Backend:                        Laser (Pure Nim) + OpenMP
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Laser production implementation
Collected 300 samples in 2245 ms
Average time: 6.327 ms
Stddev  time: 2.929 ms
Min     time: 5.000 ms
Max     time: 48.000 ms
Perf:         2237.478 GFLOP/s
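As a sanity check, the header figures can be re-derived from the matrix shapes (assuming fp32 and counting only the A and B inputs for memory traffic, which matches the 29.491 MB figure):

```python
# Re-derive the benchmark header figures for a 1920x1920x1920 fp32 GEMM.
M = N = K = 1920

flops = 2 * M * N * K              # one multiply + one add per inner-loop step
in_bytes = (M * K + K * N) * 4     # A and B inputs only, fp32 = 4 bytes
intensity = flops / in_bytes       # FLOP per byte of input read

print(flops / 1e6)      # 14155.776 million operations
print(in_bytes / 1e6)   # 29.4912 MB (header rounds to 29.491)
print(intensity)        # 480.0 FLOP/byte

# Perf is simply total FLOPs over the average run time:
avg_time_s = 6.327e-3              # Laser's average time from the stats above
print(flops / avg_time_s / 1e9)    # ~2237 GFLOP/s
```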

Intel MKL is at 2.85 TFlops

Backend:                        Intel MKL
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Intel MKL benchmark
Collected 300 samples in 1842 ms
Average time: 4.970 ms
Stddev  time: 3.956 ms
Min     time: 4.000 ms
Max     time: 70.000 ms
Perf:         2848.245 GFLOP/s

OpenBLAS is at 1.52 TFlops

Backend:                        OpenBLAS
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

OpenBLAS benchmark
Collected 300 samples in 3124 ms
Average time: 9.300 ms
Stddev  time: 2.637 ms
Min     time: 9.000 ms
Max     time: 40.000 ms
Perf:         1522.126 GFLOP/s

Unfortunately Weave is stuck at 0.6 TFlops.

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 300 samples in 7686 ms
Average time: 24.553 ms
Stddev  time: 4.426 ms
Min     time: 17.000 ms
Max     time: 60.000 ms
Perf:         576.532 GFLOP/s

First look

image

On the right, more than half of the instructions for Weave are not floating-point instructions. Also, the CPU frequency seems to sit at 2.7 GHz for Weave while Laser reaches 3.6 GHz. My BIOS settings are 4.1 GHz all-core turbo for normal code, 4.0 GHz for AVX2 and 3.5 GHz for AVX-512 code.

Call stack

image

Looking into the call stack, it seems we are paying the price of dynamic scheduling :/ as the time spent inside the GEMM kernel itself is similar.

Conclusion

We have very high overhead for coarse-grained parallelism where OpenMP shines. I suspect this is also the reason behind #35.

Suspicions

Additionally, I suspect the runtime is polluting the L1 cache with runtime data, while GEMM has been carefully tuned to take full advantage of the L1 and L2 caches. Using smaller tiles might help (similarly, avoiding hyperthreading, which causes the same kind of pollution, would probably help as well).
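For intuition on why cache pollution hurts, here is the footprint math for BLIS-style GEMM blocking, with illustrative tile sizes (not Laser's actual parameters): the micro-kernel's working set is sized to nearly fill L1/L2, so any runtime data evicted into those caches forces the kernel to reload its panels.

```python
# Illustrative BLIS-style blocking footprints (hypothetical tile sizes).
fp32 = 4
mr, nr = 6, 16      # micro-kernel register tile (hypothetical)
kc, mc = 256, 128   # blocking factors along K and M (hypothetical)

l1_panel = kc * nr * fp32   # B micro-panel streamed through L1
l2_block = mc * kc * fp32   # A block kept resident in L2

print(l1_panel)   # 16384 bytes: half of a 32 KiB L1d, little slack left
print(l2_block)   # 131072 bytes of L2 that the runtime must not evict
```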

mratsim commented 4 years ago

See #69

Laser can now reach 2.7 TFlops with GCC under the static, guided and dynamic schedules. Clang is at 2.1~2.2 TFlops.

A nimsuggest process was probably hogging my CPU

Backend:                        Laser (Pure Nim) + OpenMP
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Laser production implementation
Collected 300 samples in 1961 ms
Average time: 5.280 ms
Stddev  time: 2.303 ms
Min     time: 5.000 ms
Max     time: 36.000 ms
Perf:         2681.018 GFLOP/s
mratsim commented 4 years ago

Some consolation: Intel MKL built with TBB misses the 1 TFlops mark as well:

Backend:                        Intel MKL
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Intel MKL benchmark
Collected 300 samples in 4576 ms
Average time: 14.217 ms
Stddev  time: 3.396 ms
Min     time: 8.000 ms
Max     time: 46.000 ms
Perf:         995.717 GFLOP/s

Change mkl_gemm.nim.cfg to:

clibdir:"/opt/intel/mkl/lib/intel64"
passl:"/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.a"
passl:"-lmkl_core"
passl:"-lmkl_tbb_thread"
passl:"-ltbb"
dynlibOverride:"mkl_intel_lp64"

but it reaches 3.2 TFlops with Intel OpenMP

Backend:                        Intel MKL
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Intel MKL benchmark
Collected 300 samples in 1698 ms
Average time: 4.420 ms
Stddev  time: 2.937 ms
Min     time: 4.000 ms
Max     time: 53.000 ms
Perf:         3202.664 GFLOP/s
mratsim commented 4 years ago

Main issue fixed in 36507ec by exposing more parallelism: parallelizing 3 nested loops instead of just 2. Perf is now 1.8 to 2 TFlops on my machine, which is much more reasonable:

Important: Just adding loadBalance(Weave) instead of parallelizing the inner loop is not enough.
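The intuition behind the fix can be sketched with a quick task-count estimate (illustrative tile size, not the actual kernel parameters): with coarse tiles, one or two parallelized loops may simply not generate enough tasks to keep 36 threads busy.

```python
# Task counts when parallelizing 1, 2 or 3 nested loops over a 1920^3 GEMM
# with an illustrative tile size of 256.
import math

M = N = K = 1920
tile = 256
workers = 36

chunks_1d = math.ceil(M / tile)                 # 8 tasks: most workers starve
chunks_2d = chunks_1d * math.ceil(N / tile)     # 64 tasks
chunks_3d = chunks_2d * math.ceil(K / tile)     # 512 tasks: plenty to steal

print(chunks_1d, chunks_2d, chunks_3d, "for", workers, "workers")
```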

Stats

Before

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 300 samples in 9341 ms
Average time: 30.083 ms
Stddev  time: 7.052 ms
Min     time: 20.000 ms
Max     time: 75.000 ms
Perf:         470.552 GFLOP/s
Worker | Exec task | Send/recv steal requests | Send/recv tasks | Enqueue/dequeue tasks | Idle | Total
Timer 31 67162.607 584945.912 5808.580 378.542 31300.682 689596.323
Timer 6 151208.192 384442.585 25177.179 551.339 132284.510 693663.805
Timer 3 118192.217 462848.264 18416.713 694.853 96261.720 696413.767
Timer 33 84428.756 568342.303 5946.973 516.956 31725.726 690960.713
Timer 29 194235.677 463822.996 6026.209 891.891 32493.478 697470.251
Timer 12 147331.527 426970.998 18147.269 590.404 94001.973 687042.171
Timer 26 206079.226 451966.946 6387.025 1081.511 34710.491 700225.199
Timer 13 119520.727 442308.723 19510.054 450.167 103730.795 685520.465
Timer 25 197890.720 455604.535 6888.388 979.174 36978.200 698341.016
Timer 27 195427.110 462994.971 6229.360 978.685 34077.716 699707.842
Timer 4 100525.443 457112.698 21240.117 567.968 110874.055 690320.281
Timer 30 195968.538 460742.005 6234.480 827.435 33412.842 697185.300
Timer 7 84923.680 506766.125 15309.847 280.326 81109.847 688389.825
Timer 24 182246.102 468302.946 6579.388 570.952 35535.514 693234.901
Timer 5 159109.065 387777.447 23667.322 599.713 124177.060 695330.607
Timer 32 60194.238 589304.610 6176.448 275.104 32228.432 688178.832
Timer 19 122937.480 518342.104 7731.387 408.553 41672.433 691091.958
Timer 14 117765.172 442969.256 19849.724 443.428 104462.701 685490.281
Timer 20 108203.168 529161.627 8009.311 281.924 43320.310 688976.340
Timer 10 81186.363 499063.640 16709.611 279.777 86853.296 684092.688
Timer 22 126609.674 516353.114 7581.768 571.655 41083.725 692199.936
Timer 34 80312.832 576296.029 5252.064 415.060 27558.386 689834.370
Timer 35 88949.289 563356.530 5890.700 545.931 33161.699 691904.150
Timer 23 189875.043 465690.975 6054.358 622.547 32874.200 695117.124
Timer 8 78706.643 510926.490 15427.789 291.401 79937.033 685289.356
Timer 11 141243.889 443054.503 16606.129 564.815 85750.634 687219.970
Timer 9 84888.088 500973.331 15900.617 313.543 82549.751 684625.330
Timer 18 95157.213 541357.110 7966.062 165.602 42892.018 687538.005
Timer 1 177358.475 402236.309 19697.230 1334.880 103265.045 703891.938
Timer 15 62995.317 558259.037 10802.165 515.058 54284.070 686855.647
Timer 2 228236.442 313500.180 24887.463 933.521 131643.930 699201.537
Timer 0 112908.708 371233.642 41056.940 207.885 219580.684 744987.859
Timer 21 122821.680 520134.103 7515.099 527.000 40609.517 691607.399
Timer 16 67990.787 549372.248 11306.038 418.100 57625.057 686712.230
Timer 28 190540.991 465020.506 6446.193 940.650 34632.296 697580.635
Timer 17 87072.270 546874.811 9151.320 517.453 46577.786 690193.639

After

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 300 samples in 2606 ms
Average time: 7.730 ms
Stddev  time: 4.088 ms
Min     time: 6.000 ms
Max     time: 49.000 ms
Perf:         1831.278 GFLOP/s
Worker | Exec task | Send/recv steal requests | Send/recv tasks | Enqueue/dequeue tasks | Idle | Total
Timer 14 109369.554 54669.369 4734.948 504.536 24037.765 193316.173
Timer 1 114960.762 48428.423 5055.483 908.356 24857.134 194210.158
Timer 11 110646.820 54330.393 4764.520 477.942 23372.801 193592.475
Timer 12 109773.613 54253.606 4813.124 524.362 23993.648 193358.353
Timer 28 105961.124 62151.888 4049.457 385.197 20934.560 193482.226
Timer 27 106384.220 62214.171 4098.170 430.601 20985.437 194112.600
Timer 29 106537.063 59922.506 4316.233 383.995 22539.094 193698.891
Timer 9 103471.652 61155.561 4743.777 394.963 23428.874 193194.827
Timer 24 107585.847 59358.275 4322.620 395.578 22084.356 193746.676
Timer 3 109124.665 54916.677 4961.589 563.696 24541.015 194107.642
Timer 15 98598.742 68661.446 4415.255 354.507 21791.132 193821.082
Timer 6 117904.793 46353.822 4907.058 699.744 24008.840 193874.256
Timer 10 102731.402 61727.425 4765.494 393.112 23503.577 193121.010
Timer 21 101140.485 66311.005 4125.495 307.755 21271.194 193155.934
Timer 30 106786.435 59814.345 4322.485 416.065 22394.376 193733.705
Timer 18 100103.064 69202.506 3913.947 309.698 20024.074 193553.289
Timer 34 95681.374 74938.894 3720.593 244.018 19519.236 194104.115
Timer 33 95485.940 74077.942 3730.041 258.876 19686.630 193239.429
Timer 0 106869.264 33199.361 6972.569 499.693 35513.367 183054.254
Timer 4 109051.911 54881.134 4907.741 558.008 24225.744 193624.538
Timer 19 101141.742 66311.351 4248.495 307.546 21939.490 193948.624
Timer 26 106838.971 60474.659 4231.349 438.965 21670.470 193654.415
Timer 17 100023.696 68273.456 4243.542 368.127 20953.964 193862.785
Timer 25 106151.732 59627.085 4462.027 402.793 22925.111 193568.747
Timer 16 98058.453 69144.517 4392.969 375.358 21656.227 193627.525
Timer 35 97473.807 73575.877 3740.237 257.905 19450.109 194497.935
Timer 2 128860.410 38191.806 4546.105 1424.634 21788.005 194810.960
Timer 31 95783.863 74022.193 3570.980 245.846 20479.436 194102.319
Timer 7 102732.876 61364.221 4931.404 418.762 24432.209 193879.473
Timer 5 118251.584 46295.838 4809.459 783.144 23747.035 193887.060
Timer 20 100742.140 66050.523 4239.576 299.596 22101.823 193433.658
Timer 8 102344.588 62243.333 4782.607 423.864 23509.571 193303.964
Timer 23 107622.646 59494.391 4287.997 401.905 22084.048 193890.986
Timer 13 109745.142 56445.965 4528.521 536.654 22222.765 193479.046
Timer 32 96283.180 74841.103 3635.348 260.648 18624.449 193644.728
Timer 22 101762.061 67560.103 3919.838 309.118 20105.694 193656.813
mratsim commented 4 years ago

Very interesting and thorough analysis of TBB overhead vs static scheduling: http://www.cs.yale.edu/homes/abhishek/abhishek-taco11.pdf

mratsim commented 4 years ago

After #74 and a change to work-stealing victim selection, we can now reach 2.3 TFlops. Note: the change benefits coarse-grained parallelism but hurts fine-grained parallelism performance, though Weave is already excellent on that front.

Interesting readings:

image

mratsim commented 4 years ago

~Experiments with WV_MaxConcurrentStealPerWorker and WV_StealEarly show that the current defaults offer the best performance. Changing both at the same time can even lead to a livelock with task ping-pong.~ Edit: StealEarly was not implemented; fixed in 17885dd

The idea was that workers would get a huge chunk of work upfront (similar to eager binary splitting or OpenMP scheduling) and wouldn't have to find more later.

With it implemented, the following configuration is stable at 2.28~2.30 TFlops

mratsim commented 4 years ago

Some ideas:

Rejected

  1. Using private and public deques, similar to Wool and Lace (https://research.utwente.nl/files/5346297/lace.pdf / https://github.com/trolando/lace), as described in Dinan's paper Scalable Work-Stealing: https://www.researchgate.net/publication/220781636_Scalable_work_stealing

This would give up the main benefit of Weave: being purely channel-based, which makes it easier to extend to distributed computing, easier to formally verify, and more future-hardware-proof, assuming we will see more cluster-on-chip designs where hardware message-passing primitives are available but memory coherency is too expensive to maintain, i.e. very costly atomics (for example Mellanox/Tilera 100+ ARM cores, Intel SCC, Adapteva's 1024-core chips).

  2. Interrupts.

Interrupts might be tricky when embedding in other programs, say Python. They would also flush the cache at an inopportune time (when a core is running carefully tuned AVX-512 instructions to use 100% of the CPU cycles).

Ideas

Similar to latency-hiding work-stealing (http://reports-archive.adm.cs.cmu.edu/anon/2016/CMU-CS-16-112R.pdf), we might want another private queue for statically scheduled tasks created by a static parallelFor.

Tasks sent to that queue could not be stolen and there would be no load balancing. A "schedule: static" field could be added to the parallelFor construct.

This would allow:

This can be extended to:

Both would help with IO tasks (#22): long-running event loops would live on the static queue and could spawn multithreaded computation on the work-stealing queue. Properly designing this for composition would avoid hardcoding an event loop in Weave, meaning it could be extended to anything that requires a long-running polling thread.
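A minimal sketch of the two-queue idea (hypothetical names and API, not Weave's): each worker owns a stealable deque for normal tasks plus a private queue for pinned/statically scheduled tasks that thieves never see.

```python
from collections import deque

class Worker:
    def __init__(self):
        self.stealable = deque()   # thieves may take from the far end
        self.private = deque()     # pinned tasks: no load balancing

    def schedule(self, task, pinned=False):
        (self.private if pinned else self.stealable).append(task)

    def next_task(self):
        # Pinned work first keeps long-running tasks (e.g. an event loop
        # servicing IO) on their designated thread.
        if self.private:
            return self.private.popleft()
        if self.stealable:
            return self.stealable.pop()
        return None   # would send a steal request in the real runtime

    def steal_from(self, victim):
        # Thieves only ever see the stealable deque.
        if victim.stealable:
            self.stealable.append(victim.stealable.popleft())

w, thief = Worker(), Worker()
w.schedule("event-loop", pinned=True)
w.schedule("compute")
thief.steal_from(w)
print(w.next_task())      # event-loop
print(thief.next_task())  # compute
```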

Interesting logs

When scheduling the following nested loops, trying to see whether Worker 0 distributes the i loop first (outer loop, more work sent per i iteration) or the j loop first (inner loop, less work sent per j iteration):

import
  ../weave,
  ../weave/instrumentation/loggers,
  ../weave/contexts

init(Weave)

parallelForStrided i in 0 ..< 2000, stride = 256:
  parallelForStrided j in 0 ..< 2000, stride = 128:
    captures: {i}
    log("Matrix[%d, %d] (thread %d)\n", i, j, myID())

exit(Weave)

Logs are:

Worker 14: sending own steal request to  5 (Channel 0xf4a70f90)
Worker 35: sending own steal request to 31 (Channel 0xf4a717b0)
Worker  1: sending own steal request to 35 (Channel 0xf4a718f0)
Worker  0: scheduling task.fn 0xf3aff4d4
Worker 29: sending own steal request to 20 (Channel 0xf4a71440)
Worker 16: sending own steal request to  4 (Channel 0xf4a70f40)
Worker 23: sending own steal request to 16 (Channel 0xf4a71300)
Worker  2: sending own steal request to  7 (Channel 0xf4a71030)
Worker 20: sending own steal request to 12 (Channel 0xf4a711c0)
Worker 31: sending own steal request to  2 (Channel 0xf4a70ea0)
Worker 30: sending own steal request to  9 (Channel 0xf4a710d0)
Worker 34: sending own steal request to 35 (Channel 0xf4a718f0)
Worker 32: sending own steal request to 25 (Channel 0xf4a715d0)
Worker 17: sending own steal request to 12 (Channel 0xf4a711c0)
Worker 26: sending own steal request to  9 (Channel 0xf4a710d0)
Worker  8: sending own steal request to 14 (Channel 0xf4a71260)
Worker 15: sending own steal request to  8 (Channel 0xf4a71080)
Worker 33: sending own steal request to 10 (Channel 0xf4a71120)
Worker 28: sending own steal request to 22 (Channel 0xf4a714e0)
Worker 10: sending own steal request to 23 (Channel 0xf4a71530)
Worker  7: sending own steal request to 25 (Channel 0xf4a715d0)
Worker 13: sending own steal request to 35 (Channel 0xf4a718f0)
Worker 25: sending own steal request to 29 (Channel 0xf4a71710)
Worker  9: sending own steal request to  7 (Channel 0xf4a71030)
Worker 27: sending own steal request to 35 (Channel 0xf4a718f0)
Worker  6: sending own steal request to 25 (Channel 0xf4a715d0)
Worker 24: sending own steal request to 28 (Channel 0xf4a716c0)
Worker 19: sending own steal request to 24 (Channel 0xf4a71580)
Worker  4: sending own steal request to 11 (Channel 0xf4a71170)
Worker 22: sending own steal request to  6 (Channel 0xf4a70fe0)
Worker  3: sending own steal request to 15 (Channel 0xf4a712b0)
Worker 21: sending own steal request to 16 (Channel 0xf4a71300)
Worker 11: sending own steal request to 20 (Channel 0xf4a71440)
Worker  0: preparing 1 task(s) for worker 24 with function address 0xf3aff4d4
Worker  0: sending 1 tasks (task.fn 0xf3aff4d4) to Worker 24
Worker 24: received a task with function address 0xf3aff4d4 (Channel 0x90000b60)
>>> Worker  0 enters barrier <<<
Worker  0: globalsync 1 - task from local deque
Worker 24: running task.fn 0xf3aff4d4
Worker  0: globalsync 2 - becoming a thief
Worker  0: sending own steal request to  2 (Channel 0xf4a70ea0)
Worker 12: sending own steal request to  0 (Channel 0xf4a70e00)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 22 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 22
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 23 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 23
Worker 23: received a task with function address 0xf3afede7 (Channel 0x98000b60)
Worker 23: running task.fn 0xf3afede7
Matrix[256, 0] (thread 23)
Matrix[256, 128] (thread 23)
Matrix[256, 256] (thread 23)
Matrix[256, 384] (thread 23)
Matrix[256, 512] (thread 23)
Matrix[256, 640] (thread 23)
Matrix[256, 768] (thread 23)
Matrix[256, 896] (thread 23)
Matrix[256, 1024] (thread 23)
Matrix[256, 1152] (thread 23)
Matrix[256, 1280] (thread 23)
Matrix[256, 1408] (thread 23)
Matrix[256, 1536] (thread 23)
Matrix[256, 1664] (thread 23)
Matrix[256, 1792] (thread 23)
Matrix[256, 1920] (thread 23)
Worker 24: preparing 1 task(s) for worker  0 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker  0
Worker  0: received a task with function address 0xf3afede7 (Channel 0xf4a74dd0)
Worker  0: globalsync 3 - stoled tasks
Worker  0: globalsync 4 - sharing work
Worker  0: globalsync 5 - working on leftover
Worker  0: running task.fn 0xf3afede7
Matrix[512, 0] (thread 0)
Matrix[512, 128] (thread 0)
Matrix[512, 256] (thread 0)
Matrix[512, 384] (thread 0)
Matrix[512, 512] (thread 0)
Matrix[512, 640] (thread 0)
Matrix[512, 768] (thread 0)
Matrix[512, 896] (thread 0)
Matrix[512, 1024] (thread 0)
Matrix[512, 1152] (thread 0)
Matrix[512, 1280] (thread 0)
Matrix[512, 1408] (thread 0)
Matrix[512, 1536] (thread 0)
Matrix[512, 1664] (thread 0)
Matrix[512, 1792] (thread 0)
Matrix[512, 1920] (thread 0)
Worker  0: globalsync 2 - becoming a thief
Worker  0: sending own steal request to 24 (Channel 0xf4a71580)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker  0 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker  0
Worker  0: received a task with function address 0xf3afede7 (Channel 0xf4a74dd0)
Worker  0: globalsync 3 - stoled tasks
Worker  0: globalsync 4 - sharing work
Worker  0: globalsync 5 - working on leftover
Worker  0: running task.fn 0xf3afede7
Matrix[768, 0] (thread 0)
Matrix[768, 128] (thread 0)
Matrix[768, 256] (thread 0)
Matrix[768, 384] (thread 0)
Matrix[768, 512] (thread 0)
Matrix[768, 640] (thread 0)
Matrix[768, 768] (thread 0)
Matrix[768, 896] (thread 0)
Matrix[768, 1024] (thread 0)
Matrix[768, 1152] (thread 0)
Matrix[768, 1280] (thread 0)
Matrix[768, 1408] (thread 0)
Matrix[768, 1536] (thread 0)
Matrix[768, 1664] (thread 0)
Matrix[768, 1792] (thread 0)
Matrix[768, 1920] (thread 0)
Worker  0: globalsync 2 - becoming a thief
Worker  0: sending own steal request to 24 (Channel 0xf4a71580)
Worker 23: sending own steal request to 24 (Channel 0xf4a71580)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker  0 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker  0
Worker  0: received a task with function address 0xf3afede7 (Channel 0xf4a74dd0)
Worker  0: globalsync 3 - stoled tasks
Worker  0: globalsync 4 - sharing work
Worker  0: globalsync 5 - working on leftover
Worker  0: running task.fn 0xf3afede7
Matrix[1024, 0] (thread 0)
Matrix[1024, 128] (thread 0)
Matrix[1024, 256] (thread 0)
Matrix[1024, 384] (thread 0)
Worker 22: received a task with function address 0xf3afede7 (Channel 0xac000b60)
Worker 22: running task.fn 0xf3afede7
Matrix[1024, 512] (thread 0)
Matrix[1024, 640] (thread 0)
Matrix[1024, 768] (thread 0)
Matrix[1024, 896] (thread 0)
Matrix[1024, 1024] (thread 0)
Matrix[1024, 1152] (thread 0)
Matrix[1024, 1280] (thread 0)
Matrix[1024, 1408] (thread 0)
Matrix[1024, 1536] (thread 0)
Matrix[1024, 1664] (thread 0)
Matrix[1024, 1792] (thread 0)
Matrix[1024, 1920] (thread 0)
Worker  0: globalsync 2 - becoming a thief
Worker  0: sending own steal request to 24 (Channel 0xf4a71580)
Matrix[0, 0] (thread 22)
Matrix[0, 128] (thread 22)
Matrix[0, 256] (thread 22)
Matrix[0, 384] (thread 22)
Matrix[0, 512] (thread 22)
Matrix[0, 640] (thread 22)
Matrix[0, 768] (thread 22)
Matrix[0, 896] (thread 22)
Matrix[0, 1024] (thread 22)
Matrix[0, 1152] (thread 22)
Matrix[0, 1280] (thread 22)
Matrix[0, 1408] (thread 22)
Matrix[0, 1536] (thread 22)
Matrix[0, 1664] (thread 22)
Matrix[0, 1792] (thread 22)
Matrix[0, 1920] (thread 22)
Worker 24: preparing 1 task(s) for worker 23 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 23
Worker 23: received a task with function address 0xf3afede7 (Channel 0x98000b60)
Worker 23: running task.fn 0xf3afede7
Matrix[1280, 0] (thread 23)
Matrix[1280, 128] (thread 23)
Matrix[1280, 256] (thread 23)
Worker 22: sending own steal request to 24 (Channel 0xf4a71580)
Matrix[1280, 384] (thread 23)
Matrix[1280, 512] (thread 23)
Matrix[1280, 640] (thread 23)
Matrix[1280, 768] (thread 23)
Matrix[1280, 896] (thread 23)
Matrix[1280, 1024] (thread 23)
Matrix[1280, 1152] (thread 23)
Matrix[1280, 1280] (thread 23)
Matrix[1280, 1408] (thread 23)
Matrix[1280, 1536] (thread 23)
Matrix[1280, 1664] (thread 23)
Matrix[1280, 1792] (thread 23)
Matrix[1280, 1920] (thread 23)
Worker 23: sending own steal request to 24 (Channel 0xf4a71580)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 23 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 23
Worker 23: received a task with function address 0xf3afede7 (Channel 0x98000b60)
Worker 23: running task.fn 0xf3afede7
Matrix[1536, 0] (thread 23)
Matrix[1536, 128] (thread 23)
Matrix[1536, 256] (thread 23)
Matrix[1536, 384] (thread 23)
Matrix[1536, 512] (thread 23)
Matrix[1536, 640] (thread 23)
Matrix[1536, 768] (thread 23)
Matrix[1536, 896] (thread 23)
Matrix[1536, 1024] (thread 23)
Matrix[1536, 1152] (thread 23)
Matrix[1536, 1280] (thread 23)
Matrix[1536, 1408] (thread 23)
Matrix[1536, 1536] (thread 23)
Matrix[1536, 1664] (thread 23)
Matrix[1536, 1792] (thread 23)
Matrix[1536, 1920] (thread 23)
Worker 23: sending own steal request to 24 (Channel 0xf4a71580)
Worker 24: scheduling task.fn 0xf3afede7
Worker 24: preparing 1 task(s) for worker 23 with function address 0xf3afede7
Worker 24: sending 1 tasks (task.fn 0xf3afede7) to Worker 23
Worker 23: received a task with function address 0xf3afede7 (Channel 0x98000b60)
Worker 23: running task.fn 0xf3afede7
Matrix[1792, 0] (thread 23)
Matrix[1792, 128] (thread 23)
Matrix[1792, 256] (thread 23)
Matrix[1792, 384] (thread 23)
Matrix[1792, 512] (thread 23)
Matrix[1792, 640] (thread 23)
Matrix[1792, 768] (thread 23)
Matrix[1792, 896] (thread 23)
Matrix[1792, 1024] (thread 23)
Matrix[1792, 1152] (thread 23)
Matrix[1792, 1280] (thread 23)
Matrix[1792, 1408] (thread 23)
Matrix[1792, 1536] (thread 23)
Matrix[1792, 1664] (thread 23)
Matrix[1792, 1792] (thread 23)
Matrix[1792, 1920] (thread 23)

The first thing we see in the logs is that, when scheduling, the whole task is sent even if it is splittable: https://github.com/mratsim/weave/blob/2cebb3d2a51c26c186672246d6e3da2a48bc4438/weave/scheduler.nim#L266-L289

The second thing we see is that somehow there was no task splitting at all, i.e. the following log entries are missing: https://github.com/mratsim/weave/blob/2cebb3d2a51c26c186672246d6e3da2a48bc4438/weave/victims.nim#L226-L245

mratsim commented 4 years ago

Further analysis of the logs shows a promising way to close the gap with OpenMP on GEMM (and hopefully on coarse-grained non-nested parallel loops as well) with a minimal change, one that does not undermine the potential value of pinned tasks.

Worker 16: sending own steal request to  4 (Channel 0x1333d200)
Worker 14: sending own steal request to  5 (Channel 0x1333d300)
Worker 18: sending own steal request to 32 (Channel 0x1333ee00)
Worker 30: sending own steal request to  9 (Channel 0x1333d700)
Worker  9: sending own steal request to  7 (Channel 0x1333d500)
Worker 20: sending own steal request to 12 (Channel 0x1333da00)
Worker 11: sending own steal request to 20 (Channel 0x1333e200)
Worker 31: sending own steal request to  2 (Channel 0x1333d000)
Worker 32: sending own steal request to 25 (Channel 0x1333e700)
Worker 13: sending own steal request to 35 (Channel 0x1333f100)
Worker  0: scheduling task.fn 0x121e1159 (0 pending)
>>> Worker  0 enters barrier <<<
Worker  0: globalsync 1 - task from local deque
Worker  0: running task.fn 0x121e1159
Worker  0: scheduling task.fn 0x121e0a6c (0 pending)
Worker  0: scheduling task.fn 0x121e0a6c (1 pending)
Worker  0: scheduling task.fn 0x121e0a6c (2 pending)
Worker  1: sending own steal request to 35 (Channel 0x1333f100)
Worker 22: sending own steal request to  6 (Channel 0x1333d400)
Worker 19: sending own steal request to 24 (Channel 0x1333e600)
Worker  8: sending own steal request to 14 (Channel 0x1333dc00)
Worker 28: sending own steal request to 22 (Channel 0x1333e400)
Worker 10: sending own steal request to 23 (Channel 0x1333e500)
Worker 12: sending own steal request to  0 (Channel 0x1333ce00)
Worker 35: sending own steal request to 31 (Channel 0x1333ed00)
Worker 17: sending own steal request to 12 (Channel 0x1333da00)
Worker 27: sending own steal request to 35 (Channel 0x1333f100)
Worker 24: sending own steal request to 28 (Channel 0x1333ea00)
Worker  5: sending own steal request to 22 (Channel 0x1333e400)
Worker 23: sending own steal request to 16 (Channel 0x1333de00)
Worker  6: sending own steal request to 25 (Channel 0x1333e700)
Worker 25: sending own steal request to 29 (Channel 0x1333eb00)
Worker  7: sending own steal request to 25 (Channel 0x1333e700)
Worker  3: sending own steal request to 15 (Channel 0x1333dd00)
Worker 21: sending own steal request to 16 (Channel 0x1333de00)
Worker 15: sending own steal request to  8 (Channel 0x1333d600)
Worker 33: sending own steal request to 10 (Channel 0x1333d800)
Worker  2: sending own steal request to  7 (Channel 0x1333d500)
Worker  4: sending own steal request to 11 (Channel 0x1333d900)
Worker 29: sending own steal request to 20 (Channel 0x1333e200)
Worker  0: preparing 1 task(s) for worker 12 with function address 0x121e0a6c
Worker  0: sending 1 tasks (task.fn 0x121e0a6c) to Worker 12
Worker  0: preparing 1 task(s) for worker  5 with function address 0x121e0a6c
Worker 26: sending own steal request to  9 (Channel 0x1333d700)
Worker 12: received a task with function address 0x121e0a6c (Channel 0x34000b60)
Worker 12: running task.fn 0x121e0a6c
Matrix[0, 0] (thread 12)
Matrix[0, 128] (thread 12)
Matrix[0, 256] (thread 12)
Matrix[0, 384] (thread 12)
Matrix[0, 512] (thread 12)
Matrix[0, 640] (thread 12)
Matrix[0, 768] (thread 12)
Matrix[0, 896] (thread 12)
Matrix[0, 1024] (thread 12)
Matrix[0, 1152] (thread 12)
Matrix[0, 1280] (thread 12)
Matrix[0, 1408] (thread 12)
Matrix[0, 1536] (thread 12)
Matrix[0, 1664] (thread 12)
Matrix[0, 1792] (thread 12)
Matrix[0, 1920] (thread 12)
Worker 12: sending own steal request to  0 (Channel 0x1333ce00)
Worker 34: sending own steal request to 35 (Channel 0x1333f100)
Worker  0: sending 1 tasks (task.fn 0x121e0a6c) to Worker  5
Worker  0: preparing 1 task(s) for worker 20 with function address 0x121e0a6c
Worker  0: sending 1 tasks (task.fn 0x121e0a6c) to Worker 20
Worker  5: received a task with function address 0x121e0a6c (Channel 0x54000b60)
Worker  5: running task.fn 0x121e0a6c
Matrix[256, 0] (thread 5)
...

Worker 0 starts working on the outer loop (0x121e1159) and generates multiple inner-loop tasks (0x121e0a6c). When it receives a steal request, it distributes those first. Hence for nested data parallelism we first distribute work from the inner loop, similar to a Parallel Depth-First scheduler. Unfortunately, with a message-passing based approach, that means worker threads have to wait on the producer thread more often, increasing scheduler overhead. If we distributed the outer loop first, even with tasks pending, more threads would be able to produce significant amounts of work, and thieves would fail to find work less often.
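A toy model of this effect (not Weave's actual scheduler): if the root worker hands out inner-loop leaf tasks, it remains the only producer and thieves serialize on it; if it hands out outer-loop iterations, every thief that receives one becomes a producer itself, so the number of work sources can grow geometrically.

```python
# Toy producer-count model: leaf tasks spawn nothing, outer-loop tasks
# turn their recipient into a new producer (best case: one per round).
def producers_after(rounds, distribute_outer):
    producers = 1                 # worker 0
    for _ in range(rounds):
        if distribute_outer:
            producers *= 2        # each producer seeds one new producer
        # inner-first: stolen leaves spawn nothing, producers stays 1
    return producers

print(producers_after(5, distribute_outer=False))  # 1
print(producers_after(5, distribute_outer=True))   # 32
```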

However, for non-nested parallelism where work comes as a single loop, as in the pathological histogram and logsumexp benchmarks (#35), this changes nothing unless we also change the initial task enqueue for parallel loops, which is currently winner-takes-all (i.e. there is no splitting).

mratsim commented 4 years ago

Another interesting log. When a child backs off, the parent will wake it and send it work when it finds some (i.e. a lifeline, see Saraswat, Lifeline-based Global Load Balancing, http://www.cs.columbia.edu/~martha/courses/4130/au12/p201-saraswat.pdf)

This is what happens in this log, the worker even tries to split tasks for its grandchildren but ultimately only send work to its direct child.

Worker  9: has 1 steal requests
Worker  9: found 6 steal requests addressed to its child 19 and grandchildren
Worker  9: 14 steps left (start: 0, current: 256, stop: 2000, stride: 128, 7 thieves)
Worker  9: Sending [1792, 2000) to worker 19
Worker  9: sending 1 tasks (task.fn 0x121e0a6c) to Worker 19
Worker  9: Continuing with [256, 1792)
Worker  9: waking up child 19
Matrix[1024, 128] (thread 9)
Matrix[1024, 256] (thread 9)
Matrix[1024, 384] (thread 9)
Matrix[1024, 512] (thread 9)
Matrix[1024, 640] (thread 9)
Matrix[1024, 768] (thread 9)
Matrix[1024, 896] (thread 9)
Matrix[1024, 1024] (thread 9)
Matrix[1024, 1152] (thread 9)
Matrix[1024, 1280] (thread 9)
Matrix[1024, 1408] (thread 9)
Matrix[1024, 1536] (thread 9)
Matrix[1024, 1664] (thread 9)
Worker 19: received a task with function address 0x121e0a6c (Channel 0x1c000b60)
Worker 19: running task.fn 0x121e0a6c
Matrix[1024, 1792] (thread 19)
Worker  9: sending own steal request to  0 (Channel 0x1333ce00)
Worker 15: sends state passively WAITING to its parent worker 7
Matrix[1024, 1920] (thread 19)
Worker 19: sending own steal request to  9 (Channel 0x1333d700)
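The split decision in the log above can be reproduced with a small model. This is a simplified reconstruction from the log output, not Weave's actual implementation: the victim gives the thief an even share of the remaining steps, taken from the end of the range. With 14 steps left and 7 thieves, that is ceil(14 / (7+1)) = 2 steps, i.e. the range [1792, 2000), keeping [256, 1792).

```python
import math

def adaptive_split(current, stop, stride, thieves):
    """Simplified model of an adaptive loop split (a reconstruction
    from the debug log, not Weave's actual code): give the thief an
    even share of the remaining steps, taken from the end of the range."""
    steps_left = math.ceil((stop - current) / stride)
    give = math.ceil(steps_left / (thieves + 1))  # share per participant
    split_point = current + (steps_left - give) * stride
    return (current, split_point), (split_point, stop)  # (kept, sent)

kept, sent = adaptive_split(current=256, stop=2000, stride=128, thieves=7)
print(kept, sent)  # → (256, 1792) (1792, 2000), matching the log
```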
mratsim commented 4 years ago

Very in-depth analysis of MKL, MKL-DNN, Eigen and parallelism and threadpool in the context of deep-learning workload: https://arxiv.org/pdf/1908.04705.pdf

mratsim commented 4 years ago

After #83, Weave achieves 2.53 TFlops.

Some figures to keep a realistic performance target in mind:

Perfect multithreaded scaling on 18 cores would be 2.88 TFlops; reaching over 3 TFlops would require significantly improving the single-threaded GEMM kernel, which is out of scope for Weave and should be tackled in Laser. Perfect scaling is probably not achievable because hyperthreading thrashes the cache and TLB of sibling threads, but matching Laser's OpenMP performance of 2.7 TFlops is a nice target.

The theoretical maximum is computed by multiplying:

| Component | Multiplier | Cumulated | Comment |
| --- | --- | --- | --- |
| CPU frequency | 3.5 billion | 3.5 GFlops | Overclocked i9-9980XE with 3.5 GHz AVX512 all-core turbo |
| FMA (fused multiply-add) per cycle | 2 | 7.0 GFlops | i9 and Xeon Gold/Platinum have 2x AVX512 FMA units per core, while Xeon Bronze and Silver only have 1x |
| Instructions per FMA | 2 | 14.0 GFlops | 1 FMA is 1 add + 1 mul |
| Vector width | 16 | 224.0 GFlops | AVX512 SIMD processes 16 float32 values at a time |
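The cumulated column can be checked with a quick computation; multiplying the per-core peak by the i9-9980XE's 18 cores gives the 4032 GFLOP/s multi-core peak reported in the benchmark headers:

```python
freq_ghz = 3.5        # AVX512 all-core turbo frequency
fma_units = 2         # 2x AVX512 FMA units per core (i9 / Xeon Gold+)
flops_per_fma = 2     # 1 FMA = 1 add + 1 mul
vector_width = 16     # float32 lanes per AVX512 vector
cores = 18            # i9-9980XE core count

peak_single = freq_ghz * fma_units * flops_per_fma * vector_width  # GFLOP/s
peak_multi = peak_single * cores

print(peak_single, peak_multi)  # → 224.0 4032.0
```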
mratsim commented 4 years ago

It seems that without the Backoff mechanism, we actually do reach the 2.7 TFlops:

$  ./build/weave_master_no_backoff 

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 300 samples in 2002 ms
Average time: 5.277 ms
Stddev  time: 0.954 ms
Min     time: 5.000 ms
Max     time: 15.000 ms
Perf:         2682.712 GFLOP/s
mratsim commented 4 years ago

Closed by #94

We can reach over 2.8 TFlops from a 160~170 GFlops single-threaded baseline (17.5x speedup), while MKL reaches 3.4 TFlops from a 210 GFlops baseline (16.2x), OpenBLAS 2.8 TFlops from a 193 GFlops baseline (14.5x), and MKL-DNN 3.0 TFlops from a 200 GFlops baseline (15x speedup).
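The speedup figures are simply multithreaded throughput divided by the single-threaded baseline (using 160 GFlops for Weave's baseline):

```python
results = {
    # backend: (multithreaded GFLOP/s, single-threaded baseline GFLOP/s)
    "Weave":    (2800, 160),
    "MKL":      (3400, 210),
    "OpenBLAS": (2800, 193),
    "MKL-DNN":  (3000, 200),
}

for name, (multi, single) in results.items():
    print(f"{name}: {multi / single:.1f}x speedup")
```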

So the issue on the multithreading front and the performance gap versus Laser/OpenMP are solved, and even with the slower kernel, performance is very good on high-core-count machines.

Now there is still an issue with nesting. The new code still uses syncRoot, which prevents nesting in another parallel region (like a batched matmul). A simple nestable kernel would await/sync both the outer and inner loops, but it seems that this leads to load imbalance, as the performance reached is at most 1.8 TFlops.

Maybe the kernel could be redesigned to take full advantage of the dataflow approach, but that is out of scope for this issue. https://github.com/mratsim/weave/blob/85b90b3f95c75615629aeea34254c47c0de643f8/benchmarks/matmul_gemm_blas/gemm_pure_nim/gemm_weave.nim#L154-L188
