Build more comprehensive tests for the default pipeline for matmuls.
These additional matmuls should cover more sizes (M, N, K > 1e6, M mod 2 != 0, etc.) and all variants of transposition.
See issues transpose matmul and odd matmuls
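As a rough illustration of what "more sizes and all variants of transposition" could look like as a test matrix, here is a minimal sketch. The function name `matmul_test_cases` and the shapes listed are hypothetical and not taken from the existing test scripts.

```python
from itertools import product


def matmul_test_cases(sizes, transposes=(False, True)):
    """Yield (M, N, K, transpose_a, transpose_b) for every combination."""
    for (m, n, k), ta, tb in product(sizes, transposes, transposes):
        yield m, n, k, ta, tb


# Hypothetical shapes: one even, one with odd dimensions, one non-square.
sizes = [(64, 64, 64), (63, 65, 31), (128, 32, 257)]
cases = list(matmul_test_cases(sizes))

for case in cases:
    print(case)
```

Each shape is paired with all four transposition variants, so the list of shapes can grow without the transposition coverage being forgotten.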
Bonus points: currently the tests with end-to-end matmuls (in run_matmul_test.sh and cpu_comparison) are single-threaded.
The only part that needs to be single-threaded is execution on the device. Compilation (and the baseline 'golden' calculation) can be multi-threaded; the baseline calculation can even be done offline. Compilation can be natively multi-threaded using the approach that Nirvedh recently introduced of packing multiple kernels into a single vmfb, as in this test.
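The split between parallel compilation and serial device execution can be sketched as follows. This is not the multi-kernel vmfb approach mentioned above, just a simpler thread-pool illustration; `compile_kernel` and `run_on_device` are hypothetical placeholders standing in for the real compile and run steps.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_kernel(name):
    # Placeholder (assumption): a real version would invoke the compiler,
    # e.g. via subprocess, and return the path of the produced artifact.
    return f"{name}.vmfb"


def run_on_device(artifact):
    # Placeholder (assumption): a real version would execute on the device.
    return f"ran {artifact}"


def run_tests(names, max_workers=4):
    # Compile all kernels concurrently; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        artifacts = list(pool.map(compile_kernel, names))
    # Execute strictly one at a time, since the device is the shared resource.
    return [run_on_device(artifact) for artifact in artifacts]


results = run_tests(["matmul_64", "matmul_63"])
print(results)
```

Threads are sufficient here if each compile is an external process, because the Python thread only waits on it.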
Another approach worth considering is to compute the cpu_comparison results once and store the correct output in the test directory. For large 'golden' baselines (for example, a 1000x1000x1000 matmul has 1e6 output values), store a projection (or subset) of the values rather than all 1e6.
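One way to store a subset rather than the full baseline is to sample output positions with a fixed seed, so the same positions can be regenerated at check time. The sketch below assumes plain nested lists for the matrix; the helper names, seed, and tolerance are hypothetical choices for illustration.

```python
import random


def sample_indices(rows, cols, count, seed=0):
    """Return `count` reproducible (row, col) positions for a rows x cols matrix."""
    rng = random.Random(seed)
    return [(rng.randrange(rows), rng.randrange(cols)) for _ in range(count)]


def golden_subset(matrix, indices):
    """Extract the values to store as the reduced 'golden' baseline."""
    return [matrix[r][c] for r, c in indices]


def matches_golden(matrix, indices, golden, tol=1e-6):
    """Check a freshly computed matrix against the stored subset."""
    return all(abs(matrix[r][c] - g) <= tol for (r, c), g in zip(indices, golden))


# Small stand-in for a reference output computed offline.
reference = [[float(4 * r + c) for c in range(4)] for r in range(4)]
indices = sample_indices(4, 4, count=5, seed=0)
golden = golden_subset(reference, indices)

print(matches_golden(reference, indices, golden))
```

A sampled subset only detects errors at the sampled positions. A projection, such as multiplying the output by a fixed vector, touches every element at the cost of a looser tolerance, so the choice depends on which failure modes matter most.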