More comprehensive matmul testing

Build more comprehensive tests for the default pipeline for matmuls.

These additional matmul tests should cover more sizes (M, N, K > 1e6, M mod 2 != 0, etc.) and all variants of transposition.
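As a rough sketch of how such a sweep could be enumerated, the snippet below crosses a list of shapes with all four transposition variants and computes a NumPy reference for each. The shape list, the function names, and the convention that A is M x K and B is K x N are assumptions made for illustration; nothing here is taken from the existing test scripts.

```python
import itertools
import numpy as np

# Hypothetical shape list (m, n, k): mixes even, odd, and degenerate sizes.
SHAPES = [(8, 8, 8), (7, 16, 32), (33, 5, 64), (128, 256, 1)]


def matmul_cases(shapes=SHAPES):
    """Yield (m, n, k, transpose_a, transpose_b) for every shape and variant."""
    variants = itertools.product([False, True], repeat=2)
    for (m, n, k), (ta, tb) in itertools.product(shapes, variants):
        yield m, n, k, ta, tb


def make_operands(m, n, k, transpose_a, transpose_b, seed=0):
    """Build integer operands stored in their (possibly transposed) layout."""
    rng = np.random.default_rng(seed)
    a_shape = (k, m) if transpose_a else (m, k)
    b_shape = (n, k) if transpose_b else (k, n)
    a = rng.integers(-4, 5, size=a_shape).astype(np.int32)
    b = rng.integers(-4, 5, size=b_shape).astype(np.int32)
    return a, b


def reference(a, b, transpose_a, transpose_b):
    """NumPy reference: undo the stored transposition, then multiply."""
    lhs = a.T if transpose_a else a
    rhs = b.T if transpose_b else b
    return lhs @ rhs
```

Each yielded tuple could then be turned into one compiled test case; very large sizes would presumably need a separate, slower tier.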

See the issues "transpose matmul" and "odd matmuls".

Bonus points: currently the tests with end-to-end matmuls (in run_matmul_test.sh and cpu_comparison) are single-threaded.

The only part that needs to be single-threaded is the execution on the device. Compilation (and the baseline 'golden' calculation) can be multi-threaded, and the baseline calculation could even be done offline. Compilation can be natively multi-threaded using the approach that Nirvedh recently introduced of packing multiple kernels into a single vmfb, as in this test.
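A minimal sketch of that split, assuming the harness were driven from Python: compilation and golden computation are submitted to a thread pool, while device execution stays serial on the calling thread. `compile_fn`, `golden_fn`, and `run_on_device_fn` are hypothetical callbacks standing in for whatever the scripts actually invoke, and this does not use the multi-kernel vmfb packing mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tests(cases, compile_fn, golden_fn, run_on_device_fn, max_workers=4):
    """Compile and compute baselines in parallel; execute on the device serially.

    All three callbacks are hypothetical placeholders supplied by the caller.
    Returns a list of (case, passed) pairs in the order of `cases`.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # These two stages do not touch the device, so they can overlap freely.
        compiled = [pool.submit(compile_fn, case) for case in cases]
        golden = [pool.submit(golden_fn, case) for case in cases]

        # Device execution happens here, on one thread, one case at a time.
        for case, compiled_future, golden_future in zip(cases, compiled, golden):
            artifact = compiled_future.result()
            expected = golden_future.result()
            actual = run_on_device_fn(artifact)
            results.append((case, actual == expected))
    return results
```

If compilation shells out to an external compiler process, threads are probably sufficient; a process pool would be the alternative if the work were CPU-bound Python.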

Another approach worth considering is to compute the cpu_comparison results once and store the correct output in the test directory. For large 'golden' baselines (for example, a 1000x1000x1000 matmul has 1e6 output values), store a projection (or subset) of the values rather than all 1e6.
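One possible form of such a projection, sketched under the assumption of integer matmuls so that comparisons can be exact: multiply the (M, N) output by a few fixed, seeded +/-1 vectors and store only the resulting (M, num_probes) array. The function names and the choice of random sign vectors are illustrative, not something the issue prescribes; floating-point outputs would need a tolerance-based comparison instead.

```python
import numpy as np


def project(output, num_probes=4, seed=0):
    """Reduce an (M, N) integer output to (M, num_probes).

    The probe vectors are regenerated from `seed`, so only the projection
    itself needs to be stored alongside the test.
    """
    rng = np.random.default_rng(seed)
    signs = np.array([-1, 1], dtype=np.int64)
    probes = rng.choice(signs, size=(output.shape[1], num_probes))
    return output.astype(np.int64) @ probes


def matches_golden(output, stored_projection, num_probes=4, seed=0):
    """Compare a fresh device output against the stored projection."""
    return np.array_equal(project(output, num_probes, seed), stored_projection)
```

With `num_probes=4`, a 1000x1000 output reduces to 4000 stored values. A projection like this can in principle miss errors that happen to cancel across a row, so it trades some coverage for storage.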

nod-ai / iree-amd-aie

More comprehensive matmul testing #432