Communication-Avoiding 3D Matrix Multiplication
Clone the CA3DMM library from GitHub:
git clone https://github.com/scalable-matrix/CA3DMM.git
Enter directory CA3DMM/src
. CA3DMM provides 4 example make files in CA3DMM/src
:
icc-mkl-impi.make
: Use Intel C compiler, Intel MKL, and Intel MPI libraryicc-mkl-anympi.make
: Use Intel C compiler, Intel MKL, and any MPI libraryicc-mkl-nvcc-impi.make
: Use Intel C compiler, Intel MKL, NVCC compiler for NVIDIA GPU support, and Intel MPI libraryicc-mkl-nvcc-anympi.make
: Use Intel C compiler, Intel MKL, NVCC compiler for NVIDIA GPU support, and any MPI library
We use the icc-mkl-anympi.make
and icc-mkl-nvcc-anympi.make
make files for compiling the CPU and GPU versions on the Georgia Tech PACE-Phoenix cluster. Run the following command to compile the CA3DMM library:
make -f icc-mkl-anympi.make -j
After compilation, the dynamic and static library files are copied to directory CA3DMM/lib
, and the C header files are copied to directory CA3DMM/include
.
Enter directory CA3DMM/examples
to compile the example program. This directory also contains 4 example make files as those in CA3DMM/src
. Run the following command to compile the CA3DMM library:
make -f icc-mkl-anympi.make -j
After compilation, executable CA3DMM/examples/example_AB.exe
is the example program we needed.
For single node execution or launching on clusters without a job scheduling system, the following command should run on most platforms (assuming that you are in directory CA3DMM/examples
):
mpirun -np <nprocs> ./example_AB.exe <M> <N> <K> <transA> <transB> <validation> <ntest> <dev_type>
Where:
To explicitly control MPI + OpenMP hybrid parallelization, you need to specify OpenMP environment variables, and process affinity environment variables for some MPI libraries. In the paper, we use the following environment variables for MPI + OpenMP parallel tests on the Georgia Tech PACE-Phoenix cluster:
OMP_PLACES=cores
OMP_NUM_THREADS=24
MV2_CPU_BINDING_POLICY=hybrid
MV2_THREADS_PER_PROCESS=24
For clusters and supercomputers with job scheduling systems like slurm, you need to write job scripts for launching the example program on multiple nodes.
The example program prints timing results to the screen output. Here is an example running output on a single node using 1 core per MPI process:
$ mpirun -np 24 ./example_AB.exe 8000 8000 8000 0 0 1 10 0
Test problem size m * n * k : 8000 * 8000 * 8000
Transpose A / B : 0 / 0
Number of tests : 10
Check result correctness : 1
Device type : 0
CA3DMM partition info:
Process grid mp * np * kp : 4 * 2 * 3
Work cuboid mb * nb * kb : 2000 * 4000 * 2667
Process utilization : 100.00 %
Comm. volume / lower bound : 1.04
Rank 0 work buffer size : 244.28 MBytes
A, B, C redist : 80 79 77 79 80 77 82 78 78 77
A / B allgather : 16 16 16 16 16 16 16 16 16 16
2D Cannon : 649 649 667 665 664 667 666 666 667 667
C reduce-scatter : 41 42 42 51 43 42 41 42 42 41
matmul only : 706 708 725 733 724 725 724 724 725 725
total execution : 786 786 802 812 803 802 806 802 803 801
================== CA3DMM algorithm engine =================
* Initialization : 4.89 ms
* Number of executions : 10
* Execution time (avg) : 800.46 ms
* Redistribute A, B, C : 78.74 ms
* Allgather A or B : 16.30 ms
* 2D Cannon execution : 662.62 ms
* Reduce-scatter C : 42.79 ms
--------------- 2D Cannon algorithm engine ---------------
* Initialization : 0.04 ms
* Number of executions : 10
* Execution time (avg) : 662.62 ms
* Initial shift : 33.79 ms
* Loop shift wait : 23.21 ms
* Local DGEMM : 605.62 ms
* Per-rank performance : 64.40 GFlops
----------------------------------------------------------
============================================================
CA3DMM output : 0 error(s)
The example program uses 1D column partition for the input A and B matrices and the output C matrix. The "Process grid" line shows the 3D process grid size. The "matmul only" line gives the runtime (in milliseconds) using library-native matrix partitioning.
If you use CA3DMM in your work, please cite our SC22 paper:
@inproceedings{huang2022ca3dmm,
title = { {CA3DMM}: A New Algorithm Based on a Unified View of Parallel Matrix Multiplication },
booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
publisher = {IEEE},
author = {Huang, Hua and Chow, Edmond},
doi = {10.5555/3571885.3571922},
isbn = {978-4-6654-5444-5},
address = {Dallas, TX, USA},
month = {Nov},
year = {2022},
pages = {381--395},
}
Thank you very much for using our code!