TimothyGu closed 2 years ago
Addressed all the comments. PTAL
Okay, this should now work in a distributed setting. Will be testing it on sapling in the coming days. I also made sure to set the number of TBLIS threads to omp_get_max_threads(), as suggested by the cuNumeric folks, so now it should autoscale with -ll:othr and no longer hang.
Rebased.
Here are some final benchmarks (using rank-per-socket except for the single node case):
$ bin/chemTest -n 70 -tblis -gx 2 -gy 1 -ll:ocpu 2 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
Execution time: 463 ms.
Execution time: 437 ms.
Execution time: 428 ms.
Execution time: 429 ms.
Execution time: 427 ms.
Execution time: 433 ms.
Execution time: 433 ms.
Execution time: 429 ms.
Execution time: 431 ms.
Execution time: 432 ms.
$ mpirun -H c0001:2,c0002:2 --bind-to socket bin/chemTest -n 83 -tblis -gx 2 -gy 2 -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
[3 - 7f73eda98d00] 0.000185 {4}{threads}: reservation ('dedicated worker (generic) #2') cannot be satisfied
Execution time: 897 ms.
Execution time: 731 ms.
Execution time: 697 ms.
Execution time: 684 ms.
Execution time: 700 ms.
Execution time: 739 ms.
Execution time: 717 ms.
Execution time: 690 ms.
Execution time: 725 ms.
Execution time: 693 ms.
$ mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket bin/chemTest -n 99 -tblis -gx 4 -gy 2 -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
[5 - 7fcbca44ed00] 0.000215 {4}{threads}: reservation ('dedicated worker (generic) #2') cannot be satisfied
Execution time: 1147 ms.
Execution time: 1049 ms.
Execution time: 1004 ms.
Execution time: 1048 ms.
Execution time: 1024 ms.
Execution time: 1022 ms.
Execution time: 1031 ms.
Execution time: 1104 ms.
Execution time: 992 ms.
Execution time: 1058 ms.
This is much faster than CTF in all cases (almost 2× improvement for the 4-node case):
$ mpirun -H c0002:20 env LD_LIBRARY_PATH='/scratch2/tigu/taco/ctf/scalapack/build/lib:/scratch2/tigu/taco/deps/openblas/lib' ctf/bin/chemtest -n 70
Execution time: 905 ms.
Execution time: 784 ms.
Execution time: 745 ms.
Execution time: 745 ms.
Execution time: 739 ms.
Execution time: 731 ms.
Execution time: 744 ms.
Execution time: 745 ms.
Execution time: 730 ms.
Execution time: 728 ms.
$ mpirun -H c0001:20,c0002:20 env LD_LIBRARY_PATH='/scratch2/tigu/taco/ctf/scalapack/build/lib:/scratch2/tigu/taco/deps/openblas/lib' ctf/bin/chemtest -n 83
Execution time: 968 ms.
Execution time: 926 ms.
Execution time: 984 ms.
Execution time: 916 ms.
Execution time: 933 ms.
Execution time: 985 ms.
Execution time: 952 ms.
Execution time: 887 ms.
Execution time: 970 ms.
Execution time: 933 ms.
$ mpirun -H c0001:20,c0002:20,c0003:20,c0004:20 env LD_LIBRARY_PATH='/scratch2/tigu/taco/ctf/scalapack/build/lib:/scratch2/tigu/taco/deps/openblas/lib' ctf/bin/chemtest -n 99
Execution time: 1897 ms.
Execution time: 2058 ms.
Execution time: 1981 ms.
Execution time: 2038 ms.
Execution time: 2019 ms.
Execution time: 2148 ms.
Execution time: 2066 ms.
Execution time: 2046 ms.
Execution time: 2085 ms.
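For a quick sanity check of the comparison, here is a small sketch that takes the median of each run above and computes the CTF-to-this-PR ratio per node count. The timings are copied from the logs as posted (the 4-node CTF run lists nine samples); the variable names and the choice of median rather than mean are just illustrative assumptions, picked so the slower first iteration does not skew the result.

```python
from statistics import median

# Execution times in ms, copied from the logs above, keyed by node count.
this_pr = {
    1: [463, 437, 428, 429, 427, 433, 433, 429, 431, 432],
    2: [897, 731, 697, 684, 700, 739, 717, 690, 725, 693],
    4: [1147, 1049, 1004, 1048, 1024, 1022, 1031, 1104, 992, 1058],
}
ctf = {
    1: [905, 784, 745, 745, 739, 731, 744, 745, 730, 728],
    2: [968, 926, 984, 916, 933, 985, 952, 887, 970, 933],
    4: [1897, 2058, 1981, 2038, 2019, 2148, 2066, 2046, 2085],  # nine samples posted
}

# Ratio > 1 means this PR is faster than CTF for that node count.
speedups = {
    nodes: median(ctf[nodes]) / median(this_pr[nodes])
    for nodes in this_pr
}

for nodes, ratio in speedups.items():
    print(f"{nodes} node(s): {ratio:.2f}x")
```

By this measure the ratio comes out at roughly 1.7x on one node, 1.3x on two nodes, and just under 2x on four nodes.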
Looks good, great work Tim!
The TBLIS leaf call transform essentially transforms any binary contraction of the form:
to a call to tblis::mult. The tblis::mult function takes an einsum string as input and does the contraction in an optimal manner, so we don't have to explicitly figure out the sequence of GEMMs and reductions.
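To illustrate what "driven by an einsum string" means, here is a naive reference sketch in plain Python. It is not TBLIS and not the transform in this PR: `contract`, `shape_of`, and `get` are hypothetical helpers, and the loops are deliberately unoptimized. The point is only that the index labels alone determine which indices are kept and which are summed, so the caller never spells out a GEMM or a reduction.

```python
from itertools import product

def shape_of(t):
    """Shape of a dense tensor stored as nested lists."""
    shape = []
    while isinstance(t, list):
        shape.append(len(t))
        t = t[0]
    return shape

def get(t, idx):
    """Read one element of a nested-list tensor at position idx."""
    for i in idx:
        t = t[i]
    return t

def contract(a, idx_a, b, idx_b, idx_c):
    """Naive binary contraction: C[idx_c] = sum of A[idx_a] * B[idx_b].

    Every label that does not appear in idx_c is summed over.
    Returns a dict mapping output index tuples to values.
    """
    extent = {}
    for t, labels in ((a, idx_a), (b, idx_b)):
        for name, n in zip(labels, shape_of(t)):
            # A label shared by both operands must have the same extent.
            assert extent.setdefault(name, n) == n, f"extent mismatch on {name}"
    summed = [name for name in extent if name not in idx_c]

    out = {}
    for out_pos in product(*(range(extent[n]) for n in idx_c)):
        env = dict(zip(idx_c, out_pos))
        total = 0
        for sum_pos in product(*(range(extent[n]) for n in summed)):
            env.update(zip(summed, sum_pos))
            total += (get(a, [env[n] for n in idx_a])
                      * get(b, [env[n] for n in idx_b]))
        out[out_pos] = total
    return out

# Matrix multiply expressed purely through index labels.
A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = contract(A, "ik", B, "kj", "ij")
print(C)
```

Changing only the label strings changes the operation: with an empty output string, for example, the same call reduces over every index. A library like TBLIS takes the same kind of description and picks an efficient execution strategy internally.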