rohany / taco

The Tensor Algebra Compiler (taco) computes sparse tensor expressions on CPUs and GPUs
http://tensor-compiler.org

Add a leaf call transform for TBLIS #119

Closed TimothyGu closed 2 years ago

TimothyGu commented 2 years ago

The TBLIS leaf call transform essentially transforms any binary contraction of the form:

ForAll(i, ForAll(j, ForAll(k, … ForAll(z, A[idx1] = B[idx2] * C[idx3]) … )))

to a call to tblis::mult. The tblis::mult function takes an einsum string as input and performs the contraction in an optimized manner, so we don't need to explicitly figure out the sequence of GEMMs and reductions.
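For illustration, here is a minimal sketch of how per-tensor index strings of the kind an einsum-style interface expects could be derived from the index variables of A, B, and C. This is not the code from this PR: `ContractionLabels` and `makeLabels` are hypothetical names, and the one-letter-per-variable scheme is an assumption made for the example.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type (not from the PR): one label string per tensor
// in A[idx1] = B[idx2] * C[idx3].
struct ContractionLabels {
  std::string a;
  std::string b;
  std::string c;
};

// Hypothetical helper (not from the PR): assigns one lowercase letter to each
// distinct index variable, in order of first appearance, and returns the
// resulting label strings for A, B, and C.
ContractionLabels makeLabels(const std::vector<std::string>& aVars,
                             const std::vector<std::string>& bVars,
                             const std::vector<std::string>& cVars) {
  std::map<std::string, char> letters;

  auto label = [&letters](const std::vector<std::string>& vars) {
    std::string out;
    for (const auto& v : vars) {
      auto it = letters.find(v);
      if (it == letters.end()) {
        // Assumption for this sketch: at most 26 distinct index variables.
        if (letters.size() >= 26) {
          throw std::runtime_error("too many distinct index variables");
        }
        const char next = static_cast<char>('a' + letters.size());
        it = letters.emplace(v, next).first;
      }
      out.push_back(it->second);
    }
    return out;
  };

  ContractionLabels result;
  result.a = label(aVars);
  result.b = label(bVars);
  result.c = label(cVars);
  return result;
}
```

With this sketch, a matrix multiply A[i,j] = B[i,k] * C[k,j] yields the labels "ab", "ac", and "cb". The letter shared only by the two inputs ("c") marks the contracted dimension.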

TimothyGu commented 2 years ago

Addressed all the comments. PTAL

TimothyGu commented 2 years ago

Okay, this should now work in a distributed setting. Will be testing it on sapling in the coming days. I also made sure to set the number of TBLIS threads to omp_get_max_threads() as suggested by the cuNumeric folks, so now it should autoscale with -ll:othr and no longer hang.

TimothyGu commented 2 years ago

Rebased.

Here are some final benchmarks (using rank-per-socket except for the single-node case):

$ bin/chemTest -n 70 -tblis -gx 2 -gy 1 -ll:ocpu 2 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
Execution time: 463 ms.
Execution time: 437 ms.
Execution time: 428 ms.
Execution time: 429 ms.
Execution time: 427 ms.
Execution time: 433 ms.
Execution time: 433 ms.
Execution time: 429 ms.
Execution time: 431 ms.
Execution time: 432 ms.
$ mpirun -H c0001:2,c0002:2 --bind-to socket bin/chemTest -n 83 -tblis -gx 2 -gy 2 -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
[3 - 7f73eda98d00]    0.000185 {4}{threads}: reservation ('dedicated worker (generic) #2') cannot be satisfied
Execution time: 897 ms.
Execution time: 731 ms.
Execution time: 697 ms.
Execution time: 684 ms.
Execution time: 700 ms.
Execution time: 739 ms.
Execution time: 717 ms.
Execution time: 690 ms.
Execution time: 725 ms.
Execution time: 693 ms.
$ mpirun -H c0001:2,c0002:2,c0003:2,c0004:2 --bind-to socket bin/chemTest -n 99 -tblis -gx 4 -gy 2 -ll:ocpu 1 -ll:othr 9 -ll:util 1 -ll:nsize 10G -ll:ncsize 0
[5 - 7fcbca44ed00]    0.000215 {4}{threads}: reservation ('dedicated worker (generic) #2') cannot be satisfied
Execution time: 1147 ms.
Execution time: 1049 ms.
Execution time: 1004 ms.
Execution time: 1048 ms.
Execution time: 1024 ms.
Execution time: 1022 ms.
Execution time: 1031 ms.
Execution time: 1104 ms.
Execution time: 992 ms.
Execution time: 1058 ms.

This is much faster than CTF in all cases (almost 2× improvement for the 4-node case):

$ mpirun -H c0002:20 env LD_LIBRARY_PATH='/scratch2/tigu/taco/ctf/scalapack/build/lib:/scratch2/tigu/taco/deps/openblas/lib' ctf/bin/chemtest -n 70
Execution time: 905 ms.
Execution time: 784 ms.
Execution time: 745 ms.
Execution time: 745 ms.
Execution time: 739 ms.
Execution time: 731 ms.
Execution time: 744 ms.
Execution time: 745 ms.
Execution time: 730 ms.
Execution time: 728 ms.
$ mpirun -H c0001:20,c0002:20 env LD_LIBRARY_PATH='/scratch2/tigu/taco/ctf/scalapack/build/lib:/scratch2/tigu/taco/deps/openblas/lib' ctf/bin/chemtest -n 83
Execution time: 968 ms.
Execution time: 926 ms.
Execution time: 984 ms.
Execution time: 916 ms.
Execution time: 933 ms.
Execution time: 985 ms.
Execution time: 952 ms.
Execution time: 887 ms.
Execution time: 970 ms.
Execution time: 933 ms.
$ mpirun -H c0001:20,c0002:20,c0003:20,c0004:20 env LD_LIBRARY_PATH='/scratch2/tigu/taco/ctf/scalapack/build/lib:/scratch2/tigu/taco/deps/openblas/lib' ctf/bin/chemtest -n 99
Execution time: 1897 ms.
Execution time: 2058 ms.
Execution time: 1981 ms.
Execution time: 2038 ms.
Execution time: 2019 ms.
Execution time: 2148 ms.
Execution time: 2066 ms.
Execution time: 2046 ms.
Execution time: 2085 ms.

rohany commented 2 years ago

Looks good, great work Tim!