lixiaolx opened 1 year ago
Hi, thank you for your interest.
All the results were obtained on a 16-GPU DGX-2 system, so they might not be reproducible on 8 GPUs.
I got the results with NCHANNELS set to 24 and NCCL_BUFFSIZE set to 750KB or 1.5MB.
@abhijangda hello,
Why do the kernel time and the allreduce time take longer with overlap than the original single kernels without overlapping?
Why does performance get worse when using overlap? Can this happen in theory? My understanding is that the worst case should match the original non-overlapped time, right?
Can you print the table of results you get for all NCCL_MIN_NCHANNELS, NCCL_MAX_NCHANNELS (MAX and MIN NCHANNELS should be the same) and NCCL_BUFFSIZE values here? It stalls when NCCL_MAX_NCHANNELS * NCCL_BUFFSIZE * NUM_GPUs < the matrix size. So, for some configurations it might stall for larger sizes but not for smaller ones. For a 750KB BUFFSIZE and 24 CHANNELS, it should not stall for the smallest size because 24 * 750 * 1024 * 8 >= 8 * 1024 * 3072.
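As a quick sanity check, here is a minimal sketch of that condition using the same chunk-size arithmetic as getChunkBlocks in the sample code below (NCCL_STEPS = 8 and ALLREDUCE_CHUNKSTEPS = 4 are assumed stock NCCL values; the helper name is hypothetical):

#include <cstdio>

// Hypothetical helper: returns true when one allreduce loop iteration covers the
// whole output matrix, i.e. the configuration should not stall per the condition above.
bool coversMatrixInOneLoop(long nChannels, long buffSizeBytes, long nranks,
                           long M, long N, long elemBytes) {
  const long NCCL_STEPS = 8;           // assumption: stock NCCL value
  const long ALLREDUCE_CHUNKSTEPS = 4; // assumption: stock NCCL value
  long stepSize  = buffSizeBytes / (elemBytes * NCCL_STEPS); // elements per step
  long chunkSize = stepSize * ALLREDUCE_CHUNKSTEPS;          // elements per chunk
  long loopSize  = nChannels * chunkSize;                    // elements per loop, per rank
  return nranks * loopSize >= M * N;
}

int main() {
  // 24 channels, 750KB buffers, 8 GPUs, half precision, smallest size 8*1024 x 3072:
  printf("%s\n", coversMatrixInOneLoop(24, 750 * 1024, 8, 8 * 1024, 3072, 2)
                     ? "no stall expected" : "may stall");
  return 0;
}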
About the overlap performing worse, I am not sure what is happening. What is your system? Are you using NVLink?
@abhijangda hello, thank you. I can run the MNK tests in the script. Now I want to use the sample script under nccl-overlap to test other MNKs (such as M=64, N=9216, K=1536 or K=4608); do I need to make any changes? When I use the test script directly, adjusting the channel count and buffer size, I encounter two kinds of errors:
one is a hang,
the other is memory-related errors,
like:
free(): corrupted unsorted chunks
free(): corrupted unsorted chunks
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
I have commented on why the code hangs.
Can you share your code, so I can track down the cause of the malloc/free issues?
The command I use is below, and it hangs:
make matmul-allreduce-release && mpirun -np 8 --allow-run-as-root -x LD_LIBRARY_PATH="../build/lib/:/usr/local/cuda/lib64/:$LD_LIBRARY_PATH" -x NCCL_PROTO=Simple -x NCCL_ALGO=Ring -x NCCL_DEBUG=INFO -x NCCL_MIN_NCHANNELS=12 -x NCCL_MAX_NCHANNELS=12 -x NCCL_NTHREADS=512 -x NCCL_BUFFSIZE=49152 ./matmul-allreduce
My question: why is NCCL_MAX_NCHANNELS * NCCL_BUFFSIZE * NUM_GPUs > the matrix size required?
I find my log is (the same line is printed by all 8 ranks):
matrixSize 196608 chunkSize 12288 nranks * loopSize 1179648
but it still hangs; the arithmetic below checks the numbers.
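Working through those numbers from my command (assuming the stock NCCL constants NCCL_STEPS = 8 and ALLREDUCE_CHUNKSTEPS = 4):
stepSize = NCCL_BUFFSIZE / (sizeof(half) * NCCL_STEPS) = 49152 / (2 * 8) = 3072 elements
chunkSize = stepSize * ALLREDUCE_CHUNKSTEPS = 3072 * 4 = 12288 (matches the log)
loopSize = nChannels * chunkSize = 12 * 12288 = 147456
nranks * loopSize = 8 * 147456 = 1179648 (matches the log), and 1179648 > matrixSize 196608
So the stated condition is satisfied, yet the run hangs.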
The code I use is the nccl-overlap sample, with my changes at:
https://github.com/parasailteam/coconet/blob/52175e2b60134d8becd0b4008a888865e589dd0a/nccl-overlap/samples/matmul-allreduce.cu#L1073
When I use other channel counts and buffer sizes, an error occurs. For example:
make matmul-allreduce-release && mpirun -np 4 --allow-run-as-root -x LD_LIBRARY_PATH="../build/lib/:/usr/local/cuda/lib64/:$LD_LIBRARY_PATH" -x NCCL_PROTO=Simple -x NCCL_ALGO=Ring -x NCCL_DEBUG=INFO -x NCCL_MIN_NCHANNELS=3 -x NCCL_MAX_NCHANNELS=3 -x NCCL_NTHREADS=512 -x NCCL_BUFFSIZE=196608
Sorry for the late reply.
My question: why is NCCL_MAX_NCHANNELS * NCCL_BUFFSIZE * NUM_GPUs > the matrix size required?
There is nothing fundamental, but we didn't get time to implement this case, and our best performance was obtained when this condition is met. We are working on a new version that is more robust, but it will take time.
The code I use is the nccl-overlap sample, with my changes at:
Can you provide a minimal working example that I can use to debug your issue, along with the command-line parameters used to invoke it?
@abhijangda
Can you provide a minimal working example that I can use to debug your issue, along with the command-line parameters used to invoke it?
Here is my test code, modified from your nccl-overlap/samples/matmul-allreduce.cu. I change the MNK values as follows (the code below, at lines 215-218). The first and second configurations hang at https://github.com/parasailteam/coconet/blob/main/nccl-overlap/src/collectives/device/all_reduce.h#L54, and I print the logs below; the third and fourth configurations crash with segmentation faults.
// first
int M = 9216;
int N = 64;
int K = 1152;
int chunkCols = 64;
hang log:
rank 0 chunkIndex 9 cy 3456 cx 0 m 384 n 64 numTiles 1 totalNumTiles 9 combinedChunks 3 blockIdx.x 0 realChunkRows 384 tileStatus[chunkIndex/combinedChunks]: 36 (iteration + 1)* totalNumTiles: 9
// second, change the MNK
int M = 64;
int N = 9216;
int K = 4608;
int chunkCols = 1024;
hang log:
rank 0 chunkIndex 9 cy 24 cx 0 m 24 n 1024 numTiles 24 totalNumTiles 24 combinedChunks 3 blockIdx.x 0 realChunkRows 24 tileStatus[chunkIndex/combinedChunks]: 52 (iteration + 1)* totalNumTiles: 24
// third
int M = 8;
int N = 9216;
int K = 4608;
int chunkCols = 1024;
core:
Signal: Segmentation fault (11)
Signal code: Invalid permissions (2)
Failing at address: 0x60000001b
// fourth
int M = 9216;
int N = 8;
int K = 4608;
int chunkCols = 8;
core:
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x45
Run with:
make matmul-allreduce-release && mpirun -np 8 --allow-run-as-root -x LD_LIBRARY_PATH="../build/lib/:/usr/local/cuda/lib64/:$LD_LIBRARY_PATH" -x NCCL_PROTO=Simple -x NCCL_ALGO=Ring -x NCCL_DEBUG=INFO -x NCCL_MIN_NCHANNELS=3 -x NCCL_MAX_NCHANNELS=3 -x NCCL_NTHREADS=512 -x NCCL_BUFFSIZE=1179648 ./matmul-allreduce
#include "header.h"
#include "cutlass-matmul.h"
#include <cuda_profiler_api.h>
#include <map>
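// Baseline path: run the cuBLAS GEMM, then a non-overlapped AllReduce on the same
// stream. The GEMM is timed with CUDA events; the allreduce with host timestamps
// around a stream synchronize.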
void pipe_rowmajorABC(cublasHandle_t handle, const half *alpha, const half *beta, const half* m1, const half* m2, half* m1m2, ncclComm_t comm, cudaStream_t stream, int M, int N, int K, float& allReduceTime, float& cublasTime) {
cudaEvent_t startpipe, stoppipe;
float elapsedTime = 0;
CUDACHECK(cudaEventCreate(&startpipe));
CUDACHECK(cudaEventCreate(&stoppipe));
CUDACHECK(cudaEventRecord(startpipe, stream));
CUBLASCHECK(cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
N, M, K,
alpha,
m2, CUDA_R_16F, N,
m1, CUDA_R_16F, K,
beta,
m1m2, CUDA_R_16F, N,
CUDA_R_16F, CUBLAS_GEMM_DFALT_TENSOR_OP));
CUDACHECK(cudaEventRecord(stoppipe, stream));
CUDACHECK(cudaStreamSynchronize(stream));
CUDACHECK(cudaEventSynchronize(stoppipe));
CUDACHECK(cudaEventElapsedTime(&elapsedTime, startpipe,stoppipe));
cublasTime += elapsedTime;
elapsedTime = 0;
double t1 = getCurrentTime();
NCCLCHECK(ncclAllReduceMatrix(m1m2, M*N, M, N, N, ncclHalf, ncclSum, comm, stream));
CUDACHECK(cudaStreamSynchronize(stream));
double t2 = getCurrentTime();
allReduceTime += (t2-t1)*1000.0f;
}
bool mpiRef(const float* m1, const float* m2, float* m1m2, int M, int N, int K, int comm_size, int rank = -1)
{
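// All inputs are initialized to 1.0 (see main), so each GEMM output element
// equals K, and the allreduce sums it over comm_size ranks: expected = K * comm_size.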
for (size_t i0 = 0; i0 < M*N; i0++) {
float ref = K*comm_size;
if (!eqFloat(ref, m1m2[i0])) {
printf("rankk %d Mismatch at %ld : ref '%f', computed '%f'\n",rank, i0, ref, m1m2[i0]);
return false;
}
}
return true;
}
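// Recomputes, per rank position in the ring and per channel, the
// (startRow, startCol, rows, cols) rectangle of the output matrix that each NCCL
// chunk corresponds to; this appears to mirror the chunk-size arithmetic in the
// modified NCCL allreduce kernels.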
template<typename T>
std::vector<std::vector<std::tuple<int, int, int, int>>> getChunkBlocks(int rank, size_t matrixSize, int nranks, int* rings, int MATMUL_M, int MATMUL_N,
const int realChunkCols, int& maxRealChunkRows) {
std::vector<std::vector<std::tuple<int, int, int, int>>> chunkBlocks;
assert (atoi(getenv ("NCCL_MIN_NCHANNELS")) == atoi(getenv ("NCCL_MAX_NCHANNELS")));
int nChannels = atoi(getenv ("NCCL_MIN_NCHANNELS"));
int nThreads = atoi(getenv("NCCL_NTHREADS"));
int channelBuffSize = atoi(getenv("NCCL_BUFFSIZE"));
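// stepSize and chunkSize are in elements; NCCL_STEPS (8) and ALLREDUCE_CHUNKSTEPS (4)
// are the stock NCCL constants, judging by the logged chunk sizes. loopSize is the
// element count that all channels of one rank cover in a single loop iteration.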
const int stepSize = channelBuffSize / (sizeof(T)*NCCL_STEPS);
const size_t chunkSize = stepSize * ALLREDUCE_CHUNKSTEPS;
const ssize_t loopSize = nChannels*(ssize_t)chunkSize;
maxRealChunkRows = 0;
printf("matrixSize %d chunkSize %d nranks * loopSize %d\n", matrixSize, chunkSize, nranks * loopSize);
for (int userRank = nranks - 1; userRank >= 0; userRank--) {
chunkBlocks.push_back(std::vector<std::tuple<int, int, int, int>>());
int combinedRanks = 1;
for (int channel = 0; channel < nChannels; channel++) {
//TODO: the following loop only runs once right now.
for (size_t gridOffset = 0; gridOffset < matrixSize; gridOffset += nranks * loopSize) {
size_t realChunkSize = min(chunkSize, DIVUP(matrixSize-gridOffset,nranks*nChannels));
if (matrixSize % 3 == 0 && MATMUL_N != 12288) {
ALIGN_SIZE(realChunkSize, nThreads*sizeof(uint64_t)/sizeof(T) * 3);
} else if (matrixSize % 12288 == 0) {
ALIGN_SIZE(realChunkSize, nThreads*sizeof(uint64_t)/sizeof(T) * 12);
} else {
ALIGN_SIZE(realChunkSize, nThreads*sizeof(uint64_t)/sizeof(T));
}
const int realChunkRows = realChunkSize/realChunkCols;
const int gridOffsetStartRow = gridOffset / MATMUL_N;
maxRealChunkRows = std::max (maxRealChunkRows, realChunkRows);
int chunkIdx = rings[channel*nranks + userRank] * nChannels + channel;
int chunkStartRow = gridOffsetStartRow + chunkIdx / (MATMUL_N / realChunkCols) * realChunkRows;
int chunkStartCol = chunkIdx % (MATMUL_N / realChunkCols) * realChunkCols;
int nelem = min(realChunkSize, (matrixSize - (chunkStartRow * MATMUL_N + (MATMUL_M - chunkStartRow) * (MATMUL_N - (MATMUL_N - chunkStartCol)))));
int chunkRows = min(min(nelem/realChunkCols, realChunkRows), MATMUL_M - chunkStartRow);
int chunkCols = realChunkCols;
nelem = chunkCols * chunkRows;
chunkBlocks[chunkBlocks.size() - 1].push_back(std::make_tuple(chunkStartRow, chunkStartCol, chunkRows, chunkCols));
}
}
}
return chunkBlocks;
}
#define MAX_CHANNELS 80
int main(int argc, char** argv){
const int N_GPUS = 16;
MPI_Init(&argc, &argv);
int comm_size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
ncclComm_t comm;
CUDACHECK(cudaSetDevice(rank % N_GPUS));
//initializing NCCL
ncclUniqueId id;
if (rank == 0) ncclGetUniqueId(&id);
MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
ncclCommInitRank(&comm, comm_size, id, rank);
int ringLength;
int nChannels;
int* rings = new int[MAX_CHANNELS * comm_size];
getNCCLRing(&comm, rings, ringLength, nChannels);
for (int _rank = 0; _rank < comm_size; _rank++) {
if (_rank != rank) continue;
std::cout << "rank: " << rank << ":";
for (int i = 0; i < ringLength; i++) {
std::cout << rings[i] << "->";
}
std::cout << std::endl;
MPI_Barrier(MPI_COMM_WORLD);
}
// std::cout << "ncclChannel buffsize " << comm.channels[0] << std::endl;
int epochs = 10;
cudaStream_t stream;
int leastStreamPriority = 0, highestStreamPriority = 0;
CUDACHECK(cudaDeviceGetStreamPriorityRange(&leastStreamPriority, &highestStreamPriority));
printf("highestStreamPriority %d\n", highestStreamPriority);
cudaStreamCreateWithPriority(&stream, cudaStreamDefault, highestStreamPriority);
cudaStream_t cutlassStream;
cudaStreamCreateWithPriority(&cutlassStream, cudaStreamDefault, leastStreamPriority);
cublasHandle_t handle;
CUBLASCHECK(cublasCreate(&handle));
CUBLASCHECK(cublasSetStream(handle, stream));
CUBLASCHECK(cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH));
cublasHandle_t handleWithCutlassStream;
CUBLASCHECK(cublasCreate(&handleWithCutlassStream));
CUBLASCHECK(cublasSetStream(handleWithCutlassStream, cutlassStream));
CUBLASCHECK(cublasSetMathMode(handleWithCutlassStream, CUBLAS_TENSOR_OP_MATH));
half* dAlpha, *dBeta;
half alpha = __float2half(1.0);
CUDACHECK(cudaMalloc(&dAlpha, sizeof(half)));
CUDACHECK(cudaMemcpy(dAlpha, &alpha, sizeof(half), cudaMemcpyHostToDevice));
CUDACHECK(cudaMalloc(&dBeta, sizeof(half)));
half beta = __float2half(0);
CUDACHECK(cudaMemcpy(dBeta, &beta, sizeof(half), cudaMemcpyHostToDevice));
CUBLASCHECK(cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE));
CUBLASCHECK(cublasSetPointerMode(handleWithCutlassStream, CUBLAS_POINTER_MODE_DEVICE));
MPI_Barrier(MPI_COMM_WORLD);
nChannels = atoi(getenv ("NCCL_MIN_NCHANNELS"));
#define GPT2_PARAMS
#ifdef GPT2_PARAMS
int SEQUENCE_LENGTH = 1024;
// int MODEL_PARALLEL_GPUS[] = {1, 2, 4, 8, 16};
// float MODEL_PARAMS[] = {0.345, 1.2, 2.5, 4.2, 8.3};
int BATCH_SIZE[] = {8, 16, 32, 64};
// int BATCH_SIZE[] = {32, 64, 512, 1024, 2048};
int HIDDEN_DIMENSIONS[] = {/*345M Model*/ 4096, /*1.2B Model is 1536*/ 4096, /*2.5B Model is 1920*/ 4096,
/*4.2B is 2304*/ 4096};
int HIDDEN_DIMENSIONS_12CHANNELS[] = {3072, /*345M Model*/ 3072, /*1.2B Model is 1536*/ 3072, /*2.5B Model is 1920*/ 3072,
/*4.2B is 2304*/ 3072};
int MODEL_PARALLEL_GPUS[] = {16, 16, 16, 16};
float MODEL_PARAMS[] = {8.3, 8.3, 8.3, 8.3, 8.3};
#else
int SEQUENCE_LENGTH = 2048;
// int MODEL_PARALLEL_GPUS[] = {1, 2, 4, 8, 16};
// float MODEL_PARAMS[] = {0.345, 1.2, 2.5, 4.2, 8.3};
int BATCH_SIZE[] = {1, 2, 4, 6};
// int BATCH_SIZE[] = {32, 64, 512, 1024, 2048};
int HIDDEN_DIMENSIONS[] = {/*345M Model*/ 12288, /*1.2B Model is 1536*/ 12288, /*2.5B Model is 1920*/ 12288, 12288};
int HIDDEN_DIMENSIONS_12CHANNELS[] = {/*345M Model*/ 12288, /*1.2B Model is 1536*/ 12288, /*2.5B Model is 1920*/ 12288, 12288};
int MODEL_PARALLEL_GPUS[] = {16, 16, 16, 16};
float MODEL_PARAMS[] = {137, 137, 137, 137};
#endif
//sizeof(HIDDEN_DIMENSIONS)/sizeof(HIDDEN_DIMENSIONS[0])
for (int model = 0; model < 1; model++) {
for (int matMulType = 1; matMulType < 2; matMulType++) {
// int M = BATCH_SIZE[model] * SEQUENCE_LENGTH;
// int N = (nChannels%3 == 0) ? HIDDEN_DIMENSIONS_12CHANNELS[model] : HIDDEN_DIMENSIONS[model];
// int K = N/MODEL_PARALLEL_GPUS[model] * ((matMulType == 0) ? 1 : 4);
int M = 9216;
int N = 8;
int K = 4608;
int chunkCols = 8;
if (rank == 0)
printf("Model Size %.2f B Params , MatMul: [%d, %d] X [%d, %d]\n", MODEL_PARAMS[model], M, K, K, N);
// if (comm_size != MODEL_PARALLEL_GPUS[model])
// continue;
// Inputs
half* m1;
CUDACHECK(cudaMalloc(&m1, M*K * sizeof(half)));
// cudaMemRandInt(m1, M*K);
memset_value(m1, __float2half(1.0f), M*K);
half* m2;
CUDACHECK(cudaMalloc(&m2, K*N * sizeof(half)));
// cudaMemRandInt(m2, K*N);
memset_value(m2, __float2half(1.0f), K*N);
half* m1m2;
CUDACHECK(cudaMalloc(&m1m2, M*N* sizeof(half)));
half* _m1m2;
CUDACHECK(cudaMalloc(&_m1m2, M*N* sizeof(half)));
half* __m1m2;
CUDACHECK(cudaMalloc(&__m1m2, M*N* sizeof(half)));
MPI_Barrier(MPI_COMM_WORLD);
float totalTime = 0;
float cublasTime = 0;
float allReduceTime = 0;
float matmulTime = 0;
#define CUBLAS_BASELINE
#define CUSTOM_BASELINE
#ifdef CUBLAS_BASELINE
for(int iter = 0; iter < 110; iter++) {
if (rank == 0 and iter % 20 == 0)
printf("iter %d\n", iter);
cudaEvent_t startpipe, stoppipe;
float elapsedTimepipe;
float __allReduceTime = 0.0f, __cublasTime = 0.0f;
// MPI_Barrier(MPI_COMM_WORLD);
double t1 = getCurrentTime();
// if (rank == 0)
// printf("executing\n");
pipe_rowmajorABC(handle, dAlpha, dBeta, m1, m2, m1m2, comm, stream, M, N, K, __allReduceTime, __cublasTime);
double t2 = getCurrentTime();
// if (rank == 0)
// printf("executing done\n");
if (iter >= 10) {
totalTime += (t2-t1)*1000.0f;
allReduceTime += __allReduceTime;
cublasTime += __cublasTime;
}
// MPI_Barrier(MPI_COMM_WORLD);
if (iter == 0)
{
float *hm1 = new float[M*K];
float *hm2 = new float[N*K];
float *hm1m2 = new float[M*N];
cudaMemcpyHalfDevice2FloatHost(hm1, m1, M*K);
cudaMemcpyHalfDevice2FloatHost(hm2, m2, N*K);
cudaMemcpyHalfDevice2FloatHost(hm1m2, m1m2, M*N);
if (rank == 0)
printf("checking results at iter %d \n", iter);
if (!mpiRef(hm1, hm2, hm1m2, M, N, K, comm_size))
assert(false);
}
}
MPI_Barrier(MPI_COMM_WORLD);
if (rank == 0)
printf("AllReduce+cuBLAS: TotalTime %f ms, AllReduceTime %f ms, cuBLAS Time %f ms\n", totalTime, allReduceTime, cublasTime);
#endif
memset_value(m1m2, __float2half(0.0f), M*N);
totalTime = 0.0;
allReduceTime = 0;
matmulTime = 0;
int chunkRows;
// int chunkCols = 64;
assert(N % chunkCols == 0);
std::vector<std::vector<std::tuple<int, int, int, int>>> chunkBlocks = getChunkBlocks<half>(rank, M*N, comm_size, rings, M, N, chunkCols, chunkRows) ;
if (rank == 0 && false) {
float time = cutlassGeMM(M, N, K, rank, chunkBlocks);
printf("cutlass GeMM Time: %f\n", time);
}
MPI_Barrier(MPI_COMM_WORLD);
{
float cutlassTime = 0.0f;
allReduceTime = 0.0f;
//Overlapped AllReduce + CUTLASS
int length_m = M;
int length_n = N;
int length_k = K;
cutlass::gemm::GemmCoord problem_size(M, N, K);
cutlass::TensorRef<ElementInputA, LayoutInputA> tensor_a((cutlass::half_t*)m1, LayoutInputA::packed(problem_size.mk()));
cutlass::TensorRef<ElementInputB, LayoutInputB> tensor_b((cutlass::half_t*)m2, LayoutInputA::packed(problem_size.kn()));
cutlass::TensorRef<ElementOutput, LayoutOutput> tensor_c((cutlass::half_t*)_m1m2, LayoutInputA::packed(problem_size.mn()));
cutlass::TensorRef<ElementOutput, LayoutOutput> tensor_d((cutlass::half_t*)m1m2, LayoutInputA::packed(problem_size.mn()));
// Initialize alpha and beta for dot product computation
ElementComputeEpilogue alpha = ElementComputeEpilogue(1);
ElementComputeEpilogue beta = ElementComputeEpilogue(0);
// Split K dimension into 1 partitions
int split_k_slices = 1;
//Initialize the memory for thread block to tile map.
int numTiles = (length_m*length_n)/(ShapeMMAThreadBlock::kMN);
int* threadBlockToTileMap;
int* tileIdx;
int* tileStatusMap;
CUDACHECK(cudaMalloc(&tileIdx, sizeof(int)));
CUDACHECK(cudaMemset(tileIdx, 0, sizeof(int)));
CUDACHECK(cudaMalloc(&threadBlockToTileMap, numTiles * 2 * sizeof(int)));
//An array of integers for each tile to indicate if tile is waiting (0) or finished (1)
CUDACHECK(cudaMalloc(&tileStatusMap, numTiles * 4 * sizeof(int)));
CUDACHECK(cudaMemset(tileStatusMap, 0, numTiles * 4 * sizeof(int)));
//Create an array of tile order.
ShapeMMAThreadBlock shape;
int *tileOrder = new int[numTiles * 2];
int idx = 0;
for (int ty = 0; ty < length_m/ShapeMMAThreadBlock::kM; ty++) {
for (int tx = 0; tx < length_n/ShapeMMAThreadBlock::kN; tx++) {
tileOrder[idx] = tx;
tileOrder[idx + 1] = ty;
idx += 2;
}
}
std::vector<int> hChunksForTile;
int maxChunksForTile = 0;
const int combinedChunks = nChannels;
if (true) {
idx = 0;
int chunk = 0;
std::set<std::pair<int, int>> chunkTBs;
std::vector<std::pair<int, int>> tileOrderAsPair;
std::map<int, std::set<int>> tileToChunks;
int tilesForChunk = 0;
for (auto channelChunks: chunkBlocks) {
for (int channel = 0; channel < channelChunks.size(); channel++) {
auto chunk = channelChunks[channel];
int cy = std::get<0>(chunk);
int cx = std::get<1>(chunk);
int m = std::get<2>(chunk);
int n = std::get<3>(chunk);
int chunkIndex = cy/chunkRows * N/chunkCols + cx/chunkCols;
//For a chunk get all tiles required to obtain this chunk
int startTy = (cy/ ShapeMMAThreadBlock::kM) * ShapeMMAThreadBlock::kM;
for (int ty = startTy; ty < min(cy + m, length_m); ty += ShapeMMAThreadBlock::kM) {
for (int tx = cx; tx < min(cx + n, length_n); tx += ShapeMMAThreadBlock::kN) {
int tileIndex = ty/ShapeMMAThreadBlock::kM * (N/ShapeMMAThreadBlock::kN) + tx/ShapeMMAThreadBlock::kN;
if (tileToChunks[tileIndex].count(chunkIndex/combinedChunks) == 0) {
tileToChunks[tileIndex].insert(chunkIndex/combinedChunks);
// if (rank == 0 && cy >= 7920) {
// printf("cy %d cx %d chunkIndex %d\n", cy, cx, chunkIndex);
// tilesForChunk++;
// }
}
// if (chunkIndex == 0) {
// if (rank == 0)
// printf("1199: %d x %d -> %d x %d -> %d\n",
// cy, cx, ty/ShapeMMAThreadBlock::kM, tx/ShapeMMAThreadBlock::kN, tileIndex);
// }
if (chunkTBs.count(std::make_pair(ty,tx)) == 0) {
chunkTBs.insert(std::make_pair(ty,tx));
// if (rank == 0 && channel == 0)
// printf("%d x %d -> %d x %d -> %d\n", cy, cx, ty/ShapeMMAThreadBlock::kM, tx/ShapeMMAThreadBlock::kN, tileIndex);
tileOrderAsPair.push_back(std::make_pair(tx/ShapeMMAThreadBlock::kN, ty/ShapeMMAThreadBlock::kM));
}
}
}
}
}
// if (rank == 0) {
// printf("rank %d tilesForChunk %d\n", rank, tilesForChunk);
// }
for (auto v : tileToChunks) {
maxChunksForTile = std::max(maxChunksForTile, (int)v.second.size());
}
hChunksForTile = std::vector<int>(maxChunksForTile * numTiles, 0);
for (auto it : tileToChunks) {
int i = 0;
for (int c : it.second) {
hChunksForTile[it.first * maxChunksForTile + i] = c;
i++;
}
for (; i < maxChunksForTile; i++) {
hChunksForTile[it.first * maxChunksForTile + i] = -1;
}
}
int _idx = 0;
for (int i = 0; i < tileOrderAsPair.size(); i++) {
tileOrder[_idx] = tileOrderAsPair[i].second; //Swap because x ("m") is row and y ("n") is column.
tileOrder[_idx+1] = tileOrderAsPair[i].first;
// printf("%d %d\n", tileOrder[_idx], tileOrder[_idx + 1]);
_idx += 2;
idx += 2;
}
}
int* chunksForTile;
CUDACHECK(cudaMemcpy(threadBlockToTileMap, tileOrder, numTiles * 2 * sizeof(int), cudaMemcpyHostToDevice));
CUDACHECK(cudaMalloc(&chunksForTile, hChunksForTile.size() * sizeof(int)));
CUDACHECK(cudaMemcpy(chunksForTile, &hChunksForTile[0], hChunksForTile.size() * sizeof(int), cudaMemcpyHostToDevice));
// delete[] tileOrder;
typename Gemm::Arguments arguments{problem_size, // <- problem size of matrix multiplication
tensor_a, // <- reference to matrix A on device
tensor_b, // <- reference to matrix B on device
tensor_c, // <- reference to matrix C on device
tensor_d, // <- reference to matrix D on device
maxChunksForTile,
chunksForTile,
tileIdx,
threadBlockToTileMap,
tileStatusMap,
{alpha, beta}, // <- tuple of alpha and beta
split_k_slices}; // <- k-dimension split factor
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Instantiate CUTLASS kernel depending on templates
Gemm gemm_op;
// Check the problem size is supported or not
cutlass::Status status = gemm_op.can_implement(arguments);
CUTLASS_CHECK(status);
status = gemm_op.initialize(arguments, workspace.get());
CUTLASS_CHECK(status);
// cudaProfilerStart();
// CUDACHECK(cudaFuncSetAttribute(dummyKernel<80>,
// cudaFuncAttributeMaxDynamicSharedMemorySize,
// 96*1024));
CUDACHECK(cudaMemset(tileIdx, 0, sizeof(int)));
CUDACHECK(cudaMemset(tileStatusMap, 0, numTiles * 4 * sizeof(int)));
float minSampleTime = 10000000.0f;
float sampleTime = 0.0f; // must be initialized: it is accumulated below before first reset
for(int iter = 0; iter < 110; iter++) {
//CUDACHECK(cudaMemset(tileIdx, 0, sizeof(int)));
// CUDACHECK(cudaMemset(tileStatusMap, 0, numTiles * sizeof(int)));
if (rank == 0 && iter %20 == 0)
printf("iter %d\n", iter);
cudaEvent_t startpipe, stoppipe;
cudaEvent_t cutlassStartPipe, cutlassStopPipe;
float elapsedTimepipe, cutlassElapsedTimepipe;
// MPI_Barrier(MPI_COMM_WORLD);
CUDACHECK(cudaEventCreate(&startpipe));
CUDACHECK(cudaEventCreate(&stoppipe));
CUDACHECK(cudaEventCreate(&cutlassStartPipe));
CUDACHECK(cudaEventCreate(&cutlassStopPipe));
CUDACHECK(cudaEventRecord(startpipe, stream));
CUDACHECK(cudaEventRecord(cutlassStartPipe, cutlassStream));
double t1 = getCurrentTime();
//NCCLCHECK(ncclAllReduceMatrix(m1m2, M*N, M, N, N, ncclHalf, ncclSum, comm, stream));
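// Overlapped path: the custom allreduce runs on `stream` while the CUTLASS GEMM
// runs on the lower-priority `cutlassStream`; the GEMM reports per-tile completion
// through tileStatusMap, which the allreduce reads so it can send chunks whose
// tiles are ready.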
NCCLCHECK(ncclAllReduceOverlapMatMul((const void*)m1, (void*)m2, (void*)m1m2, tileStatusMap, M*N, M, N, K, chunkCols, iter, ncclHalf, ncclSum, comm, stream));
// dummyKernel<80><<<12, 1024, 96*1024, stream>>>(tileStatusMap, numTiles, iter);
// First run to check results
status = gemm_op(iter, cutlassStream);
CUTLASS_CHECK(status);
CUDACHECK(cudaEventRecord(cutlassStopPipe, cutlassStream));
CUDACHECK(cudaEventSynchronize(cutlassStopPipe));
CUDACHECK(cudaEventElapsedTime(&cutlassElapsedTimepipe, cutlassStartPipe,cutlassStopPipe));
// printf("cutlassElapsedTimepipe %f\n", cutlassElapsedTimepipe);
CUDACHECK(cudaEventRecord(stoppipe, stream));
CUDACHECK(cudaEventSynchronize(stoppipe));
double t2 = getCurrentTime();
CUDACHECK(cudaEventElapsedTime(&elapsedTimepipe, startpipe,stoppipe));
CUDACHECK(cudaEventElapsedTime(&cutlassElapsedTimepipe, cutlassStartPipe,cutlassStopPipe));
if (iter >= 10) {
totalTime += (t2-t1)*1000.0f;
allReduceTime += elapsedTimepipe;
cutlassTime += cutlassElapsedTimepipe;
sampleTime += (t2-t1)*1000.0f;
if (iter > 10 && iter % 10 == 0) {
minSampleTime = std::min(minSampleTime, sampleTime*10);
sampleTime = 0;//(t2-t1)*1000.0f;
}
}
if (iter == 0)
{
MPI_Barrier(MPI_COMM_WORLD);
float *hm1 = new float[M*K];
float *hm2 = new float[N*K];
float *hm1m2 = new float[M*N];
cudaMemcpyHalfDevice2FloatHost(hm1, m1, M*K);
cudaMemcpyHalfDevice2FloatHost(hm2, m2, N*K);
cudaMemcpyHalfDevice2FloatHost(hm1m2, m1m2, M*N);
if (rank == 0)
printf("checking results at iter %d %d\n", iter, rank);
if (!mpiRef(hm1, hm2, hm1m2, M, N, K, comm_size, rank))
assert(false);
}
}
// cudaProfilerStop();
MPI_Barrier(MPI_COMM_WORLD);
if (rank == 0)
printf("rank %d ===Overlapped(AllReduce, cutlass) Time: %f ms cutlass: %f ms, allreduceTime: %f ms, minSampleTime: %f ms\n", rank, totalTime, cutlassTime, allReduceTime, minSampleTime);
// printf("rank %d cutlass %f\n", rank, cutlassTime);
}
CUDACHECK(cudaFree(m1));
CUDACHECK(cudaFree(m2));
CUDACHECK(cudaFree(m1m2));
}
}
MPI_Finalize();
}
The reason for the seg faults is that NCCL_BUFFSIZE is really large; try a lower value of NCCL_BUFFSIZE.
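For example, rerunning with the 750KB buffer size mentioned earlier in this thread (illustrative only; I have not verified this exact setting for your MNK):
mpirun -np 8 --allow-run-as-root -x LD_LIBRARY_PATH="../build/lib/:/usr/local/cuda/lib64/:$LD_LIBRARY_PATH" -x NCCL_PROTO=Simple -x NCCL_ALGO=Ring -x NCCL_DEBUG=INFO -x NCCL_MIN_NCHANNELS=24 -x NCCL_MAX_NCHANNELS=24 -x NCCL_NTHREADS=512 -x NCCL_BUFFSIZE=768000 ./matmul-allreduce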
Environment: Ubuntu 20.04 and CUDA 10.2, 8-GPU V100
1. Steps: download the latest code, enter the nccl-overlap directory, and recompile NCCL and matmul-allreduce:
make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"
make matmul-allreduce-release
Then execute the command:
mpirun -np 8 --allow-run-as-root -x LD_LIBRARY_PATH="../build/lib/:/usr/local/cuda/lib64/:$LD_LIBRARY_PATH" -x NCCL_PROTO=Simple -x NCCL_ALGO=Ring -x NCCL_DEBUG=INFO -x NCCL_MIN_NCHANNELS=24 -x NCCL_MAX_NCHANNELS=24 -x NCCL_NTHREADS=512 -x NCCL_BUFFSIZE=6291456 ./matmul-allreduce
2. Performance data:
AllReduce+cuBLAS: TotalTime 157.270889 ms, AllReduceTime 106.608643 ms, cuBLAS Time 48.984436 ms
Overlapped(AllReduce, cutlass) Time: 207.144745 ms cutlass: 109.829987 ms, allreduceTime: 206.435776 ms, minSampleTime: 206.210617 ms
AllReduce+cuBLAS: TotalTime 262.859344 ms, AllReduceTime 174.754623 ms, cuBLAS Time 86.210968 ms
Overlapped(AllReduce, cutlass) Time: 382.225433 ms cutlass: 213.734665 ms, allreduceTime: 381.512939 ms, minSampleTime: 368.232727 ms
AllReduce+cuBLAS: TotalTime 469.725128 ms, AllReduceTime 301.069916 ms, cuBLAS Time 166.952026 ms
Overlapped(AllReduce, cutlass) Time: 682.020752 ms cutlass: 440.898163 ms, allreduceTime: 681.286255 ms, minSampleTime: 680.000732 ms
AllReduce+cuBLAS: TotalTime 914.322266 ms, AllReduceTime 575.207947 ms, cuBLAS Time 329.937164 ms
Overlapped(AllReduce, cutlass) Time: 1359.441162 ms cutlass: 929.144897 ms, allreduceTime: 1358.669312 ms, minSampleTime: 1355.585938 ms
3. Questions: when setting different NCCL_MIN_NCHANNELS and mpirun -np values, runs sometimes stall. Do you also encounter this, and how did you solve it? How can I reproduce the 1.36x speedup from your paper? Can you provide a detailed environment and the corresponding runtime parameter settings to reproduce it?