tyohei commented 5 years ago

疑問

ncclAsyncMode とは
Channel とは
opCount とは
devComm とは → comm との違い
`llmode とは
NCCL_MAX_OPS とは

AllReduce の動作

AllReduce の main

ncclCommInitRank
ncclAllReduce
- struct: info = ncclCommInfo(...) collectives/all_reduce.cc#L14
- function ncclEnqueueCheck(&info)
ncclCommDestroy

`ncclEnqueueCheck` enqueue.cc#L409

IF ncclAsyncModel
ELSE
- ArgsCheck(info)
- saveKernel(info)
- ncclBarrierEnqueue(info->comm)
- ncclBarrierEnqueueWait(info->comm)
- ncclEnqueueEvents(info->comm)

`saveKernel(info)` enqueue.cc#L356

computeColl(info, &coll, &proxyArgs) → 何をしている？
blockDim.x を指定 → 何をしている？
cudaStream 指定
for (int bid=0; bid<coll.args.nChannels, bid++)

For 文の中身

まだ良くわからない

Block ID ごとにループを回している？
ループが回る回数は nChannels だけ → 何を意味している？

`computeColl(info, &coll, &proxyArgs)`

info から coll へ情報（root, count, send/recv buffers, devComm, opCount）をコピー
info から proxyArgs へ情報（nsteps, sliceSteps, chunkSteps, llMode, opCount）を計算及びコピー
- その際に llMode or treeMode を判定している → これらの違いは？
- llMode の意味は？
- sliceSteps の意味は？
- chunkSteps の意味は？

`ncclBarrierEnqueue(info->comm)`

基本的には User stream （nccl の通信を呼び出すストリーミ？、デフォルトストリーミかな）、とパラメータで指定されたストリーム（グループの場合もある）の違いによって、 cudaStreamWaitEvent を呼び出すようになっている

ncclCpuBarrierIn → わからない
ncclCpuBarrierLast → わからない

このあとに if(isLast) があるので、intra のプロセスを待ち、最後のプロセスが処理をするようになっている

    if (comm->launchMode == ncclComm::GROUP) {
      // I'm the last. Launch all operations.
      NCCLCHECK(ncclLaunchCooperativeKernelMultiDevice(comm->intraParams, comm->intraCudaDevs, comm->intraRanks, *comm-
    }
    NCCLCHECK(ncclCpuBarrierLast(comm));

こんな処理

`ncclBarrierEnqueueWait(info->comm)`

ncclCpuBarrierOut(comm)
launchMode == GROUP の場合は ncclBarrierEnqueue でカーネルが呼びされていだが、 PARALLEL の場合はここで呼び出される
comm->channels に格納されている各 channel に対して次の二つのを定義
- collStart = channel->collFifiTail
- collCount = 0
NCCLCHECK(transportStartProxy(comm));

// Start the network proxies as soon as the kernel has been launched. We can't // perform any CUDA call between the two or having a cudaFree between the CUDA // launch and the transportStartProxy call could cause a deadlock. // Also, starting the proxies after the CUDA launch seems to be better for // performance (latency).

`ncclEnqueueEvents(comm)`

必要ならストリームを待つのと、 userStreamSet を初期化する

けっこう単純な関数

ncclResult_t ncclEnqueueEvents(ncclComm_t comm) {
  struct cudaLaunchParams *params = comm->myParams;
  // Enqueue event after NCCL kernel
  CUDACHECK(cudaEventRecord(comm->doneEvent, params->stream));
  // Use internal NCCL stream for CGMD/GROUP launch if required or if the user stream is NULL
  if (comm->launchMode == ncclComm::GROUP && (comm->groupCudaStream || comm->userStream == NULL)) {
    // Create dependency between NCCL internal stream and user stream
    CUDACHECK(cudaStreamWaitEvent(comm->userStream, comm->doneEvent, 0));
  }
  comm->userStreamSet = false;
  return ncclSuccess;
}

tyohei commented 5 years ago

ncclColl

struct ncclColl {
  union {
    struct {
      struct CollectiveArgs args;
      uint16_t funcIndex;
      uint16_t nextIndex;
      uint8_t  active;
    };
    int data[0x10];
  };
};

tyohei commented 5 years ago

ncclInfo

struct ncclInfo {
  ncclColl_t coll;
  const char* opName;
  // NCCL Coll Args
  const void* sendbuff;
  void* recvbuff;
  size_t count;
  ncclDataType_t datatype;
  ncclRedOp_t op;
  int root;
  ncclComm_t comm;
  cudaStream_t stream;
  // Algorithm details
  int chunkSteps;
  int sliceSteps;
  // Computed later
  ncclPattern_t pattern;
  size_t nBytes;
  int nstepsPerLoop;
  int nchunksPerLoop;
};

tyohei commented 5 years ago

CollectiveArgs

struct CollectiveArgs {
  struct ncclDevComm* comm;
  uint64_t opCount;

  // local and remote input, output, and buffer
  const void * ThisInput;
  void * ThisOutput;

  // general parameters
  size_t N;
  uint32_t root;
  uint8_t bid;
  uint8_t nChannels;
  uint16_t nThreads;

  int lastChunkSize;
};

tyohei commented 5 years ago

今までのを整理

thread 1 GPU → Group calls を使わない → comm:launchModel == PARALLEL を仮定
CUDA カーネルは ncclBarrierEnqueueWait(info->comm) (enqueue.cc:438) で呼び出される
実際の cudaLaunchKernel は enqueue.cc:197 にあり、関数名などは comm->myParams に格納されている
comm->myParams に入る値を決定しているのが saveKernel という関数である

次は myParams に意識して saveKernel を読んでいく

tyohei commented 5 years ago

`saveKernel`

`computeColl`

getPatternInfo は info->pattern を呼び出されるOpによって決定する（AllReduceなら、これとか）
getLoopInfo は info->nstepsPerLoop を info->pattern から決定する
getKernelInfo は info から 1) nChannels と nThreads と llmode (low-latency) を決定する

For 文

gridDim.xの初期値は 0 (init.cc:577)
nChannels の初期値は nrings
nrings は transport/p2p.cc で頑張って定義されている
NCCL_MAX_OPS は NCCL Aggregated Operations のためにあるので今回は上限に行かない想定で大丈夫そう
channel->collFifoTail の初期値は 0 で良さそう int 型だし
activePtr は aggregate coll の現在のポインタ
最終的には gridDim.x == coll.args.nChannels になる
coll.args.nChannels と info->comm->nChannels が違う？
やはり computeColl を読まないとだめか。。
transportSaveProxies(&proxyArgs, info->pattern, info->root, info->comm->nRanks) が何をしているをのかがわからないな
- 内部で SaveProxy を呼んでいる

tyohei commented 5 years ago

9/1

[ ] saveKernel について読む

理由： CUDAカーネルが実際にローンチされているのは ncclBarrierEnqueueWait の関数内。で、このカーネルの実引数は info->comm->myParams が渡されてるで、この info->comm->myParams は saveKernel 内で色々定義されている

tyohei commented 5 years ago

init.cc:576:  params->blockDim.x = 0; params->blockDim.y = params->blockDim.z = 1;

変更されるのは blockDim.x のみ、 blockDim.y と blockDim.z は常に 1 に固定

init.cc:577:  params->gridDim.x = 0; params->gridDim.y = params->gridDim.z = 1;

変更されるのは blockDim.x のみ、 blockDim.y と blockDim.z は常に 1 に固定

tyohei commented 5 years ago

ncclChannel

struct ncclChannel {
  union {
    struct {
      struct ncclRing ring;
      struct ncclTree tree;

      int id;
      int nthreads;
      int buffSize;

      // Communication structures
      struct ncclPeer* peers;
      struct ncclPeer* devPeers;

      // Operation list for aggregation
      struct ncclColl* collectives;
      struct ncclColl* devCollectives;
      int collStart;
      int collCount;
      int collFifoHead; // Only used by GPU
      int collFifoTail; // Only used by CPU
    };
    int data[0x80];
  };
};
static_assert(sizeof(struct ncclChannel) == 0x80*sizeof(int), "ncclChannel must have a pow2 size");

tyohei commented 5 years ago

info->comm->myParams->gridDim.x == coll.args.nChannels が成立が普通に利用する場合は gridDim.x == nChannels が成立する

coll.args.nChannels <= info->comm->nChannels が成立 coll.args.nChannels は saveKernel() 内の computeColl() で定義される info->comm->nChannels は ncclInitRank() 内の ncclGetRings() で定義される

info->comm->myParams->blockDim.x >= coll.args.nThreads が成立が普通に利用する場合は blockDim.x == nThreads - 1 が成立する

For 文の中身を呼んでいるが、

struct ncclChannel* channel = info->comm->channels+(info->comm->myParams->gridDim.x % info->comm->nChannels);

ここで、同じ channel になりなえない気がする。んー

tyohei commented 5 years ago

ncclProxyArgs

struct ncclProxyArgs {
  proxyProgressFunc_t progress;
  struct ncclChannel* channel;
  struct ncclConnector* connector;
  int sliceSteps;
  int chunkSteps;
  int nsteps;
  uint64_t opCount;
  int llMode;
  int state;   // add component before this line -- it is left out during initialization

  // Internal state
  uint64_t head;
  uint64_t tail;
  uint64_t end;
  void* requests[NCCL_STEPS];
  int idle;

  // Element linking
  pthread_mutex_t mutex;
  struct ncclProxyArgs* next;
  struct ncclProxyArgs* nextPeer;
};

tyohei commented 5 years ago

煮詰まってきたのでここまでのを整理

AllReduce (mode=PARALLEL) の全体の流れ

ncclAllReduce() が呼ばれ ncclInfo 型の info が引数として ncclEnqueueCheck() が呼ばれる
ncclEnqueueCheck() 内の saveKernel() が myParams と channels に必要な情報を入れる
ncclEnqueueCheck() 内の ncclBarrierEnqueueWait() が comm->myParams を引数としてカーネルを起動する
直後に comm->channels のすべてのチャンネルが起動される（起動という言葉が正しいかわからない）
最終的に transportStartProxy() の呼び出しで完了

saveKernel の流れ

ほとんどの引数は comm->myParams には直接代入しない
comm->myParams->blockDim.x = nThreads - 1 ( nThreads は ncclInitRank 内部で定義済) と
comm->myParams->gridDim.x = nChannels ( nChannels は ncclInitRank 内部で定義済) だけ直接代入される
直接代入しない代わりにローカル変数の coll to proxyArgs を利用する
coll は info のデータと Grouping コール用のデータを両方持っている
proxyArgs は低レーヤー？の通信用のデータを持っている（nsteps, sliceSteps, chunkSteps, llmode, opCount）など
coll と proxyArgs への適切な値の代入は computeColl() でなされる
この際にスレッド数やチャンネル数を ncclInitRank とは別に計算する
CUDAグリッドの個数は上で求めたチャンネル数と同じになるように設定
CUDAブロックの個数は上で求めたスレッド数-1と同じになるように設定
For 文で comm->channels の各チャンネルに proxyArgs と coll の情報を代入 ( transportSaveProxies かな？)

次読むもの

transportStartProxy が何をしているのか？ → CUDAカーネルの起動とは別の何かを起動する必要がある？
transportSaveProxies が何をしているのか？ → どのように代入されるのか？
step, slice, and chunk ???

nccl-reader / nccl

[tsuji] 読み進めメモ #3

疑問

AllReduce の動作

AllReduce の main

`ncclEnqueueCheck` enqueue.cc#L409

`saveKernel(info)` enqueue.cc#L356

For 文の中身

`computeColl(info, &coll, &proxyArgs)`

`ncclBarrierEnqueue(info->comm)`

`ncclBarrierEnqueueWait(info->comm)`

`ncclEnqueueEvents(comm)`

ncclColl

ncclInfo

CollectiveArgs

今までのを整理

`saveKernel`

`computeColl`

For 文

9/1

ncclChannel

ncclProxyArgs

AllReduce (mode=PARALLEL) の全体の流れ

saveKernel の流れ

次読むもの

nccl-reader / nccl

[tsuji] 読み進めメモ #3

疑問

AllReduce の動作

AllReduce の main

ncclEnqueueCheck enqueue.cc#L409

saveKernel(info) enqueue.cc#L356

For 文の中身

computeColl(info, &coll, &proxyArgs)

ncclBarrierEnqueue(info->comm)

ncclBarrierEnqueueWait(info->comm)

ncclEnqueueEvents(comm)

ncclColl

ncclInfo

CollectiveArgs

今までのを整理

saveKernel

computeColl

For 文

9/1

ncclChannel

ncclProxyArgs

AllReduce (mode=PARALLEL) の全体の流れ

saveKernel の流れ

次読むもの

`ncclEnqueueCheck` enqueue.cc#L409

`saveKernel(info)` enqueue.cc#L356

`computeColl(info, &coll, &proxyArgs)`

`ncclBarrierEnqueue(info->comm)`

`ncclBarrierEnqueueWait(info->comm)`

`ncclEnqueueEvents(comm)`

`saveKernel`

`computeColl`