What
Lazily initialize TL NCCL on the first CUDA collective.

Why?
Both NCCL and CUDA require the CUDA device to be set before team creation. In MPI workloads this is not always possible: the UCC team is created inside MPI_Init, but selecting the device requires knowing the rank and local rank, which are only available after MPI_Init completes.
Replaces #758.