lixiaolx closed this 1 year ago
Both the CoCoNet and NCCL times were obtained using the values of NCCL_BUFFSIZE and CHANNELS that perform best. The CUTLASS tile size was the same in both cases, although a different tile size might have made CoCoNet perform better.
I would like to ask how the 1.36X figure in the paper was obtained. Is the baseline measured with the same NCCL channel setting as the overlap run?
Or is the best time across channel settings for cuBLAS + AllReduce compared against the best time across channel settings for overlap to get 1.36X?
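To make the two readings of the question concrete, here is a minimal sketch of both ways of computing the speedup. The timings and the names `baseline_ms`, `overlap_ms`, `same_channel_speedup`, and `best_vs_best_speedup` are hypothetical placeholders for illustration only; they are not measurements from the paper or from CoCoNet.

```python
# Hypothetical timings in milliseconds, keyed by number of NCCL channels.
# These are made-up placeholder values, NOT results from the paper.
baseline_ms = {2: 10.0, 4: 8.0, 8: 9.0}  # cuBLAS + AllReduce (no overlap)
overlap_ms = {2: 8.0, 4: 7.0, 8: 6.0}    # overlapped GEMM + AllReduce


def same_channel_speedup(baseline, overlap):
    """Reading 1: compare both methods at the same channel count."""
    return {c: baseline[c] / overlap[c] for c in baseline if c in overlap}


def best_vs_best_speedup(baseline, overlap):
    """Reading 2: compare each method's best time over all channel counts."""
    return min(baseline.values()) / min(overlap.values())


per_channel = same_channel_speedup(baseline_ms, overlap_ms)
best = best_vs_best_speedup(baseline_ms, overlap_ms)

for channels, speedup in sorted(per_channel.items()):
    print(f"channels={channels}: same-channel speedup = {speedup:.2f}x")
print(f"best-vs-best speedup = {best:.2f}x")
```

With placeholder numbers like these, the two readings give different speedups, which is why it matters which one the 1.36X refers to.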