intel / torch-ccl

oneCCL Bindings for Pytorch*

alltoall performance regression after upgrading from 2021.1-beta07-1 to 1.10 #34

Open Peach-He opened 2 years ago

Peach-He commented 2 years ago

Hi, we upgraded torch-ccl from 2021.1-beta07-1 to 1.10 and noticed a performance regression for all_to_all: overall, torch-ccl 1.10 is about 2x slower than 2021.1-beta07-1. System config:

Any ideas on the root cause?

all_to_all profiling for torch-ccl 1.10: [screenshot: all2all-ccl1.10110]

all_to_all profiling for torch-ccl 2021.1-beta07-1: [screenshot: all2all-ccl2021.1-beta07-1]

test code:

import torch
import extend_distributed as ext_dist

if __name__ == "__main__":
    # Initialize the process group with the oneCCL backend.
    ext_dist.init_distributed(backend='ccl')

    # A single 262144x16 bfloat16 tensor as the all_to_all payload.
    inputs = [torch.ones(262144, 16, dtype=torch.bfloat16)]

    # Profile 10 blocking all_to_all exchanges.
    with torch.autograd.profiler.profile(True) as prof:
        for _ in range(10):
            a2a_req = ext_dist.alltoall(inputs, None)
            ly_sparse = a2a_req.wait()

    print(prof.key_averages().table(sort_by="cpu_time_total"))
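
To rule out autograd-profiler overhead skewing the comparison, a wall-clock variant of the same loop can be a useful cross-check. This is a minimal sketch assuming the same extend_distributed API as above; the warm-up and iteration counts are arbitrary:

import time

import torch
import extend_distributed as ext_dist

if __name__ == "__main__":
    ext_dist.init_distributed(backend='ccl')
    inputs = [torch.ones(262144, 16, dtype=torch.bfloat16)]

    # Warm up so one-time setup cost does not pollute the measurement.
    for _ in range(5):
        ext_dist.alltoall(inputs, None).wait()

    # Time a batch of blocking all_to_all exchanges.
    iters = 100
    start = time.perf_counter()
    for _ in range(iters):
        ext_dist.alltoall(inputs, None).wait()
    elapsed = time.perf_counter() - start

    print(f"avg all_to_all latency: {elapsed / iters * 1e3:.3f} ms")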

For extend_distributed, please refer to https://github.com/IntelAI/models/blob/master/models/recommendation/pytorch/dlrm/training/bfloat16/extend_distributed.py
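
For anyone without the DLRM helper handy, a roughly equivalent standalone repro can be written against torch.distributed directly. This is a sketch, assuming the PMI_RANK/PMI_SIZE environment variables that mpirun provides (following the torch-ccl README initialization pattern) and that the ccl backend supports all_to_all_single:

import os

import torch
import torch.distributed as dist
import torch_ccl  # imported for its side effect: registers the 'ccl' backend

if __name__ == "__main__":
    # Map the MPI-provided rank/size onto the variables torch.distributed expects.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ["RANK"] = str(os.environ.get("PMI_RANK", 0))
    os.environ["WORLD_SIZE"] = str(os.environ.get("PMI_SIZE", 1))

    dist.init_process_group(backend="ccl")

    tensor = torch.ones(262144, 16, dtype=torch.bfloat16)
    output = torch.empty_like(tensor)

    with torch.autograd.profiler.profile(True) as prof:
        for _ in range(10):
            # Equal splits across ranks; 262144 must be divisible by world size.
            dist.all_to_all_single(output, tensor)

    print(prof.key_averages().table(sort_by="cpu_time_total"))

Launched with, e.g., mpirun -n 2 python repro.py, this should exercise the same collective path without the DLRM wrapper.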

Thanks

chengjunlu commented 2 years ago

Hi Peach-He, thanks for raising the regression you found. We are investigating this issue.