Hi,
We upgraded torch-ccl from 2021.1-beta07-1 to 1.10 and noticed a performance regression for all_to_all. Overall, torch-ccl 1.10 is about 2x slower than 2021.1-beta07-1.
system config:
single node, 2 processes per node, so no network communication is involved
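For context, the payload per call is small and stays entirely on-node; a quick back-of-the-envelope check based on the tensor shape used in the test code below (262144 x 16 bfloat16 elements, 2 bytes each):

# payload moved by one all_to_all call in the test below
elements = 262144 * 16
payload_mib = elements * 2 / (1024 ** 2)
print(f"payload per call: {payload_mib:.1f} MiB")  # 8.0 MiB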
Any idea on the root cause?
all_to_all profiling for torch-ccl 1.10:
all_to_all profiling for torch-ccl 2021.1-beta07-1:
test code:
import torch
import extend_distributed as ext_dist

if __name__ == "__main__":
    ext_dist.init_distributed(backend='ccl')

    # single bfloat16 tensor of shape (262144, 16) exchanged via all_to_all
    input = []
    tensor = torch.ones(262144, 16, dtype=torch.bfloat16)
    input.append(tensor)

    # profile 10 iterations of the asynchronous all_to_all plus its wait()
    with torch.autograd.profiler.profile(True) as prof:
        for _ in range(10):
            a2a_req = ext_dist.alltoall(input, None)
            ly_sparse = a2a_req.wait()

    print(prof.key_averages().table(sort_by="cpu_time_total"))
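To rule out the extend_distributed wrapper itself, here is a minimal sketch that profiles the same payload through torch.distributed.all_to_all directly. It assumes the CCL bindings module is importable as torch_ccl (the name used by the 1.10 release) and that the launcher exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT:

import os
import torch
import torch.distributed as dist
import torch_ccl  # registers the 'ccl' backend; module name assumed for the 1.10 release

if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(
        backend="ccl",
        rank=int(os.environ.get("RANK", "0")),
        world_size=int(os.environ.get("WORLD_SIZE", "1")),
    )
    world_size = dist.get_world_size()

    # same payload as above, split evenly across ranks for the raw all_to_all call
    tensor = torch.ones(262144, 16, dtype=torch.bfloat16)
    inputs = list(tensor.chunk(world_size))
    outputs = [torch.empty_like(t) for t in inputs]

    with torch.autograd.profiler.profile(True) as prof:
        for _ in range(10):
            dist.all_to_all(outputs, inputs)
    print(prof.key_averages().table(sort_by="cpu_time_total"))

If this direct call shows the same 2x gap between the two releases, the regression is in the torch-ccl/oneCCL layer rather than in the wrapper.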
For extend_distributed, please refer to https://github.com/IntelAI/models/blob/master/models/recommendation/pytorch/dlrm/training/bfloat16/extend_distributed.py

Thanks