microsoft / msccl-tools

Synthesizer for optimal collective communication algorithms
MIT License
98 stars 25 forks source link

allreduce on generic.ring() with num_node >= 5 cannot be solved. #47

Open cabinz opened 1 year ago

cabinz commented 1 year ago

Problem

The following code cannot solve a least-step algo of allreduce on the ring topology:

topo = ring(5)
coll = allreduce(topo.num_nodes())
algo = solve_least_steps(topo, coll, logging=True)
print(algo)

The program seems to iterate forever.

Description

I found that solve_least_step() work well with topology generic.ring(4) and collective allreduce(), when I am using the following code:

# only to change the ring(5) down to ring(4)
topo = ring(4)
coll = allreduce(topo.num_nodes())
algo = solve_least_steps(topo, coll, logging=True)
print(algo)

Terminal log is as below:

Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.0s)
(step 1) 0:0→3, 0:1→2, 0:2→1, 0:3→0
(step 2) 0:0→1, 0:1→0, 0:2→3, 0:3→2

However, if I run with ring(5) with the code snippet provided in the Problem section, no valid algo can be synthesized in acceptable time, the program will keep running with the terminal logging out as below:

Algorithms need at least 2 steps.
Solving instance steps=2... unsatisfiable. (0.1s)
Solving instance steps=3... unsatisfiable. (0.1s)
Solving instance steps=4... unsatisfiable. (0.1s)
...
Solving instance steps=100... unsatisfiable. (0.5s)
Solving instance steps=101... unsatisfiable. (0.5s)
Solving instance steps=102... unsatisfiable. (0.5s)
Solving instance steps=103... unsatisfiable. (0.5s)
Solving instance steps=104... unsatisfiable. (0.5s)
Solving instance steps=105... unsatisfiable. (0.5s)
Solving instance steps=106... unsatisfiable. (0.5s)
...

To the best of my knowledge, an AllReduce operator always has a reduce-broadcast implementation (may be not the optimal one). And the code below works properly:

topo = ring(5)
coll = reduce(topo.num_nodes(), 0)
algo = solve_least_steps(topo, coll, logging=True)
print(algo)
coll = broadcast(topo.num_nodes(), 0)
algo = solve_least_steps(topo, coll, logging=True)
print(algo)

with output:

Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.0s)
(step 1) 0:2→1, 0:3→4
(step 2) 0:1→0, 0:4→0
Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.0s)
(step 1) 0:0→1, 0:0→4
(step 2) 0:1→2, 0:4→3

indicating that there at least a 4-step algo for allreduce on this ring(5) topology.

Could you please kindly tell me how to synthsize a least-step allreduce algo for ring topology with 5 and more nodes?

saeedmaleki commented 1 year ago

Seems to be a bug with AllReduce. In fact, I tried AllGather and it solves it pretty quickly: msccl solve instance Ring Allgather -s 2 -n 5 Thanks for reporting this!

@olsaarik can you please take a look?

nariaki3551 commented 1 year ago

Hello,

I want to share some insights into this issue. The functions solve_least_steps and solve_all_latency_bandwidth_tradeoffs do not support CNR algorithms like Allreduce, as mentioned in SYNTHESIS.md. But somehow it seems to work fine with ring topologies of 4 nodes or less.

Further details, when Allreduce is given to solve_least_steps, ReductionNotApplicableError raises while creating the PathEncoding object. (precondition and postcondition) of combining algorithms are converted to the condition of the corresponding non-combining algorithm, but in the case of Allreduce, the condition is not converted because of this ReductionNotApplicableError. As a result, the problem is encoded with the same condition as AllGather's, except for added constraints for Allreduce. Because of these added constraints, the problem becomes infeasible (and keeps the function running forever) for ring topology with 5 nodes.

I think it it would be better to raise an exception when Allreduce is given into solve_least_steps or solve_all_latency_bandwidth_tradeoffs.