microsoft / msccl-tools

Synthesizer for optimal collective communication algorithms
MIT License

Problem in generating xml for the allreduce #56

Open azharlightelligence opened 10 months ago

azharlightelligence commented 10 months ago

Hi, first of all, thanks for the quick response. I found that the examples and their generated .xml algorithms have a significant impact on system performance. I moved one step further to test custom auto-generated algorithms. The following example script works well for allgather:

from msccl.topologies import dgx_a100
from msccl.collectives import allgather, alltoall, reduce_scatter, allreduce
from msccl.collectives import reduce,scatter,gather,broadcast
from pprint import pprint
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.language import *
from msccl.topologies import *
from msccl.serialization import MSCCLEncoder
import os

topology = dgx_a100()
pprint(topology.links)
collective = allgather(topology.num_nodes())
algo = solve_instance(topology, collective, Instance(steps=4), logging=True)
jsonfile = MSCCLEncoder().encode(algo)
with open("data.json", "w") as text_file:
    text_file.write(jsonfile)
os.system("msccl ncclize -f data.json -o test.xml")

Output

[[0, 12, 12, 12, 12, 12, 12, 12],
 [12, 0, 12, 12, 12, 12, 12, 12],
 [12, 12, 0, 12, 12, 12, 12, 12],
 [12, 12, 12, 0, 12, 12, 12, 12],
 [12, 12, 12, 12, 0, 12, 12, 12],
 [12, 12, 12, 12, 12, 0, 12, 12],
 [12, 12, 12, 12, 12, 12, 0, 12],
 [12, 12, 12, 12, 12, 12, 12, 0]]
Solving instance steps=4... synthesized! (0.4s)
Wrote to test.xml
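A side note on the last line of the script: `os.system` only returns the exit status and does not raise when the command fails, so a failing `msccl ncclize` run is easy to miss in a longer pipeline. Below is a minimal sketch of a stricter variant using only the standard library. The helper names `run_checked` and `ncclize_command` are hypothetical and not part of msccl; the command line itself is the same one used in the script above.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run a command and return (ok, stderr_text).

    ok is True only when the process exits with status 0, so a
    failure is reported to the caller instead of being ignored.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stderr


def ncclize_command(json_path, xml_path):
    # Hypothetical helper: builds the same CLI call as the script above,
    # as an argument list so no shell quoting is involved.
    return ["msccl", "ncclize", "-f", json_path, "-o", xml_path]
```

Usage would look like `ok, err = run_checked(ncclize_command("data.json", "test.xml"))`, followed by printing `err` or raising when `ok` is false. This assumes the `msccl` entry point is on the `PATH`; otherwise `subprocess.run` raises `FileNotFoundError`.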

But the problem appears when I change allgather to allreduce, as below:

from msccl.topologies import dgx_a100
from msccl.collectives import allgather, alltoall, reduce_scatter, allreduce
from msccl.collectives import reduce,scatter,gather,broadcast
from pprint import pprint
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.language import *
from msccl.topologies import *
from msccl.serialization import MSCCLEncoder
import os

topology = dgx_a100()
pprint(topology.links)
collective = allreduce(topology.num_nodes())
algo = solve_instance(topology, collective, Instance(steps=4), logging=True)
jsonfile = MSCCLEncoder().encode(algo)
with open("data.json", "w") as text_file:
    text_file.write(jsonfile)
os.system("msccl ncclize -f data.json -o test.xml")

Output

[[0, 12, 12, 12, 12, 12, 12, 12],
 [12, 0, 12, 12, 12, 12, 12, 12],
 [12, 12, 0, 12, 12, 12, 12, 12],
 [12, 12, 12, 0, 12, 12, 12, 12],
 [12, 12, 12, 12, 0, 12, 12, 12],
 [12, 12, 12, 12, 12, 0, 12, 12],
 [12, 12, 12, 12, 12, 12, 0, 12],
 [12, 12, 12, 12, 12, 12, 12, 0]]
Solving instance steps=4... synthesized! (1.4s)
Traceback (most recent call last):
  File "/miniconda3/envs/py38/bin/msccl", line 33, in <module>
    sys.exit(load_entry_point('msccl==2.3.0', 'console_scripts', 'msccl')())
  File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/__main__.py", line 34, in main
  File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/cli/ncclize.py", line 29, in handle
  File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/ncclize.py", line 548, in ncclize
RuntimeError: Encountered receive and send on the same buffer index on step 1 (gpu=5, buf=i, off=0)
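Reading the error message literally, one GPU (here gpu=5) both receives into and sends from the same buffer slot (buf=i, off=0) within a single step. To illustrate that condition, here is a minimal sketch of such a check. It is not the logic in `ncclize.py`, and the `(step, src, dst, buf, off)` tuple format is an assumption made for this example rather than the layout of the JSON produced by `MSCCLEncoder`. It also assumes, as a simplification, that a transfer uses the same buffer name and offset on both the sending and the receiving side.

```python
def find_recv_send_conflicts(sends):
    """Find slots that are both received into and sent from in one step.

    sends: iterable of (step, src, dst, buf, off) tuples. This format is
    hypothetical and chosen only for illustration.
    Returns a sorted list of (step, gpu, buf, off) tuples.
    """
    sent = set()
    received = set()
    for step, src, dst, buf, off in sends:
        sent.add((step, src, buf, off))      # slot read on the sender
        received.add((step, dst, buf, off))  # slot written on the receiver
    # A conflict is a slot that appears on both sides in the same step.
    return sorted(sent & received)


# Made-up schedule: on step 1, GPU 5 receives into input offset 0 from
# GPU 4 and also forwards that same slot to GPU 6.
example = [
    (1, 4, 5, "i", 0),
    (1, 5, 6, "i", 0),
    (2, 6, 7, "i", 0),
]
conflicts = find_recv_send_conflicts(example)
print(conflicts)  # [(1, 5, 'i', 0)]
```

The made-up schedule above reproduces the shape of the reported tuple (step 1, gpu=5, buf=i, off=0), but it says nothing about why the synthesized allreduce algorithm ends up in that state.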

Could you please help check and resolve this issue so that I can use the generated .xml? Thanks in advance.