zhenbuxianggaimingzi closed this issue 2 months ago
It should be the IP addresses of the machines, all of which must be able to SSH into one another. You can set --ssh-port, but you must ensure that this port is open on all machines.
Ex:
colossalai run --hostfile hosts.txt --master_addr 10.20.1.170 --master_port 29505 --ssh-port 34000 --nproc_per_node 4 benchmark.py
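Before launching, it can help to confirm that the SSH port really is open on every machine. The following is a minimal sketch, not part of ColossalAI: the helper names `port_is_open` and `unreachable_hosts` are hypothetical, and it only tests that a TCP connection can be opened from the machine where it runs. It does not verify SSH authentication, so it would need to be run on each node to cover the "all machines can SSH into one another" requirement.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def unreachable_hosts(hosts, port, timeout=3.0):
    """Return the subset of hosts whose port could not be reached."""
    return [h for h in hosts if not port_is_open(h, port, timeout)]


if __name__ == "__main__":
    # Assumption: hosts.txt holds one IP per line; 34000 is the
    # --ssh-port value from the example command above.
    with open("hosts.txt") as f:
        hosts = [line.strip() for line in f if line.strip()]
    print("unreachable:", unreachable_hosts(hosts, 34000))
```

An empty list means every host accepted a connection on that port from the current machine.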
Thank you for your reply. What I also want to ask is how to write hosts.txt. For example, should each line in hosts.txt contain the following information?
ip gpu_ids port
xx.xxx.xx.22 0,1,2,3 9999
xx.xxx.xx.23 0,1,2,3 8888
I'm not sure what the canonical way is; it doesn't seem to be documented.
You should only put in the IPs, with no GPU IDs or ports. I may update the docs when I have time, or you can submit a PR.
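As a rough illustration of the "IPs only" format, here is a small sketch that generates such a file. It assumes one IP per line, which is inferred from the question rather than confirmed in this thread; the helper names `render_hostfile` and `write_hostfile` are hypothetical, and 10.20.1.171 is a placeholder address.

```python
from pathlib import Path


def render_hostfile(hosts):
    """Return hostfile text: one IP per line, nothing else on the line."""
    # Assumption: one host per line, with no GPU ids or ports.
    cleaned = [h.strip() for h in hosts if h.strip()]
    return "".join(h + "\n" for h in cleaned)


def write_hostfile(path, hosts):
    """Write the rendered hostfile to path and return the text written."""
    text = render_hostfile(hosts)
    Path(path).write_text(text)
    return text


if __name__ == "__main__":
    # 10.20.1.170 is the master address from the example command;
    # 10.20.1.171 is a placeholder for a second node.
    print(write_hostfile("hosts.txt", ["10.20.1.170", "10.20.1.171"]), end="")
```

The number of processes per node and the SSH port are then passed on the command line (--nproc_per_node, --ssh-port) rather than in the file.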
📚 The doc issue
According to the distributed training startup instructions in the documentation,
I need to create a hostfile in which each row is the configuration for one node.
Is there any documentation or example that explains which parameters go in each row and column of the hostfile? I didn't see any instructions for that, such as IP, port, GPU ID, etc.