hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[DOC]: Is there documentation on how to create hostfiles #5965

Closed: zhenbuxianggaimingzi closed this issue 2 months ago

zhenbuxianggaimingzi commented 2 months ago

📚 The doc issue

According to the distributed training launch instructions in the documentation,

colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 test.py

I need to create a hostfile in which each row holds the configuration for one node,

host1
host2

But is there any documentation or example that explains what each row and column of the hostfile should contain? I didn't see any instructions for that, e.g. whether to include the IP, port, GPU IDs, etc.

Edenzzzz commented 2 months ago

It should be the IP addresses of the machines, all of which must be able to SSH into one another. You can set --ssh-port, but you must ensure that this port is open on all machines. Example:

colossalai run --hostfile hosts.txt --master_addr 10.20.1.170 --master_port 29505 --ssh-port 34000 --nproc_per_node 4 benchmark.py
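One way to sanity-check the "port must be open on all machines" requirement before launching is a small TCP probe. This is only a sketch and not part of ColossalAI: `port_is_open` and `unreachable_hosts` are hypothetical helper names, and the check only confirms that something accepts TCP connections on the port, not that SSH login actually works. Since every machine has to reach every other one, it would need to be run from each node.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def unreachable_hosts(hosts, port, timeout=3.0):
    """Return the subset of hosts whose port could not be reached."""
    return [h for h in hosts if not port_is_open(h, port, timeout)]
```

For the command above, something like `unreachable_hosts(hosts, 34000)` with `hosts` set to the addresses from hosts.txt should come back empty on every node; a non-empty list points at a firewall or sshd configuration problem on the listed machines.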

zhenbuxianggaimingzi commented 2 months ago

Thank you for your reply. What I also want to ask is how to write hosts.txt. For example, should each line in hosts.txt contain the following information?

ip gpu_ids port
xx.xxx.xx.22 0,1,2,3 9999
xx.xxx.xx.23 0,1,2,3 8888

I'm not sure what the canonical format is; it doesn't seem to be documented.

Edenzzzz commented 2 months ago

You should put only the IPs in the file. I may update the docs when I'm available, or you can submit a PR.
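To make that concrete, here is a minimal sketch of generating and checking such a file: one IP (or hostname) per line and nothing else, with the GPU count and ports passed on the command line via --nproc_per_node, --master_port and --ssh-port instead. The helpers `write_hostfile` and `read_hostfile` are hypothetical and not ColossalAI's own parser; skipping blank lines and rejecting duplicates are assumptions on my part, and 10.20.1.171 is a placeholder address.

```python
import tempfile
from pathlib import Path


def _check_entry(entry):
    """Reject anything that is not a single IP/hostname token."""
    if len(entry.split()) != 1:
        raise ValueError(f"expected a single IP or hostname, got {entry!r}")


def write_hostfile(path, hosts):
    """Write one host per line (assumed format: IPs only, no extra columns)."""
    seen = set()
    for host in hosts:
        _check_entry(host)
        if host in seen:
            raise ValueError(f"duplicate host {host!r}")
        seen.add(host)
    Path(path).write_text("\n".join(hosts) + "\n")


def read_hostfile(path):
    """Read hosts back, skipping blank lines and rejecting extra columns."""
    hosts = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        _check_entry(line)
        hosts.append(line)
    return hosts


# Small demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    hostfile = Path(tmp) / "hosts.txt"
    write_hostfile(hostfile, ["10.20.1.170", "10.20.1.171"])
    print(hostfile.read_text(), end="")
```

A line such as `xx.xxx.xx.22 0,1,2,3 9999` from the question above would be rejected by `read_hostfile`, because it carries GPU IDs and a port in addition to the address.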