huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

Help running SFT on multiple machines, for example 8 nodes (1 A100 per node) #69

Open Atlantic8 opened 7 months ago

Atlantic8 commented 7 months ago

I modified deepspeed_zero3.yaml, setting num_machines to 8 and num_processes to 8, and got the following error. What else do I need to do to run SFT on an 8-node platform? Thanks.

  File "/home/work/xx/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/work/xx/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/work/xx/lib/python3.11/site-packages/accelerate/commands/launch.py", line 971, in launch_command
    deepspeed_launcher(args)
  File "/home/work/xx/lib/python3.11/site-packages/accelerate/commands/launch.py", line 687, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/work/xx/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/work/xx/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/xx/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 230, in launch_agent
    master_addr, master_port = _get_addr_and_port(rdzv_parameters)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/xx/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 170, in _get_addr_and_port
    master_addr, master_port = parse_rendezvous_endpoint(endpoint, default_port=-1)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/xx/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 95, in parse_rendezvous_endpoint
    raise ValueError(
ValueError: The port number of the rendezvous endpoint 'None:None' must be an integer between 0 and 65536.
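
The `'None:None'` in the message suggests that neither a main-process address nor a port reached the launcher, so the rendezvous endpoint was built from two unset values. The sketch below only imitates that behaviour in plain Python to show where the string comes from. `build_endpoint` and `parse_endpoint` are hypothetical helpers written for illustration, not the actual accelerate or torch functions, and the assumption that the endpoint is formed as `host:port` from the main-process settings is inferred from the error text.

```python
def build_endpoint(main_process_ip, main_process_port):
    # Assumption: the launcher joins the two settings as "host:port".
    # With both unset (None), the result is the literal string "None:None".
    return f"{main_process_ip}:{main_process_port}"


def parse_endpoint(endpoint):
    # Hypothetical stand-in for the port check seen in the traceback:
    # split on the last colon and require a numeric port in range.
    host, _, port_text = endpoint.rpartition(":")
    if not port_text.isdigit() or not 0 <= int(port_text) < 65536:
        raise ValueError(
            f"The port number of the rendezvous endpoint '{endpoint}' "
            "must be an integer between 0 and 65536."
        )
    return host, int(port_text)


if __name__ == "__main__":
    # Address and port below are placeholders, not values from this issue.
    print(parse_endpoint(build_endpoint("10.0.0.1", 29500)))
    try:
        parse_endpoint(build_endpoint(None, None))
    except ValueError as err:
        print(err)
```

If this reading is right, supplying a main-process IP and port (in the config file or on the command line) on every node should get past this particular check.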

JiuhaiChen commented 3 months ago

@Atlantic8 Did you solve this issue?
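
For the 8-node, one-GPU-per-node setup described in the original post, each node typically runs its own `accelerate launch` with the same main-process address and port but a different machine rank. The following is a minimal sketch that only prints one command per node. The flags `--num_machines`, `--num_processes`, `--machine_rank`, `--main_process_ip` and `--main_process_port` are standard `accelerate launch` options. The IP address, port, config file name and script name are placeholders assumed for illustration, and `launch_command` is a hypothetical helper, not part of any library.

```python
import shlex


def launch_command(
    machine_rank,
    num_machines=8,
    gpus_per_node=1,
    main_process_ip="10.0.0.1",           # placeholder: address of the rank-0 node
    main_process_port=29500,              # placeholder: any free port on that node
    config_file="accelerate_config.yaml", # placeholder config file name
    script="run_sft.py",                  # placeholder training script
):
    # num_processes is the total across all machines, not per node.
    total_processes = num_machines * gpus_per_node
    args = [
        "accelerate", "launch",
        "--config_file", config_file,
        "--num_machines", str(num_machines),
        "--num_processes", str(total_processes),
        "--machine_rank", str(machine_rank),
        "--main_process_ip", main_process_ip,
        "--main_process_port", str(main_process_port),
        script,
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # One command per node; run the command for rank i on node i.
    for rank in range(8):
        print(launch_command(rank))
```

Whether this is sufficient may also depend on the DeepSpeed multi-node launcher chosen in the config, so treat it as a starting point rather than a confirmed fix.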