pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

torchrun defaults for concurrent distributed training jobs #2015

Closed by ebsmothers 6 days ago

ebsmothers commented 6 days ago

Previously it was not possible to launch more than one distributed training job on the same node at the same time, as torchrun tries to use the same port for both by default. It's possible to manually pass the --rdzv-backend and --rdzv-endpoint flags to torchrun every time you kick off a second run, but this is annoying (and not obvious). Instead we can just default to letting torchrun find a free port automatically.
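As a side note, "letting the OS find a free port automatically" boils down to binding to port 0, which is what an endpoint of `localhost:0` asks for. A minimal sketch of the mechanism (the helper name is hypothetical, not torchtune code):

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS to assign an unused ephemeral port.
    # This illustrates what an endpoint of localhost:0 requests from the
    # rendezvous layer; it is a sketch, not the actual torchrun internals.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]
```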

Test plan:

Run both the following commands on the same node.

CUDA_VISIBLE_DEVICES=0,1 tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_2/1B_lora
CUDA_VISIBLE_DEVICES=2,3 tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_2/1B_lora

Before this PR, the second job would fail with an error message like:

  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 67, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
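The EADDRINUSE failure can be reproduced without torchrun at all: it is just two listeners contending for one fixed port, as happens when two torchrun jobs both default to port 29500. A small self-contained sketch (not torchtune code; it uses an OS-assigned port as the stand-in for the fixed default so it runs anywhere):

```python
import errno
import socket

def second_bind_fails(host: str = "localhost") -> bool:
    # Two sockets contend for one port, like two torchrun jobs both
    # defaulting to 29500. The first bind+listen succeeds; the second
    # bind to the same port raises EADDRINUSE ("address already in use").
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0: stand-in for the fixed default port
        first.listen()
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # the "second job" tries the same port
        except OSError as e:
            return e.errno == errno.EADDRINUSE
        return False
    finally:
        first.close()
        second.close()
```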

After this PR, both jobs are able to train successfully:

1|79|Loss: 1.0284696817398071:  10%|█████████▉                                                                                            | 79/808 [01:44<15:26,  1.27s/it]
1|63|Loss: 1.0152941942214966:   8%|███████▋                                                                                           | 63/808 [01:24<16:20,  1.32s/it]
pytorch-bot[bot] commented 6 days ago

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2015

Note: Links to docs will display an error until the docs builds have been completed.

:heavy_exclamation_mark: 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

:white_check_mark: No Failures

As of commit 8af1708f47c84b0e0e2af38476a68b656cd6ac68 with merge base bca5899480f54ebb85fea16231707ec36ee606ad (image): :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

RdoubleA commented 6 days ago

Can users still have the option to pass in a specific port? Wondering whether, for production environments / high-compute settings, this degree of control is preferable to auto-selecting.

ebsmothers commented 6 days ago

@RdoubleA yeah, fair point. I guess we could wrap the endpoint definition in an if statement or something to check whether it's already been passed by the user. The annoying thing is that there are already other torchrun defaults for these values (e.g. torchrun uses "static" as the rendezvous backend by default instead of "c10d", and we need to override that).

ebsmothers commented 6 days ago

OK @RdoubleA let me know if the latest updates look reasonable to you. I realized there is a --standalone torchrun flag that will do the same thing as --rdzv-backend=c10d --rdzv-endpoint=localhost:0 (see the description in this commit message). So I will enable this by default, but only if --rdzv-endpoint is not passed. That way we still give the ability to set the endpoint explicitly and don't have to muck around with any other defaults like --rdzv-backend ourselves to do it.
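The conditional logic described above could be sketched like this (a hypothetical helper, not the actual torchtune implementation): enable `--standalone` by default, but only when the user has not pinned a rendezvous endpoint themselves.

```python
def build_torchrun_args(user_args: list[str]) -> list[str]:
    # Hypothetical sketch of the approach: prepend --standalone (which
    # is equivalent to --rdzv-backend=c10d --rdzv-endpoint=localhost:0)
    # unless the user already supplied an rdzv endpoint explicitly.
    # torchrun accepts both --rdzv-endpoint and --rdzv_endpoint spellings.
    endpoint_flags = ("--rdzv-endpoint", "--rdzv_endpoint")
    endpoint_set = any(
        a in endpoint_flags or a.startswith(tuple(f + "=" for f in endpoint_flags))
        for a in user_args
    )
    if endpoint_set:
        return list(user_args)  # respect the user's explicit endpoint
    return ["--standalone", *user_args]
```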