Can users still have the option to pass in a specific port? Wondering whether, for production environments / high-compute settings, this degree of control is preferable to auto-selecting.
@RdoubleA yeah, fair point. I guess we could wrap the endpoint definition in an if statement or something to check whether it's already been passed by the user. The annoying thing is that there are already other torchrun defaults for these values (e.g. torchrun uses `static` for the rendezvous backend by default instead of `c10d`, and we'd need to override that).
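As a rough sketch, the approach floated above would amount to something like the following (the recipe script name is a placeholder, and this is not the final implementation):

```bash
# Sketch of the idea above: override torchrun's rendezvous defaults only when
# the user hasn't supplied an endpoint themselves. torchrun defaults to
# --rdzv-backend=static, so c10d has to be set explicitly; port 0 in the
# endpoint asks the OS for any free port.
torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:0 \
    --nproc-per-node=2 my_recipe.py  # my_recipe.py is a placeholder
```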
OK @RdoubleA, let me know if the latest updates look reasonable to you. I realized there is a `--standalone` torchrun flag that will do the same thing as `--rdzv-backend=c10d --rdzv-endpoint=localhost:0` (see the description in this commit message). So I will enable this by default, but only if `--rdzv-endpoint` is not passed. That way we still give users the ability to set the endpoint explicitly, and we don't have to muck around with any other defaults like `--rdzv-backend` ourselves.
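Concretely, the resulting behavior can be sketched as follows (the recipe script name is a placeholder, and this is a simplification of the actual launcher logic):

```bash
# Default: the user passed no --rdzv-endpoint, so the launcher adds
# --standalone and torchrun picks a free port on localhost (equivalent to
# --rdzv-backend=c10d --rdzv-endpoint=localhost:0).
torchrun --standalone --nproc-per-node=2 my_recipe.py

# Explicit endpoint: the user-specified value is respected and --standalone
# is not added.
torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:29500 \
    --nproc-per-node=2 my_recipe.py
```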
Previously it was not possible to launch more than one distributed training job on the same node at the same time, as torchrun will try to use the same port for both of them by default. It's possible to manually pass the `--rdzv-backend` and `--rdzv-endpoint` flags to torchrun any time you kick off a second run, but this is annoying (and not obvious). Instead we can just default to letting torchrun find a free port automatically.

Test plan:
Run both of the following commands on the same node at the same time.
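The exact commands aren't reproduced here; a plausible reconstruction (the recipe and config names are illustrative, not taken from the PR) would be two simultaneous multi-GPU launches such as:

```bash
# Terminal 1 (recipe/config names are illustrative):
tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full

# Terminal 2, started while the first job is still running:
tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full
```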
Before this PR, the second job will fail with an error message like:
After this PR, both jobs are able to train successfully: