microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
33.66k stars 3.95k forks

[REQUEST] Launcher mode with SSH bypass #5510

Open dogacancolak-kensho opened 1 month ago

dogacancolak-kensho commented 1 month ago

Is your feature request related to a problem? Please describe.

As previously mentioned in https://github.com/microsoft/DeepSpeed/issues/2679, the existing launching mechanism requires password-less SSH. We would prefer to avoid this at Kensho Technologies, as our current multi-node training framework uses a launching mechanism similar to torchrun.

Instead of a launcher node ssh-ing the command to the workers, torchrun works by giving each worker a master address/port and its own node rank. By bypassing SSH and using deepspeed directly like torchrun, we can seamlessly integrate DeepSpeed into our existing setup, instead of having two different launching topologies.
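To illustrate the torchrun-style topology described above, here is a minimal sketch of how each node could derive the per-process rendezvous settings from a master address/port and its own node rank. The helper name `worker_env` and its parameters are hypothetical; the environment variable names are the ones read by torch.distributed's `env://` initialization, plus `LOCAL_RANK` as set by torchrun.

```python
def worker_env(master_addr, master_port, node_rank, nnodes, nproc_per_node, local_rank):
    """Build the rendezvous environment for one process on one node.

    Hypothetical helper for illustration only; not part of DeepSpeed or PyTorch.
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError(f"local_rank must be in [0, {nproc_per_node}), got {local_rank}")
    return {
        # Every worker points at the same rendezvous endpoint.
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        # Total number of processes across all nodes.
        "WORLD_SIZE": str(nnodes * nproc_per_node),
        # Global rank: processes are numbered node by node.
        "RANK": str(node_rank * nproc_per_node + local_rank),
        # Index of this process within its own node.
        "LOCAL_RANK": str(local_rank),
    }
```

Because every node computes its own settings from the shared master address and its node rank, no node needs SSH access to any other node; the same command is simply started on each worker.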

Describe the solution you'd like

In a private fork of DeepSpeed, we were able to get training working without using SSH. To do this, we added a flag to the launcher-runner called --no_ssh, which also requires a --node_rank flag to be provided.

Then, in the runner, the command is run as if multi_node_exec is disabled. We have verified that this method works.
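The idea can be sketched roughly as follows. This is a standalone illustration using argparse, not the code from our fork or from DeepSpeed's runner: the functions `build_parser`, `parse_launch_args`, and `multi_node_exec_enabled` are hypothetical names, and only the `--no_ssh` and `--node_rank` flags come from the description above.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for a launcher's argument parser.
    parser = argparse.ArgumentParser(description="Launcher sketch with SSH bypass")
    parser.add_argument("--no_ssh", action="store_true",
                        help="Run the command locally on this node instead of fanning out over SSH.")
    parser.add_argument("--node_rank", type=int, default=-1,
                        help="Rank of this node; required when --no_ssh is set.")
    return parser


def parse_launch_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # --no_ssh only makes sense if each node is told which rank it holds.
    if args.no_ssh and args.node_rank < 0:
        parser.error("--no_ssh requires --node_rank to be provided")
    return args


def multi_node_exec_enabled(args, num_nodes):
    # With --no_ssh, behave as a single-node launch even when several nodes
    # take part: each node starts only its own local processes.
    return num_nodes > 1 and not args.no_ssh
```

With this shape, each worker would start the launcher itself with `--no_ssh --node_rank <rank>`, and the SSH fan-out path is never taken.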

Describe alternatives you've considered

As mentioned, we considered setting up two topologies based on the framework used. For example, GPT-NeoX uses the deepspeed launcher, so we would need the SSH setup. However, MosaicML's llm-foundry works by independently running the command on each worker (similar to torchrun). We didn't want to create two architectures depending on which framework was being used for training.

Additional context

If deemed useful by the project maintainers, we can make a PR, with S&P Global/Kensho Technologies as the contributing entity.

tjruwase commented 1 month ago

@dogacancolak-kensho, thanks for offering a PR for this useful enhancement. Please submit the PR at your convenience. Thanks!

dogacancolak-kensho commented 1 month ago

Do I need to be given permissions? I'm trying to push my local branch dogacancolak/no-ssh-launcher:

$ git push
ERROR: Permission to microsoft/DeepSpeed.git denied to dogacancolak-kensho.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

dogacancolak-kensho commented 1 week ago

Hello, could I get an update on this please? I think I need to be given permissions.

tjruwase commented 1 week ago

@dogacancolak-kensho, you need to create a PR to be reviewed in order to merge your changes. Contributors cannot push directly to the main branch, as is standard practice.

dogacancolak-kensho commented 1 week ago

Thank you for the quick response. I'm getting this error when I try to push a new branch, not when pushing directly to the main branch. It seems I don't have access to create a PR in the first place.