microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] unable to use a hostfile with a name that is not "hostfile" #5221

Open siddharth9820 opened 6 months ago

siddharth9820 commented 6 months ago

Describe the bug I am trying to launch multiple Megatron-DeepSpeed jobs on a Slurm-based cluster. For each job, I want to create a different hostfile called hostfile_${SLURM_JOBID}. However, when I launched with deepspeed --hostfile=hostfile_${SLURM_JOBID}, the logs showed that DeepSpeed did not detect the hostfile and erroneously set the world size to 4 (the number of GPUs on a single node). Interestingly, creating a hostfile with the name 'hostfile' does not lead to this error.
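For reference, a per-job hostfile of the kind described here could be generated roughly as follows. This is a minimal sketch, not the reporter's actual script: `render_hostfile` and `write_job_hostfile` are hypothetical helpers, the node names and fallback job ID are placeholders, and it assumes DeepSpeed's `<hostname> slots=<n>` hostfile format with the 4 GPUs per node mentioned in the report.

```python
import os


def render_hostfile(hosts, slots_per_node=4):
    """Return hostfile text with one '<hostname> slots=<n>' line per node."""
    return "".join(f"{host} slots={slots_per_node}\n" for host in hosts)


def write_job_hostfile(hosts, job_id, directory=".", slots_per_node=4):
    """Write hostfile_<job_id> into `directory` and return its absolute path."""
    path = os.path.abspath(os.path.join(directory, f"hostfile_{job_id}"))
    with open(path, "w") as f:
        f.write(render_hostfile(hosts, slots_per_node))
    return path


if __name__ == "__main__":
    # Placeholder values: in a real job the node list would come from Slurm.
    job_id = os.environ.get("SLURM_JOBID", "12345")
    print(write_job_hostfile(["node01", "node02"], job_id))
```

Returning the absolute path makes it possible to pass that exact path to `--hostfile`, so the launch does not depend on the working directory of the job script.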

jeffra commented 6 months ago

Hi @siddharth9820, this is interesting. Can you double-check that the hostfiles with the Slurm job IDs are being created and are accessible at the path you are launching from?

https://github.com/microsoft/DeepSpeed/blob/bcc617a0009dd27b4e144de59979bd7770eaf57c/deepspeed/launcher/runner.py#L201

This is the check that needs to pass to pull in a hostfile on the launching node.
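One way to test this from the launching node before calling deepspeed is sketched below. It assumes the linked check amounts to a file-existence test on the path given to `--hostfile`; `diagnose_hostfile` is a hypothetical helper for illustration, not part of DeepSpeed.

```python
import os


def diagnose_hostfile(path):
    """Return (ok, message) describing whether `path` looks like a usable hostfile."""
    # expandvars leaves undefined variables such as ${SLURM_JOBID} untouched,
    # so a leftover '$' suggests the variable was not set in this environment.
    # (A path that legitimately contains '$' would also be flagged.)
    resolved = os.path.abspath(os.path.expandvars(path))
    if "$" in resolved:
        return False, f"unexpanded variable in path: {resolved}"
    if not os.path.isfile(resolved):
        return False, f"no file at {resolved} (cwd is {os.getcwd()})"
    if not os.access(resolved, os.R_OK):
        return False, f"file exists but is not readable: {resolved}"
    if os.path.getsize(resolved) == 0:
        return False, f"file is empty: {resolved}"
    return True, f"found hostfile at {resolved}"


if __name__ == "__main__":
    ok, message = diagnose_hostfile("hostfile_${SLURM_JOBID}")
    print(ok, message)
```

Running this in the same shell and directory as the launch command shows whether the job ID variable expanded and whether the resulting relative path resolves to an existing, non-empty file.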

loadams commented 4 months ago

Hi @siddharth9820 - what is the status of this issue? Were you able to check whether the hostfiles were created correctly?

siddharth9820 commented 4 months ago

Hi @loadams, sorry, I didn't have the bandwidth to investigate this issue further. I just chugged along with creating hostfiles named "hostfile" and running one job at a time.

loadams commented 4 months ago

Thanks @siddharth9820 - please do let us know if you have time to investigate later.