Open siddharth9820 opened 6 months ago
Hi @siddharth9820, this is interesting. Can you double check the hostfiles with the slurm ids are being created and accessible at the path you are launching from?
This is the check that needs to pass to pull in a hostfile on the launching node.
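For anyone wanting to test this on the launching node, here is a minimal sketch. It assumes the check boils down to a plain file-existence test on the path given to `--hostfile`, which is an assumption and not confirmed against the DeepSpeed source here. `describe_hostfile` is a hypothetical helper, not part of DeepSpeed.

```python
import os


def describe_hostfile(path):
    """Report how a hostfile path resolves from the current working directory.

    Hypothetical diagnostic helper; it only mirrors an assumed
    "is this a regular file?" test and does not call DeepSpeed.
    """
    resolved = os.path.abspath(path)  # relative paths resolve against the cwd
    exists = os.path.isfile(resolved)
    entries = []
    if exists:
        with open(resolved) as f:
            # Keep non-empty lines, e.g. "node001 slots=4"
            entries = [line.strip() for line in f if line.strip()]
    return {"resolved": resolved, "exists": exists, "entries": entries}


if __name__ == "__main__":
    job_id = os.environ.get("SLURM_JOBID", "unknown")
    print(describe_hostfile(f"hostfile_{job_id}"))
```

Running this from the same directory and shell environment as the `deepspeed` command should show whether the per-job file name resolves to an existing file.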
Hi @siddharth9820 - what is the status of this issue? Were you able to check whether the hostfiles were created correctly?
Hi @loadams, sorry I didn't have the bandwidth to investigate this issue further. I just chugged along with creating hostfiles named "hostfile" and running one job at a time.
Thanks @siddharth9820 - please do let us know if you have time to investigate later.
Describe the bug I am trying to launch multiple Megatron-DeepSpeed jobs on a Slurm-based cluster. For each job, I want to create a separate hostfile called hostfile_${SLURM_JOBID}. However, when I launched with deepspeed --hostfile=hostfile_${SLURM_JOBID}, the logs showed that deepspeed failed to detect the hostfile and erroneously set the world size to 4 (the number of GPUs on a single node). Interestingly, creating a hostfile named 'hostfile' does not lead to this error.
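For reference, a minimal sketch of how the per-job hostfile described above might be written. The `<hostname> slots=<n>` line format is the one DeepSpeed documents for hostfiles. The helper name `write_hostfile`, the node names, and the job ID are hypothetical. Returning an absolute path is a design choice that takes the working directory out of the picture when the path is later passed to `--hostfile`.

```python
import os
import tempfile


def write_hostfile(nodes, slots_per_node, job_id, directory="."):
    """Write hostfile_<job_id> with one "<host> slots=<n>" line per node.

    Hypothetical helper for illustration; returns the absolute path.
    """
    path = os.path.abspath(os.path.join(directory, f"hostfile_{job_id}"))
    with open(path, "w") as f:
        for node in nodes:
            f.write(f"{node} slots={slots_per_node}\n")
    return path


if __name__ == "__main__":
    # Hypothetical node names and job ID; in a real job the node list might
    # come from `scontrol show hostnames "$SLURM_JOB_NODELIST"`.
    with tempfile.TemporaryDirectory() as tmp:
        hostfile = write_hostfile(["node001", "node002"], 4, "12345", tmp)
        with open(hostfile) as f:
            print(f.read(), end="")
```

The returned path could then be passed as `deepspeed --hostfile=<path> ...`. Whether that avoids the behaviour reported here has not been verified.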
System info (please complete the following information):
Launcher context
Docker context N/A