Open sarathkondeti opened 9 months ago
Hi @sarathkondeti, thanks for reporting this bug. I will try to replicate this on a local system. Just to clarify, are you able to run the MII deployment if you add --force_multi to the ds_launch_str?
Yes, I rebuilt MII after adding that parameter, and it works for me now.
Thank you. I will work on a fix for this and it will be in the next MII release.
+1, same bug. I set up 12 V100 nodes.
@sarathkondeti and @AskMrYogi, after taking a closer look at the DeepSpeed launcher code (https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/launcher/runner.py#L489), I think the reason you must add --force_multi is that your hostfile is not being parsed correctly. Can you share what your hostfile looks like (or tell me if you are not providing one to MII)? Thank you!
My hostfile looks like this:
hostname1 slots=2
hostname2 slots=2
Ok that looks correct, so there must be a bug somewhere. I will try to replicate on my side and find a fix. Thank you!
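For anyone who wants to rule out a malformed hostfile before digging further, here is a minimal sketch of a sanity check for the `hostname slots=N` format shown above. It is not DeepSpeed's own parser; `parse_hostfile` is a hypothetical helper, and treating `#` as a comment marker is an assumption.

```python
import re

# Hypothetical helper for checking a hostfile by hand; NOT DeepSpeed's parser.
# Each non-empty line is expected to look like: "<hostname> slots=<N>"
_HOST_LINE = re.compile(r"^(\S+)\s+slots=(\d+)$")


def parse_hostfile(text):
    """Return a dict mapping hostname -> slot count, in file order."""
    hosts = {}
    for raw in text.splitlines():
        # Assumption: anything after '#' is a comment; blank lines are skipped.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = _HOST_LINE.match(line)
        if match is None:
            raise ValueError(f"malformed hostfile line: {raw!r}")
        host, slots = match.group(1), int(match.group(2))
        if host in hosts:
            raise ValueError(f"duplicate host: {host!r}")
        hosts[host] = slots
    return hosts


if __name__ == "__main__":
    sample = "hostname1 slots=2\nhostname2 slots=2\n"
    print(parse_hostfile(sample))
```

A hostfile with both hosts on a single line is rejected by this check, which makes it easy to spot a file that was flattened by copy and paste.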
Maybe because of this:
#worker_str = f"-H {hostfile} "
worker_str = ""
in server.py, lines 175-176
Use worker_str = f"-H {hostfile} " instead of worker_str = ""
I'm trying to figure out why my script doesn't work without the "--force_multi" param in ds_launch_str: https://github.com/microsoft/DeepSpeed-MII/blob/0fe4eb86b93e8210736f3e8c671bc886af64fd67/mii/server.py#L116
Expected parallelism: 4 replicas and TP 1
Hardware: 2 machines with 2x Tesla T4s each
hostfile:
localhost slots=2
strange slots=2
my script:
I've also tried running a simple deepspeed script (TP 1) on a remote machine ('strange') using the hostfile and run commands below.
Hostfile:
strange slots=1
Run command (doesn't work): deepspeed --hostfile ~/vicuna_parallelism/hostfile tp.py
Run command (works): deepspeed --force_multi --hostfile ~/vicuna_parallelism/hostfile tp.py
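To compare the two invocations side by side, a small sketch like the one below can build both command lines from the same inputs. `build_launch_cmd` is a hypothetical helper written for this thread; it only uses the `--force_multi` and `--hostfile` flags already quoted above and does not run anything itself.

```python
def build_launch_cmd(script, hostfile, force_multi=False):
    """Build the deepspeed command as an argument list (hypothetical helper)."""
    cmd = ["deepspeed"]
    if force_multi:
        # Flag from this thread: the run only worked when it was present.
        cmd.append("--force_multi")
    cmd += ["--hostfile", hostfile, script]
    return cmd


if __name__ == "__main__":
    hostfile = "~/vicuna_parallelism/hostfile"
    for force in (False, True):
        # Printed for copy and paste into a shell, which expands the '~'.
        print(" ".join(build_launch_cmd("tp.py", hostfile, force_multi=force)))
```

Passing the list to `subprocess.run` is possible too, but then the `~` in the hostfile path would need `os.path.expanduser`, since no shell is involved to expand it.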