Multi node or remote machine inference doesn't work without "--force_multi" parameter

sarathkondeti commented 9 months ago

I'm trying to figure out why my script doesn't work without the "--force_multi" parameter in ds_launch_str: https://github.com/microsoft/DeepSpeed-MII/blob/0fe4eb86b93e8210736f3e8c671bc886af64fd67/mii/server.py#L116

Expected parallelism: replicas 4 and TP 1

Hardware: 2 machines with 2x Tesla T4s each

hostfile:

localhost slots=2
strange slots=2

my script:

import mii

mii_configs = {
    "tensor_parallel": 1,
    "dtype": "fp16",
    "replace_with_kernel_inject": False,
    "load_with_sys_mem": True,
    "replica_num": 4,
    "hostfile": "/home/anurag_dutt/vicuna_parallelism/hostfile",
}
deployment = "vicuna_deployment"
mii.deploy(
    task="text-generation",
    model="lmsys/vicuna-7b-v1.5",
    model_path="/home/anurag_dutt/.cache/huggingface/hub",
    deployment_name=deployment,
    deployment_type=mii.DeploymentType.LOCAL,
    mii_config=mii_configs,
)

I've also tried running a simple DeepSpeed script (TP 1) on a remote machine ('strange') using the hostfile and run commands below.

Hostfile:

strange slots=1

run command: (doesn't work) deepspeed --hostfile ~/vicuna_parallelism/hostfile tp.py

run command: (works) deepspeed --force_multi --hostfile ~/vicuna_parallelism/hostfile tp.py
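
One possible reading of the single-host case above, as a minimal sketch: it assumes the launcher only takes its multi-node (remote) path when the hostfile lists more than one host or when --force_multi is passed. The names parse_hostfile and uses_multi_node_launch are hypothetical and are not DeepSpeed functions. This model would account for the one-line `strange slots=1` hostfile, but not by itself for the two-host MII deployment.

```python
import re

# Simplified, illustrative model; this is NOT the actual DeepSpeed launcher code.
HOSTFILE_LINE = re.compile(r"^(\S+)\s+slots=(\d+)$")


def parse_hostfile(text):
    """Parse 'hostname slots=N' lines into a {host: slots} dict (insertion ordered)."""
    resources = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = HOSTFILE_LINE.match(line)
        if match is None:
            raise ValueError(f"bad hostfile line: {line!r}")
        host, slots = match.group(1), int(match.group(2))
        if host in resources:
            raise ValueError(f"duplicate host: {host}")
        resources[host] = slots
    return resources


def uses_multi_node_launch(resources, force_multi=False):
    # Assumption: the remote path is chosen only for more than one host, or when forced.
    return force_multi or len(resources) > 1


single_remote = parse_hostfile("strange slots=1\n")
two_hosts = parse_hostfile("localhost slots=2\nstrange slots=2\n")

print(uses_multi_node_launch(single_remote))                    # one host, not forced
print(uses_multi_node_launch(single_remote, force_multi=True))  # one host, forced
print(uses_multi_node_launch(two_hosts))                        # two hosts
```

Under this assumption, a hostfile naming a single remote machine is treated like a local run unless --force_multi is given, which matches the two run commands above.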

mrwyattii commented 9 months ago

Hi @sarathkondeti thanks for reporting this bug. I will try to replicate this on a local system. Just to clarify, are you able to run the MII deployment if you add --force_multi to the ds_launch_str?

sarathkondeti commented 9 months ago

Yes, I rebuilt MII after adding that parameter, and it works for me now.

mrwyattii commented 9 months ago

Thank you. I will work on a fix for this and it will be in the next MII release.

AskMrYogi commented 9 months ago

+1, same bug. I set up 12 V100 nodes.

mrwyattii commented 9 months ago

@sarathkondeti and @AskMrYogi after taking a closer look at the DeepSpeed launcher code: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/launcher/runner.py#L489

I think the reason you must add --force_multi is that your hostfile is not being parsed correctly. Can you share what your hostfile looks like (or tell me if you are not providing one to MII)? Thank you!

AskMrYogi commented 9 months ago

hostfile looks like this

hostname1 slots=2
hostname2 slots=2

mrwyattii commented 9 months ago

Ok that looks correct, so there must be a bug somewhere. I will try to replicate on my side and find a fix. Thank you!

Anditty commented 8 months ago

Maybe because of this:

# worker_str = f"-H {hostfile} "
worker_str = ""

in server.py, lines 175-176

Use worker_str = f"-H {hostfile} " instead of worker_str = ""
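
To make the effect of that suggestion concrete, here is a minimal sketch of how an empty worker_str changes the assembled launch command. The helper build_launch_cmd is hypothetical and is not MII's actual code; it only assumes that the launch string is built by concatenating optional flags, and it uses -H, the short form of the deepspeed launcher's --hostfile option.

```python
import shlex


def build_launch_cmd(script, hostfile=None, force_multi=False):
    """Hypothetical helper: assemble a deepspeed command line as a single string."""
    parts = ["deepspeed"]
    if force_multi:
        parts.append("--force_multi")
    if hostfile:
        # Corresponds to worker_str = f"-H {hostfile} " in the suggestion above.
        parts += ["-H", hostfile]
    parts.append(script)
    # Quote each piece so paths with spaces survive a shell.
    return " ".join(shlex.quote(p) for p in parts)


hostfile = "/home/anurag_dutt/vicuna_parallelism/hostfile"

# With worker_str = "" the hostfile never reaches the launcher.
print(build_launch_cmd("tp.py"))

# With the -H flag restored, the launcher is told which hostfile to read.
print(build_launch_cmd("tp.py", hostfile=hostfile))
```

If the hostfile flag is dropped from the command, the launcher has no way to see the hosts listed in the file passed to MII, which would be consistent with the deployment only working when the multi-node path is forced.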

microsoft / DeepSpeed-MII

Multi node or remote machine inference doesn't work without "--force_multi" parameter #243