microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

waiting for server to start... #222

Open yunll opened 1 year ago

yunll commented 1 year ago

Hello, I started a deployment on one node with 4 GPUs and set tensor_parallel to 2, but the program is always waiting for the server to start.

[screenshot]

The code is:

[screenshot]

The hostfile is: 127.0.0.1 slots=2

mrwyattii commented 1 year ago

@yunll are you able to see any GPU memory usage (via nvidia-smi)? I am wondering if there is a problem loading the model. Either way, I think we could improve the feedback to the user so it is more descriptive about what the server is doing in the background.

Also, could you try without the gRPC server? Set deployment_type=mii.DeploymentType.NON_PERSISTENT in your call to mii.deploy() and launch with deepspeed --num_gpus 2 your_script.py.
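
A minimal sketch of that suggestion, assuming the legacy MII API used in this thread (the model and deployment names here are placeholders, not taken from the screenshots above):

import mii

# Non-persistent: the model is loaded in this process instead of a background
# gRPC server, so there is no server startup to wait for.
mii.deploy(task="text-generation",
           model="bigscience/bloom-560m",
           deployment_name="bloom_np",
           mii_config={"tensor_parallel": 2, "dtype": "fp16"},
           deployment_type=mii.DeploymentType.NON_PERSISTENT)

# Query in the same process.
generator = mii.mii_query_handle("bloom_np")
print(generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=32))

Saved as your_script.py, this would be launched with deepspeed --num_gpus 2 your_script.py as suggested above.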

infosechoudini commented 1 year ago

I'm facing the same issue with both persistent and non-persistent deployments. The model is not being loaded onto the GPUs. I've tried DeepSpeed as well as ZeRO-2 and ZeRO-3.

model_id = "codellama/CodeLlama-7b-Instruct-hf"
model_path = ".cache/huggingface/hub/"
mii_configs = {"tensor_parallel": 5, 
                "dtype": "fp16", 
                "trust_remote_code": True}

mii.deploy(task="text-generation",
        model=model_id,
        deployment_name="mii",
        model_path=model_path + model_id,
        mii_config=mii_configs,
        enable_deepspeed=True,
        enable_zero=False,
        deployment_type=mii.constants.DeploymentType.NON_PERSISTENT
    )
infosechoudini commented 1 year ago
[2023-09-04 12:23:19,159] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['.local/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
shared memory (/dev/shm) size .... 28.73 GB
mrwyattii commented 1 year ago

> I'm facing the same issue with both persistent and non-persistent deployments. The model is not being loaded onto the GPUs. I've tried DeepSpeed as well as ZeRO-2 and ZeRO-3.

@infosechoudini what behavior are you seeing when you load the model with non-persistent deployment type or just using DeepSpeed? Does a simple script like the following run for you?

import torch
import deepspeed
import os
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

task_name = "text-generation"
model_name = "gpt2"
input_strs = ["DeepSpeed is", "Microsoft is"]

def run():
    pipe = pipeline(task_name, model_name, torch_dtype=torch.float16, device=local_rank)

    # Wrap the HF model with DeepSpeed's kernel-injected, tensor-parallel engine
    pipe.model = deepspeed.init_inference(
        pipe.model,
        replace_with_kernel_inject=True,
        mp_size=world_size,
        dtype=torch.float16,
    )

    output = pipe(input_strs)
    print(output)

if __name__ == "__main__":
    run()

Run with deepspeed script.py

infosechoudini commented 1 year ago

Hey,

DeepSpeed works fine. I just finished training a model with DeepSpeed yesterday. I was messing around with it but couldn't find a solution.

It just hangs on "waiting for server to start", then crashes after it times out.

mrwyattii commented 1 year ago

@infosechoudini I want to determine if there is a bug in MII or a problem in your environment that is causing this hang. I see that you are setting "tensor_parallel": 5. I have seen issues in the past with model sharding when using an odd number of GPUs. Could you try running with 4 GPUs?
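
For illustration, a hedged tweak to the config posted above (only the tensor_parallel value changes; a 7B Llama-family model has 32 attention heads, which would shard evenly across 2, 4, or 8 GPUs but not 5):

import mii

model_id = "codellama/CodeLlama-7b-Instruct-hf"
mii_configs = {"tensor_parallel": 4,   # was 5; use an even divisor of the head count
               "dtype": "fp16",
               "trust_remote_code": True}

mii.deploy(task="text-generation",
           model=model_id,
           deployment_name="mii",
           mii_config=mii_configs,
           enable_deepspeed=True,
           enable_zero=False,
           deployment_type=mii.constants.DeploymentType.NON_PERSISTENT)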

Quang-elec44 commented 1 year ago

Hi @mrwyattii, may I ask how to keep the RESTful server alive? Here is my script:

import mii

mii_configs = {
    "tensor_parallel": 2, 
    "dtype": "fp16",
    "enable_restful_api": True, 
    "restful_api_port": 35215,
    "skip_model_check": True
}
mii.deploy(task="text-generation",
           model="/path/to/my/model",
           deployment_name="MY_DEPLOYMENT",
           mii_config=mii_configs,
           deployment_type=mii.DeploymentType.NON_PERSISTENT
           )

It seems that after I ran deepspeed --num_gpus 2 api.py, the process just exited. The model was loaded onto the GPUs, but the server did not stay alive. Can you help me out?
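
For reference, a minimal sketch of a persistent deployment that keeps the RESTful/gRPC server alive, assuming the legacy MII API used in this thread (the deployment name, model path, and port are placeholders; check your MII version's README for the exact REST endpoint format):

import mii

mii_configs = {
    "tensor_parallel": 2,
    "dtype": "fp16",
    "enable_restful_api": True,
    "restful_api_port": 35215,
    "skip_model_check": True
}

# LOCAL (the default) starts a background gRPC server plus the REST gateway.
# Run this with `python api.py` rather than the deepspeed launcher, since MII
# spawns the model replicas itself.
mii.deploy(task="text-generation",
           model="/path/to/my/model",
           deployment_name="my_deployment",
           mii_config=mii_configs,
           deployment_type=mii.DeploymentType.LOCAL)

# The server persists after this script returns, until it is shut down with:
# mii.terminate("my_deployment")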

moonbucks commented 5 months ago

Hi @mrwyattii, what could be the potential reasons for server.py to keep waiting for the server to start? When I ran the test.py you gave, it seemed to work, so I don't think the model itself is the problem. But when I run server.py, it just waits for the server to come alive. nvidia-smi shows only 448 MB used on each GPU while I try to load the 7B model, so I suspect the model is not being loaded properly. Why is the behavior different between persistent and non-persistent deployment?