microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Mixtral inference OOM #4864

Open ShayDuane opened 8 months ago

ShayDuane commented 8 months ago

Describe the bug I'm not sure whether DeepSpeed needs to be adapted for Mixtral. When I run DeepSpeed inference on Mixtral, it does not apply model parallelism; instead it tries to load the full model parameters on every GPU, which ultimately leads to Out Of Memory (OOM) errors. With Llama 2, the same setup does shard the model correctly. Likewise, when I deploy Mixtral through the MII library, model parallelism works and the parameters are split across the GPUs, but direct DeepSpeed inference fails. Does the Mixtral model require official adaptation, or is there a problem with how I'm using the API? Any guidance would be appreciated.

To Reproduce

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import deepspeed

tokenizer = AutoTokenizer.from_pretrained(
    "/workspace/shuaiqi/Model/Mixtral", use_fast=True, add_prefix_space=True
)

model = AutoModelForCausalLM.from_pretrained(
    "/workspace/shuaiqi/Model/Mixtral", torch_dtype=torch.float16
)
model.eval()

model_engine = deepspeed.init_inference(
    model,
    mp_size=4,                        # number of GPUs
    dtype=torch.float16,              # dtype of the weights (fp16)
    replace_method="auto",            # let DeepSpeed automatically identify the layers to replace
    replace_with_kernel_inject=True,  # replace the model with kernel-injected modules
)
print(model_engine)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model_engine.generate(
    **inputs,
    max_length=128,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.9,
    use_cache=True,
)
output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output)

The launch command is

CUDA_VISIBLE_DEVICES=0,1,3,4 deepspeed --num_nodes=1 --num_gpus=4 example.py
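For reference, recent DeepSpeed releases express the parallelism degree through a tensor_parallel config dict rather than the older mp_size argument. A minimal sketch of that call, under the assumption that the same local model path is used; this alone does not add Mixtral kernel-injection support and is shown only for comparison:

import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Sketch only: tensor_parallel replaces the deprecated mp_size argument.
model = AutoModelForCausalLM.from_pretrained(
    "/workspace/shuaiqi/Model/Mixtral", torch_dtype=torch.float16
)
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 4},   # tensor-parallel degree (number of GPUs)
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # kernel injection is not available for every architecture
)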

Expected behavior DeepSpeed inference should shard the Mixtral parameters across the 4 GPUs (as it does for Llama 2) instead of loading the full model on every GPU and running out of memory.

ds_report output

DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/workspace/shuaiqi/miniconda3/envs/Shay/lib/python3.11/site-packages/torch']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['/workspace/shuaiqi/miniconda3/envs/Shay/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.12.6, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.3
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 1.97 TB


System info:
- transformers 4.36.2
- CUDA 12.3
- PyTorch 2.1.2
- DeepSpeed 0.12.6



mrwyattii commented 8 months ago

@ShayDuane we do not support Mixtral with the old inference interface. Please use DeepSpeed-MII to get support for the Mixtral model.

First, install the latest DeepSpeed and MII:

pip install deepspeed==0.12.6 deepspeed-mii==0.1.3

Then save the following as mixtral.py and launch it with deepspeed --num_gpus 4 mixtral.py:

import mii

pipe = mii.pipeline("/workspace/shuaiqi/Model/Mixtral")
responses = pipe("DeepSpeed is", max_new_tokens=128, return_full_text=True)
if pipe.is_rank_0:
    print(responses[0])
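If a longer-lived server is preferred over the one-shot pipeline, MII also offers a persistent deployment. A minimal sketch, assuming the same local model path and deepspeed-mii 0.1.3:

import mii

# Sketch: persistent MII deployment; tensor_parallel shards Mixtral across 4 GPUs.
client = mii.serve("/workspace/shuaiqi/Model/Mixtral", tensor_parallel=4)

# Query the running server, then shut it down when finished.
response = client.generate("DeepSpeed is", max_new_tokens=128)
print(response)
client.terminate_server()
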
Wanzizhu commented 6 months ago

@mrwyattii, will you add support for Mixtral with the old inference interface in the future? It's quite a popular model.

leachee99 commented 3 months ago

Hi, does DeepSpeed now support Mixtral (or other MoE models) with the old inference interface? I tried to run inference on MoE models (Mixtral and Qwen1.5-MoE-A2.7B) with DeepSpeed across multiple nodes, but it failed. Can anyone help me?