microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

mpt model fails to run on cpu. #4774

Closed KepingYan closed 11 months ago

KepingYan commented 11 months ago

Describe the bug
When I ran the mpt model on the CPU, I encountered the following error (attached as a screenshot in the original issue).

To Reproduce
run-mpt-ds.py

# required imports
import argparse
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

parser = argparse.ArgumentParser('generation script', add_help=False)
parser.add_argument("-m", "--model-id", type=str, default="EleutherAI/gpt-j-6B")
parser.add_argument("-t", "--tokenizer-id", type=str, default="EleutherAI/gpt-j-6B")
parser.add_argument('--device', type=str, default='cpu')
parser.add_argument('--max-new-tokens', default=32, type=int, help="output max new tokens")
parser.add_argument('--jit', action='store_true')
parser.add_argument('--local_rank', default=None, type=int, help="local rank")
parser.add_argument("--batch-size", default=1, type=int, help="batch size")
args = parser.parse_args()

device = torch.device(args.device)
if args.device == 'cpu':
    replace_with_kernel_inject = False
elif args.device == 'xpu':
    replace_with_kernel_inject = False
else:
    replace_with_kernel_inject = True
generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=4)
amp_enabled = True
amp_dtype = torch.bfloat16

# load model
model = AutoModelForCausalLM.from_pretrained(args.model_id, low_cpu_mem_usage=True,
                                             return_dict=not args.jit, torch_dtype=amp_dtype, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_id, trust_remote_code=True)
model = model.eval().to(device)
model = model.to(memory_format=torch.channels_last)

# deepspeed engine
ds_local_rank = int(os.getenv('LOCAL_RANK', '0'))
ds_world_size = int(os.getenv('WORLD_SIZE', '0'))

if ds_world_size == 0:
    ds_world_size = 1
engine = deepspeed.init_inference(model=model, mp_size=ds_world_size, dtype=amp_dtype,
                                  replace_method="auto", replace_with_kernel_inject=replace_with_kernel_inject)
model = engine.module

prompt = "Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun."
input_size = tokenizer(prompt, return_tensors="pt").input_ids.size(dim=1)
# start
prompt = [prompt] * args.batch_size
total_list = []
with torch.inference_mode(), torch.no_grad(), torch.autocast(
    device_type=args.device,
    enabled=amp_enabled,
    dtype=amp_dtype if amp_enabled else None
):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    output = model.generate(input_ids, max_new_tokens=args.max_new_tokens, **generate_kwargs)
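
Not part of the original script, but a minimal way to inspect the generated text (decode the output ids on rank 0), which is useful when judging whether the model's output is correct:

# sketch only: decode and print the generated text for inspection
if ds_local_rank == 0:
    print(tokenizer.batch_decode(output, skip_special_tokens=True))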

run-mpt.sh

#! /usr/bin/env bash
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
export HF_EVALUATE_OFFLINE=1
export CCL_ZE_IPC_EXCHANGE=sockets
# source envs for building extension module deepspeed_ccl_comm
source /opt/intel/oneapi/setvars.sh
SCRIPT_NAME=run-mpt-ds.py

# test passed
deepspeed --num_gpus 2 --bind_cores_to_rank -- $SCRIPT_NAME --device cpu --model-id "facebook/opt-125m" --tokenizer-id "facebook/opt-125m" --max-new-tokens 32 
# mpt error
deepspeed --num_gpus 2 --bind_cores_to_rank -- $SCRIPT_NAME --device cpu --model-id "mosaicml/mpt-7b" --tokenizer-id "EleutherAI/gpt-neox-20b" --max-new-tokens 32

package version

intel-extension-for-pytorch 2.1.0+cpu                pypi_0    pypi
torch                     2.1.0+cpu                pypi_0    pypi
deepspeed                 0.10.2                   pypi_0    pypi
transformers              4.31.0                   pypi_0    pypi
accelerate                0.21.0                   pypi_0    pypi

ds_report output

[2023-12-05 17:56:53,237] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cpu (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented  [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ykp/miniconda3/envs/LLM_release_2/lib/python3.8/site-packages/torch']
torch version .................... 2.1.0+cpu
deepspeed install path ........... ['/home/ykp/miniconda3/envs/LLM_release_2/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.10.2, unknown, unknown
deepspeed wheel compiled w. ...... torch 2.0
shared memory (/dev/shm) size .... 220.00 GB

System info (please complete the following information):

tjruwase commented 11 months ago

@delock, can you please help?

delock commented 11 months ago

> @delock, can you please help?

The direct reason is that kv_n_heads and d_model need to be added to the list of tensor-sharded parameters at https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/module_inject/auto_tp.py#L387 . However, I still see that the result is not correct after this fix, so there are some other issues with MPT, probably due to a change in the remote modeling code. This needs further investigation.
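
For illustration only, a rough sketch of the kind of change being described, assuming the sharding list lives in an update_mp_params-style helper (the actual attribute list and code in auto_tp.py may differ):

# Hypothetical sketch, not the actual DeepSpeed patch: extend the set of
# per-module attributes that auto tensor parallelism divides by the
# model-parallel size so MPT's kv_n_heads / d_model are sharded as well.
MP_PARAMS = [
    "num_attention_heads", "num_heads", "n_heads", "hidden_size", "embed_dim",
    "kv_n_heads",  # added for MPT attention
    "d_model",     # added for MPT hidden size
]

def update_mp_params(child, mp_size):
    for name in MP_PARAMS:
        if hasattr(child, name):
            setattr(child, name, getattr(child, name) // mp_size)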

delock commented 11 months ago

Thanks @sywangyi !

@KepingYan can you verify whether https://github.com/microsoft/DeepSpeed/pull/4787 fixes the issue? Thanks!