Open allanj opened 4 months ago
It would be nice if someone could tell me which version will work.
Hi @allanj, I don't think we have kernel injection support for llama-2 models. If you remove the `--use_kernel` flag, does the script work?
Additionally, what kind of GPUs are you using? You may be able to utilize DeepSpeed-MII to run the llama-2 model and get significant improvements to inference performance if you have GPUs with compute capability >=8.0:
```python
import mii

client = mii.serve("meta-llama/Llama-2-7b-hf", tensor_parallel=8)
response = client.generate(["test prompt"], max_new_tokens=128)
```
Yes, removing the `--use_kernel` flag makes it work.
Yeah, I'm aware of DeepSpeed FastGen. I'm wondering how it handles batching, or whether I should simply loop over my prompts myself.
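For what it's worth, the "for loop" fallback amounts to chunking the prompt list and submitting one chunk at a time. A minimal sketch of that chunking (plain Python, independent of any serving API; the `client.generate`-style call it feeds is an assumption):

```python
# Sketch: split a list of prompts into fixed-size batches, so each batch
# can be passed to a generate-style call (hypothetical API) in a loop.
def batched(prompts, batch_size):
    """Yield successive batch_size-sized chunks of prompts."""
    for i in range(0, len(prompts), batch_size):
        yield prompts[i:i + batch_size]

prompts = [f"prompt {i}" for i in range(10)]
batches = list(batched(prompts, 4))
# 10 prompts with batch_size=4 -> 3 batches (sizes 4, 4, 2)
```

Whether an explicit loop is even needed depends on the serving layer; a server that does continuous batching can accept requests individually and batch them internally.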
Version
deepspeed: 0.13.4
transformers: 4.38.1
Python: 3.10
PyTorch: 2.1.2+cu121
CUDA: 12.1

Error in Example (To reproduce)
Simply run this script: https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py
It will show the following error:
Potential bug?
I suspect it did not find the right inference engine: it should be `DeepSpeedLlamaInference`, but not `DeepSpeedGPTInference`?
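To illustrate the suspected failure mode (this is a toy sketch, not DeepSpeed's actual internals; all names here are made up): if engine selection is a lookup keyed on model architecture with a GPT-style default, a missing llama entry would silently produce the wrong engine rather than an error.

```python
# Illustrative only: a registry mapping architecture -> inference engine,
# with a GPT engine as the fallback. A missing "llama" entry reproduces
# the observed mismatch (GPT engine chosen instead of the llama one).
ENGINES = {
    "gpt2": "DeepSpeedGPTInference",
    "gpt_neox": "DeepSpeedGPTInference",
    # no "llama" entry registered here
}

def pick_engine(arch, default="DeepSpeedGPTInference"):
    """Return the registered engine for arch, or the default fallback."""
    return ENGINES.get(arch, default)

print(pick_engine("llama"))  # falls back to DeepSpeedGPTInference
```

If this is roughly what happens, registering the llama container (or failing loudly on unsupported architectures when `--use_kernel` is set) would fix it.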