microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[Question] codellama 34b inference request. #4261

Open nhsjgczryf opened 10 months ago

nhsjgczryf commented 10 months ago

**Is your feature request related to a problem? Please describe.**
N/A

**Describe the solution you'd like**
N/A

**Describe alternatives you've considered**
N/A

**Additional context**
N/A

At present, the DeepSpeed inference engine does not support Code Llama 34B. Are there any plans to add support for it in the future?

awan-10 commented 9 months ago

@nhsjgczryf -- did you try running CodeLlama from HuggingFace with DS inference?

Based on this model type here: https://huggingface.co/codellama/CodeLlama-34b-hf/blob/main/config.json#L12, I think that it should just work.

Please try and let us know.
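A quick way to check the `model_type` field referenced above (the model name is from this thread; the output reflects the published config.json):

```python
# Inspect the model_type that DeepSpeed's injection policy keys on.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("codellama/CodeLlama-34b-hf")
print(config.model_type)  # -> "llama", same as plain Llama models
```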

nhsjgczryf commented 9 months ago

> @nhsjgczryf -- did you try running CodeLlama from HuggingFace with DS inference?
>
> Based on this model type here: https://huggingface.co/codellama/CodeLlama-34b-hf/blob/main/config.json#L12, I think that it should just work.
>
> Please try and let us know.

If I set replace_with_kernel_inject=False and mp_size=1, it works well.
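In outline, the working call looks like this (a minimal sketch assuming the standard HuggingFace loading path; the model name and bf16 dtype are assumptions taken from the config dump in the error below, not from the reporter's actual script):

```python
# Hypothetical reconstruction of the working setup: no kernel injection,
# single tensor-parallel rank.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-34b-hf"  # assumed; not shown in the thread
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# replace_with_kernel_inject=False keeps the original HuggingFace modules,
# which is why this path works while kernel injection fails below.
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.bfloat16,
    replace_with_kernel_inject=False,
)

inputs = tokenizer("def fib(n):", return_tensors="pt").to(engine.module.device)
output = engine.module.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0]))
```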

However, if I set replace_with_kernel_inject=True, I get this error:

```
[2023-09-22 10:36:19,393] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 8192, 'intermediate_size': 22016, 'heads': 64, 'num_hidden_layers': -1, 'dtype': torch.bfloat16, 'pre_layer_norm': True, 'norm_type': <NormType.RMSNorm: 3>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': 128, 'rotate_half': True, 'rotate_every_two': False, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GATED_SILU: 4>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False, 'num_kv': -1}
Traceback (most recent call last):
  File "/data/wangnan/code-pretrain/multi-task-pretrain/inference_ds.py", line 335, in <module>
    model = deepspeed.init_inference(
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/inference/engine.py", line 151, in __init__
    self._apply_injection_policy(config)
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/inference/engine.py", line 391, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/replace_module.py", line 313, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/replace_module.py", line 556, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/replace_module.py", line 624, in _replace_module
    _, layer_id = _replace_module(child,
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/replace_module.py", line 624, in _replace_module
    _, layer_id = _replace_module(child,
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/replace_module.py", line 600, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/replace_module.py", line 289, in replace_fn
    new_module = replace_with_policy(child,
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/replace_module.py", line 252, in replace_with_policy
    _container.apply_tensor_parallelism(mp_replace)
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/containers/features/meta_tensor.py", line 36, in apply_tensor_parallelism
    super().apply_tensor_parallelism(mp_replace, **kwargs)
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/containers/features/hybrid_engine.py", line 89, in apply_tensor_parallelism
    self.attention_qkv_mp(mp_replace, reversed_dim=reversed_dim)
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/containers/features/split_qkv.py", line 49, in attention_qkv_mp
    super().attention_qkv_mp(mp_replace)
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/containers/base.py", line 240, in attention_qkv_mp
    self.module.attention.attn_qkvw = mp_replace.strided_copy(self.module.attention.attn_qkvw,
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/auto_tp.py", line 64, in strided_copy
    self.merge_assert(src_shape[outer_dim], dst_shape[self.out_dim])
  File "/data/wangnan/miniconda3/lib/python3.11/site-packages/deepspeed/module_inject/auto_tp.py", line 31, in merge_assert
    assert dim1 > dim2, \
AssertionError: Merging tensors is not allowed here! Please use deepspeed load_checkpoint for merging your checkpoints before replacing the transformer layer with inference-kernels
```
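For context, the failing check looks roughly like this (a paraphrase reconstructed from the traceback above, not copied from the DeepSpeed source):

```python
# Reconstructed sketch of merge_assert in deepspeed/module_inject/auto_tp.py:
# strided_copy() passes in the source and destination sizes of the dimension
# being split across tensor-parallel ranks, and requires the source to be
# strictly larger; anything else is treated as an (unsupported) merge.
def merge_assert(self, dim1, dim2):
    assert dim1 > dim2, \
        "Merging tensors is not allowed here! Please use deepspeed " \
        "load_checkpoint for merging your checkpoints before replacing the " \
        "transformer layer with inference-kernels"
```

One plausible reading: with mp_size=1 the source and destination sizes of the fused QKV weight coincide, so the strict `>` can never hold, which is consistent with the assertion firing here.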

If I set mp_size=8, the generation process gets stuck.

SevenFo commented 7 months ago

Hi, have you solved this issue? I had the same problem with the CodeLlama series models. I use this example script to run the model. It works well when use_meta_tensor is not set; otherwise it throws the same error.
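For anyone reproducing this, a hedged sketch of the meta-tensor path that triggers the error (the checkpoint descriptor path is a placeholder, and the model name and dtype are assumptions carried over from the thread):

```python
# Sketch of meta-tensor loading: the model is built on the meta device with
# no weights materialized, and DeepSpeed loads the real weights from a
# checkpoint descriptor during kernel injection.
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("codellama/CodeLlama-34b-hf")

with deepspeed.OnDevice(dtype=torch.bfloat16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16)

engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.bfloat16,
    replace_with_kernel_inject=True,  # meta tensors require kernel injection
    checkpoint="checkpoints.json",    # placeholder checkpoint descriptor
)
```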