microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
35.3k stars 4.09k forks

[BUG] LLaMA Invalid Output When Multi-GPUs or Multi-Sequences (0.9.3) #3681

Open 78 opened 1 year ago

78 commented 1 year ago

**Describe the bug**

DeepSpeed (0.9.3) inference works fine with a single GPU (Tesla A30 24G), but gives invalid output with multiple GPUs (`--num_gpus 2`). Test model: OpenBuddy 7B (LLaMA-based).

**To Reproduce**

  1. I used the script from the tutorial.
# Filename: gpt-neo-2.7b-generation.py
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='/data/openbuddy-7b-v1.4',
                     device=local_rank, torch_dtype=torch.bfloat16)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", max_new_tokens=20)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
  2. Run `deepspeed --num_gpus 2 test.py`. I got the output:
# deepspeed --num_gpus 2 test.py
Setting ds_accelerator to cuda (auto detect)
[2023-06-05 20:49:00,851] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-05 20:49:00,870] [INFO] [runner.py:555:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None test.py
Setting ds_accelerator to cuda (auto detect)
[2023-06-05 20:49:02,382] [INFO] [launch.py:138:main] 0 NCCL_IB_TIMEOUT=23
[2023-06-05 20:49:02,382] [INFO] [launch.py:138:main] 0 NCCL_IB_RETRY_CNT=7
[2023-06-05 20:49:02,382] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-06-05 20:49:02,382] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-06-05 20:49:02,382] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-06-05 20:49:02,382] [INFO] [launch.py:163:main] dist_world_size=2
[2023-06-05 20:49:02,382] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Loading checkpoint shards: 100%|████████████████████████| 2/2 [00:06<00:00,  3.48s/it]
Loading checkpoint shards: 100%|████████████████████████| 2/2 [00:06<00:00,  3.49s/it]
[2023-06-05 20:50:11,784] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.3, git-hash=unknown, git-branch=unknown
[2023-06-05 20:50:11,785] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-06-05 20:50:11,785] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-06-05 20:50:11,802] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-05 20:50:11,802] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-06-05 20:50:11,867] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.3, git-hash=unknown, git-branch=unknown
[2023-06-05 20:50:11,868] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-06-05 20:50:11,869] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-06-05 20:50:11,885] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-05 20:50:11,885] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-06-05 20:50:11,885] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using /root/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu118/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.10642623901367188 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.11112475395202637 seconds
[2023-06-05 20:50:13,246] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 4096, 'intermediate_size': 11008, 'heads': 32, 'num_hidden_layers': -1, 'dtype': torch.float16, 'pre_layer_norm': True, 'norm_type': <NormType.RMSNorm: 3>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-06, 'mp_size': 2, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': 128, 'rotate_half': True, 'rotate_every_two': False, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GATED_SILU: 4>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False}
Using /root/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.009526252746582031 seconds
Using /root/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.009597301483154297 seconds
/usr/local/lib/python3.9/site-packages/transformers/generation/utils.py:1255: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
/usr/local/lib/python3.9/site-packages/transformers/generation/utils.py:1255: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
------------------------------------------------------
Free memory : 2.652649 (GigaBytes)
Total memory: 23.542236 (GigaBytes)
Requested memory: 0.718750 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
WorkSpace: 0x7f7852000000
------------------------------------------------------
[{'generated_text': 'DeepSpeed is CE CE CE CE CE CE CE CE CE CE CE CE CE CE CE CE CE CE CE CE'}]
[2023-06-05 20:50:15,468] [INFO] [launch.py:346:main] Process 12805 exits successfully.
[2023-06-05 20:50:15,468] [INFO] [launch.py:346:main] Process 12807 exits successfully.
  3. Run `deepspeed --num_gpus 1 test.py`. It worked as expected.
  4. Changed the script to pass an injection policy manually:
    
    # Filename: gpt-neo-2.7b-generation.py
    import os
    import deepspeed
    import torch
    from transformers import pipeline
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer

    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    world_size = int(os.getenv('WORLD_SIZE', '1'))
    generator = pipeline('text-generation', model='/data/openbuddy-7b-v1.4',
                         torch_dtype=torch.bfloat16, device=local_rank)

    generator.model = deepspeed.init_inference(generator.model,
                                               mp_size=world_size,
                                               injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.down_proj')})

    string = generator("DeepSpeed is", max_new_tokens=20)
    if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
        print(string)

  5. Then run `deepspeed --num_gpus 2 test.py`. It worked too.

**Expected behavior**

[{'generated_text': 'DeepSpeed is the key to helping you achieve your goal as it is a great way to get your vision right.'}]


**ds_report output**

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib64/python3.9/site-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/usr/local/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.3, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8



**System info (please complete the following information):**
 - OS: CentOS Stream 9
 - 2 Tesla A30 24G
 - Deepspeed 0.9.3
 - Transformers 4.29.2
 - Python 3.9.16

**Additional Information**
I have tested another model (https://huggingface.co/distilgpt2), and it worked fine with multiple GPUs.
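
For reference, that check was essentially the repro script above with only the model swapped. Roughly (a sketch; dtype, prompt, and launch command kept the same as in the script above):

```python
# Sketch of the distilgpt2 sanity check: same setup as the repro script,
# with only the model changed. On 2 GPUs with kernel injection this
# produced coherent text, unlike the LLaMA-based model.
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='distilgpt2',
                     device=local_rank, torch_dtype=torch.bfloat16)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", max_new_tokens=20)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
```
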
xiangyuliu commented 1 year ago

I also encountered the same problem, and I have tried many methods but found no effective solution.

spigo900 commented 1 year ago

@xiangyuliu I ran into a similar issue but found a workaround here that solved it for me: setting `replace_with_kernel_inject=False`, which makes inference slower but produces valid output. Maybe this helps.

ETA: Note that I used the latest stable DeepSpeed (0.9.5). I believe a fix was merged sometime after 0.9.2 to allow running LLaMA using AutoTP, and I don't know whether that fix made it into 0.9.3.
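
For anyone hitting this, the workaround is roughly the repro script from the issue with the kernel-injection flag flipped (a sketch only; the model path and generation settings are the reporter's, not something I verified against this exact model):

```python
# Sketch of the workaround: identical to the repro script, but with kernel
# injection disabled so DeepSpeed falls back to AutoTP-style tensor
# parallelism. Slower than the injected kernels, but the output is valid.
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='/data/openbuddy-7b-v1.4',
                     device=local_rank, torch_dtype=torch.bfloat16)

# mp_size is deprecated in favor of tensor_parallel.tp_size (see the log
# above), but it still works in 0.9.x.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           replace_with_kernel_inject=False)

string = generator("DeepSpeed is", max_new_tokens=20)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
```

Launched the same way as before: `deepspeed --num_gpus 2 test.py`.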

KimmiShi commented 1 year ago

I reported the same issue: https://github.com/microsoft/DeepSpeed/issues/3932. But `replace_with_kernel_inject=False` is much slower.