microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] OOM error when able to load using huggingface transformers #3461

Open wj210 opened 1 year ago

wj210 commented 1 year ago

OOM error when loading the flan-t5-xxl model for inference. The model loaded perfectly without DeepSpeed, using just the standard Hugging Face Transformers code, and took approximately 20+ GB. The hardware is 4x RTX A6000 with 45 GB of GPU memory each.
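
For reference, this is roughly the DeepSpeed-free loading path being described (a sketch under the assumption that the standard `device_map="auto"` + fp16 load was used, as in the repro script below; it is not the reporter's exact script):

```python
# Sketch (assumed, not the reporter's exact script) of the DeepSpeed-free load
# described above: Accelerate's device_map="auto" shards the fp16 weights
# across the visible GPUs.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_id = "google/flan-t5-xxl"
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

inputs = tokenizer("test prompt", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```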

To Reproduce: simple script run with the command `deepspeed --num_gpus 4 main.py --name google/flan-t5-xxl --ds_inference --use_kernel --use_meta_tensor --checkpoint_path '/.cache/huggingface/hub/'`

```python
from datasets import load_dataset, concatenate_datasets
from transformers import T5Tokenizer, T5ForConditionalGeneration, pipeline
import numpy as np
import os
import torch
import deepspeed

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
print(local_rank, world_size)

model_id = "google/flan-t5-xxl"
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)

ds_engine = deepspeed.init_inference(model,
                                     mp_size=world_size,
                                     dtype=model.dtype,
                                     replace_method="auto",
                                     replace_with_kernel_inject=True)

test = "test prompt"
max_new_tokens = 100
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device=local_rank)

result = generator(test, do_sample=True, max_new_tokens=max_new_tokens)

if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(result)
```
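
For comparison, here is a minimal, untested sketch that skips `device_map="auto"` so the weights stay on CPU until `deepspeed.init_inference` places them on each rank's GPU, and that uses the `tensor_parallel.tp_size` spelling suggested by the deprecation warnings in the log below. This is an assumption layered on top of the report, not something verified here:

```python
# Untested sketch: load the fp16 weights on CPU (no device_map="auto") and let
# deepspeed.init_inference() place/shard them; tensor_parallel.tp_size replaces
# the deprecated mp_size, as the warnings in the log below indicate.
import os
import torch
import deepspeed
from transformers import T5Tokenizer, T5ForConditionalGeneration

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

model_id = "google/flan-t5-xxl"
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)

ds_engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
model = ds_engine.module  # use this module (now on the local GPU) for generation
```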

Expected behavior: Unclear why the OOM error happens, given that the model loads fine without DeepSpeed.

ds_report output

```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
```

Screenshots / error log:

```
[2023-05-06 00:22:07,115] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-06 00:22:07,162] [INFO] [runner.py:541:main] cmd = /home/weijie/anaconda3/envs/flan/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --name google/flan-t5-xxl --ds_inference --use_kernel --use_meta_tensor --checkpoint_path /.cache/huggingface/hub/
[2023-05-06 00:22:09,143] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-05-06 00:22:09,143] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-05-06 00:22:09,143] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-05-06 00:22:09,143] [INFO] [launch.py:247:main] dist_world_size=4
[2023-05-06 00:22:09,143] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
0 4
3 4
2 4
1 4
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:38<00:00,  7.70s/it]
[2023-05-06 00:22:58,837] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.2, git-hash=unknown, git-branch=unknown
[2023-05-06 00:22:58,838] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-05-06 00:22:58,839] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-05-06 00:22:58,839] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Loading checkpoint shards: 100%|██████████| 5/5 [00:40<00:00,  8.05s/it]
[2023-05-06 00:23:00,771] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.2, git-hash=unknown, git-branch=unknown
[2023-05-06 00:23:00,773] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-05-06 00:23:00,773] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-05-06 00:23:00,773] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Loading checkpoint shards: 100%|██████████| 5/5 [00:40<00:00,  8.06s/it]
[2023-05-06 00:23:01,117] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.2, git-hash=unknown, git-branch=unknown
[2023-05-06 00:23:01,117] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-05-06 00:23:01,118] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-05-06 00:23:01,118] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Loading checkpoint shards: 100%|██████████| 5/5 [00:38<00:00,  7.80s/it]
[2023-05-06 00:23:02,041] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.2, git-hash=unknown, git-branch=unknown
[2023-05-06 00:23:02,042] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-05-06 00:23:02,042] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-05-06 00:23:02,042] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-05-06 00:23:02,071] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/home/weijie/flan/main.py", line 16, in <module>
    ds_engine = deepspeed.init_inference(model,
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed/__init__.py", line 333, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 207, in __init__
    self.module.to(device)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1749, in to
    return super().to(*args, **kwargs)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 44.55 GiB total capacity; 8.72 GiB already allocated; 34.56 MiB free; 8.79 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/weijie/flan/main.py", line 16, in <module>
    ds_engine = deepspeed.init_inference(model,
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed/__init__.py", line 333, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 211, in __init__
    dist.broadcast(_rng_state, 0)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
    return func(*args, **kwargs)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 217, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 118, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: [3] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe
[2023-05-06 00:23:07,200] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 3418330
[2023-05-06 00:23:07,200] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 3418331
Traceback (most recent call last):
  File "/home/weijie/flan/main.py", line 16, in <module>
    ds_engine = deepspeed.init_inference(model,
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed/__init__.py", line 333, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 211, in __init__
    dist.broadcast(_rng_state, 0)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
    return func(*args, **kwargs)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 217, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 118, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: [2] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe
[2023-05-06 00:23:08,150] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 3418332
[2023-05-06 00:23:08,979] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 3418333
[2023-05-06 00:23:08,993] [ERROR] [launch.py:434:sigkill_handler] ['/home/weijie/anaconda3/envs/flan/bin/python', '-u', 'main.py', '--local_rank=3', '--name', 'google/flan-t5-xxl', '--ds_inference', '--use_kernel', '--use_meta_tensor', '--checkpoint_path', '/.cache/huggingface/hub/'] exits with return code = 1
```

AbhayGoyal commented 1 year ago

I have been facing the same issue

pineking commented 1 year ago

the same issue

JH-Xie commented 1 year ago

The same issue. I tested "google/t5-v1_1-small" following the official demo, but noticed that GPU memory usage does not decrease when adding more GPUs for inference. Code is below:

```python
# ---------------------------------------
# New automatic tensor parallelism method
# ---------------------------------------
import os
import torch
import transformers
import deepspeed

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

# create the model pipeline
pipe = transformers.pipeline(task="text2text-generation", model="google/t5-v1_1-small", device=local_rank)

# Initialize the DeepSpeed-Inference engine
pipe.model = deepspeed.init_inference(
    pipe.model,
    mp_size=world_size,
    dtype=torch.float
)

output = pipe('Input String')
```
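
(For reference, a script like this is normally started through the DeepSpeed launcher, as in the original report, e.g. `deepspeed --num_gpus 4 script.py` with a placeholder filename, so that `WORLD_SIZE` and `LOCAL_RANK` are set for each rank; run directly with plain `python`, `WORLD_SIZE` defaults to 1 and no tensor parallelism is applied.)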
CoinCheung commented 1 year ago

Same problem