Quang-elec44 commented 1 year ago

**Describe the bug**
Inference with deepspeed.init_inference() is unreliable: with num_gpus=2 it only sometimes succeeds, and with num_gpus>2 it always fails. I followed the tutorial at https://www.deepspeed.ai/tutorials/inference-tutorial/

**To Reproduce**
Steps to reproduce the behavior:

1. Installation


conda create -n py38 -y python=3.8
conda activate py38 
conda install -c "nvidia/label/cuda-11.7.0" cuda-toolkit   # default installed nvcc in my machine is 10.1
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
pip install ninja

git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_TRANSFORMER_INFERENCE=1 pip install . \
  --global-option="build_ext" --global-option="-j8" --no-cache -v \
  --disable-pip-version-check 2>&1 | tee build.log

2. My inference script

# run.py
import deepspeed
import torch
import os
from transformers import AutoModel

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

model = AutoModel.from_pretrained("vinai/phobert-base").eval()
model.to(local_rank)

# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(model, mp_size=world_size, checkpoint=None, dtype=torch.float)

model = ds_engine.module

input_ids = torch.LongTensor([[0, 1, 2, 3]]).to(local_rank)

with torch.no_grad():
    output = model(input_ids=input_ids)

print(output.last_hidden_state)

3. How to run the script

deepspeed --num_gpus=2 run.py
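Because the failure is intermittent, it may help to launch the same command several times and tally the exit codes. The following is only a sketch: the helper name `tally_exit_codes`, the default of five runs, and the timeout are illustrative choices rather than part of DeepSpeed, and it assumes that a failed launch shows up as a non-zero exit code of the launcher.

```python
import subprocess
from collections import Counter


def tally_exit_codes(cmd, runs=5, timeout=600):
    """Run `cmd` `runs` times and count how often each exit code occurs.

    `cmd` is an argument list such as ["deepspeed", "--num_gpus=2", "run.py"].
    A run that exceeds `timeout` seconds is recorded under the key "timeout".
    """
    codes = Counter()
    for _ in range(runs):
        try:
            result = subprocess.run(cmd, timeout=timeout)
            codes[result.returncode] += 1
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this exception.
            codes["timeout"] += 1
    return codes
```

For example, `tally_exit_codes(["deepspeed", "--num_gpus=2", "run.py"], runs=10)` returns a `Counter` mapping each observed exit code to how many runs produced it.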


**Expected behavior**
Output of a successful run:

(py38) root@quangthd-7c6fb44f48-qfdg7:/home/workspace/train_exp# deepspeed --num_gpus=2 run_1.py
[2023-04-11 02:37:20,770] [WARNING] [runner.py:181:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-11 02:37:20,864] [INFO] [runner.py:527:main] cmd = /root/miniconda3/envs/py38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_1.py
[2023-04-11 02:37:23,948] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-04-11 02:37:23,948] [INFO] [launch.py:126:main] 0 NCCL_VERSION=2.13.4-1
[2023-04-11 02:37:23,948] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:37:23,949] [INFO] [launch.py:133:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-04-11 02:37:23,949] [INFO] [launch.py:139:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-04-11 02:37:23,949] [INFO] [launch.py:150:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-04-11 02:37:23,949] [INFO] [launch.py:151:main] dist_world_size=2
[2023-04-11 02:37:23,949] [INFO] [launch.py:153:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-04-11 02:37:35,509] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.8.3+4d27225f, git-hash=4d27225f, git-branch=master
[2023-04-11 02:37:35,515] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-04-11 02:37:35,516] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-04-11 02:37:35,520] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-04-11 02:37:35,728] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.8.3+4d27225f, git-hash=4d27225f, git-branch=master
[2023-04-11 02:37:35,730] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-04-11 02:37:35,731] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
AutoTP: [(<class 'transformers.models.roberta.modeling_roberta.RobertaLayer'>, ['output.dense'])]
AutoTP: [(<class 'transformers.models.roberta.modeling_roberta.RobertaLayer'>, ['output.dense'])]
tensor([[[-0.1506, 0.1654, -0.2740, ..., -0.0372, 0.0528, -0.4088],
         [-0.0466, 0.0702, 0.1521, ..., -0.3913, 0.2049, -0.2464],
         [-0.2765, -0.1203, -0.7273, ..., -0.2264, -0.0797, -0.4175],
         [-0.2209, -0.2705, -0.7297, ..., -0.3494, 0.2433, -0.7732]]], device='cuda:1')
tensor([[[-0.1506, 0.1654, -0.2740, ..., -0.0372, 0.0528, -0.4088],
         [-0.0466, 0.0702, 0.1521, ..., -0.3913, 0.2049, -0.2464],
         [-0.2765, -0.1203, -0.7273, ..., -0.2264, -0.0797, -0.4175],
         [-0.2209, -0.2705, -0.7297, ..., -0.3494, 0.2433, -0.7732]]], device='cuda:0')
[2023-04-11 02:37:39,011] [INFO] [launch.py:329:main] Process 12518 exits successfully.
[2023-04-11 02:37:39,012] [INFO] [launch.py:329:main] Process 12517 exits successfully.
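The log above warns that `mp_size` is deprecated in favour of `tensor_parallel.tp_size`. A minimal sketch of the replacement argument follows; the helper name `inference_kwargs` is hypothetical, and this change is only expected to silence the warning, not to fix the failure reported here.

```python
def inference_kwargs(world_size):
    """Build keyword arguments for deepspeed.init_inference without mp_size.

    Hypothetical helper: it only assembles the nested setting named in the
    deprecation warning (tensor_parallel.tp_size).
    """
    return {"tensor_parallel": {"tp_size": world_size}}


# In run.py the call would then look like this (not executed here):
# ds_engine = deepspeed.init_inference(
#     model, dtype=torch.float, **inference_kwargs(world_size)
# )
```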


**ds_report output**
To reproduce the bug, run the script several times or increase `--num_gpus`.
Here is the log of a failed run:

(py38) root@quangthd-7c6fb44f48-qfdg7:/home/workspace/train_exp# deepspeed --num_gpus=3 run_1.py
[2023-04-11 02:38:38,952] [WARNING] [runner.py:181:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-11 02:38:39,050] [INFO] [runner.py:527:main] cmd = /root/miniconda3/envs/py38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_1.py
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NCCL_VERSION=2.13.4-1
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:38:41,916] [INFO] [launch.py:133:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2023-04-11 02:38:41,916] [INFO] [launch.py:139:main] nnodes=1, num_local_procs=3, node_rank=0
[2023-04-11 02:38:41,916] [INFO] [launch.py:150:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2023-04-11 02:38:41,916] [INFO] [launch.py:151:main] dist_world_size=3
[2023-04-11 02:38:41,916] [INFO] [launch.py:153:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
[2023-04-11 02:39:07,402] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 13678
[2023-04-11 02:39:07,603] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.8.3+4d27225f, git-hash=4d27225f, git-branch=master
[2023-04-11 02:39:07,605] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-04-11 02:39:07,606] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-04-11 02:39:07,622] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 13679
[2023-04-11 02:39:07,840] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 13680
[2023-04-11 02:39:07,840] [ERROR] [launch.py:303:sigkill_handler] ['/root/miniconda3/envs/py38/bin/python', '-u', 'run_1.py', '--local_rank=2'] exits with return code = -9
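The final log line reports `return code = -9`. In Python's `subprocess` convention, a negative return code -N means the child was terminated by signal N, so -9 would correspond to SIGKILL, assuming the launcher reports the code it got from `subprocess` unchanged. The log alone does not show who sent the signal. The helper below is a hypothetical sketch for decoding such codes.

```python
import signal


def describe_return_code(code):
    """Turn a subprocess-style return code into a readable description."""
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # The number does not map to a signal known on this platform.
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    return f"exited with status {code}"
```

On Linux, `describe_return_code(-9)` yields `killed by signal 9 (SIGKILL)`.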


**ds_report**

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/py38/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/root/miniconda3/envs/py38/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.3+4d27225f, 4d27225f, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7



**System info (please complete the following information):**
 - OS: Ubuntu 18.04.6 LTS
 - GPU: 1 machine 12 GPUs A16
 - Deepspeed version: 0.8.3
 - Transformers version: 4.27.4
 - Python version: 3.8.16

satpalsr commented 1 year ago

+1 to stay in the loop. I suspect this happened because some DeepSpeed processes kept running after your earlier runs, and a later run then went OOM.
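One way to check this hypothesis before launching is to list the compute processes currently holding GPU memory. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-compute-apps` CSV output; the function names are illustrative.

```python
import subprocess

# nvidia-smi query for processes that currently hold a compute context.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(text):
    """Parse 'pid, used_memory' CSV rows into (pid, MiB or None) tuples."""
    rows = []
    for line in text.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        pid, used = parts
        # used_memory can be reported as a non-numeric placeholder.
        rows.append((int(pid), int(used) if used.isdigit() else None))
    return rows


def leftover_gpu_processes():
    """Return the processes nvidia-smi currently reports on the GPUs."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

If `leftover_gpu_processes()` returns an empty list right before a failing launch, leftover processes from earlier runs are unlikely to be the cause.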

Quang-elec44 commented 1 year ago

@satpalsr Not really! My first attempt with --num_gpus >3 failed without any previous run

udhavsethi commented 1 year ago

I see that in your code the output is printed twice, once per GPU. Why is that? How can I run inference only once per example?

Also, is DeepSpeed inference supposed to copy the model to all devices? The tutorial page says:

> It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory.

How does it help with this?

Quang-elec44 commented 1 year ago

@udhavsethi According to the tutorial page, you can take the result from rank 0. As for model parallelism, in my experience it did not work as I expected. It did split the model across GPUs, but the total memory used was higher than with a single GPU. The split was also uneven: the rank 0 GPU consumed more memory than the others.
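A minimal sketch of taking the result from rank 0 only, assuming the launcher sets `LOCAL_RANK` as run.py already relies on and that the job runs on a single node. The helper names are hypothetical. Every rank still calls the model, since with tensor parallelism all ranks are expected to take part in the forward pass; only the printing is restricted.

```python
import os


def is_rank_zero(env=None):
    """Return True on the process whose LOCAL_RANK is 0 (or unset)."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0")) == 0


def print_on_rank_zero(*args, **kwargs):
    """Print only from rank 0 so each result appears once."""
    if is_rank_zero():
        print(*args, **kwargs)
```

In run.py, the last line would become `print_on_rank_zero(output.last_hidden_state)`.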

asifehmad commented 1 year ago

> I see that in your code the output is printed twice, once per GPU. Why is that? How to run the inference only once per example?
>
> Also, is deepspeed inference supposed to copy over the model to all devices? On the tutorial page I see the following:
>
> It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory.
>
> How is it helping with this?

Hi, I am also getting two outputs instead of one. Have you sorted out this issue?

microsoft / DeepSpeed

[BUG] Inference failed serveral times #3181
