Open Quang-elec44 opened 1 year ago
+1, I'd like to stay in the loop. I suspect it happened because some deepspeed processes kept running after your earlier runs, and a later run then went OOM.
@satpalsr Not really! My first attempt with --num_gpus >3 failed without any previous run.
I see that in your code the output is printed twice, once per GPU. Why is that? How can I run the inference only once per example?
Also, is deepspeed inference supposed to copy over the model to all devices? On the tutorial page I see the following:
It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory.
How is it helping with this?
@udhavsethi According to the tutorial page, you can take the result from rank 0 only. As for model parallelism, in my experience it did not work as I expected. It did split the model across the GPUs, but the total memory usage was higher than when I used only one GPU. The split was also uneven: the rank 0 GPU consumed more memory than the others.
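One way to follow the "result from rank 0" advice is to gate the print on the process rank. This is a minimal sketch, assuming a single-node launch where the deepspeed launcher sets LOCAL_RANK (the script in this issue already reads that variable). The helper names is_main_rank and print_once are hypothetical, not DeepSpeed APIs.

```python
import os

def is_main_rank(env=None):
    """Return True only in the process whose LOCAL_RANK is 0 (or unset)."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0")) == 0

def print_once(*values, env=None):
    """Print on rank 0 only and report whether anything was printed."""
    if is_main_rank(env):
        print(*values)
        return True
    return False

# In run.py, the final print would become (assumption, not tested on GPUs):
# print_once(output.last_hidden_state)
print_once("printed by rank 0 only")
```

Every rank still has to run the forward pass, because the tensor-parallel shards cooperate to produce the output. Only the printing is restricted to one process.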
Hi, I am also getting two outputs instead of one. Have you sorted out this issue?
Describe the bug
When I run my inference code with deepspeed.init_inference(), it only works some of the time with num_gpus=2 (num_gpus>2 always fails, num_gpus=2 sometimes fails). I am following this tutorial: https://www.deepspeed.ai/tutorials/inference-tutorial/
To Reproduce
Steps to reproduce the behavior:
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_TRANSFORMER_INFERENCE=1 pip install . \
  --global-option="build_ext" --global-option="-j8" --no-cache -v \
  --disable-pip-version-check 2>&1 | tee build.log
run.py
import deepspeed
import torch
import os
from transformers import AutoModel

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

model = AutoModel.from_pretrained("vinai/phobert-base").eval()
model.to(local_rank)

# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(model, mp_size=world_size, checkpoint=None, dtype=torch.float)
model = ds_engine.module

input_ids = torch.LongTensor([[0, 1, 2, 3]]).to(local_rank)

with torch.no_grad():
    output = model(input_ids=input_ids)

print(output.last_hidden_state)
deepspeed --num_gpus=2 run.py
(py38) root@quangthd-7c6fb44f48-qfdg7:/home/workspace/train_exp# deepspeed --num_gpus=2 run_1.py
[2023-04-11 02:37:20,770] [WARNING] [runner.py:181:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-11 02:37:20,864] [INFO] [runner.py:527:main] cmd = /root/miniconda3/envs/py38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_1.py
[2023-04-11 02:37:23,948] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-04-11 02:37:23,948] [INFO] [launch.py:126:main] 0 NCCL_VERSION=2.13.4-1
[2023-04-11 02:37:23,948] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-04-11 02:37:23,949] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:37:23,949] [INFO] [launch.py:133:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-04-11 02:37:23,949] [INFO] [launch.py:139:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-04-11 02:37:23,949] [INFO] [launch.py:150:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-04-11 02:37:23,949] [INFO] [launch.py:151:main] dist_world_size=2
[2023-04-11 02:37:23,949] [INFO] [launch.py:153:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-04-11 02:37:35,509] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.8.3+4d27225f, git-hash=4d27225f, git-branch=master
[2023-04-11 02:37:35,515] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-04-11 02:37:35,516] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-04-11 02:37:35,520] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-04-11 02:37:35,728] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.8.3+4d27225f, git-hash=4d27225f, git-branch=master
[2023-04-11 02:37:35,730] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-04-11 02:37:35,731] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
AutoTP: [(<class 'transformers.models.roberta.modeling_roberta.RobertaLayer'>, ['output.dense'])]
AutoTP: [(<class 'transformers.models.roberta.modeling_roberta.RobertaLayer'>, ['output.dense'])]
tensor([[[-0.1506, 0.1654, -0.2740, ..., -0.0372, 0.0528, -0.4088],
         [-0.0466, 0.0702, 0.1521, ..., -0.3913, 0.2049, -0.2464],
         [-0.2765, -0.1203, -0.7273, ..., -0.2264, -0.0797, -0.4175],
         [-0.2209, -0.2705, -0.7297, ..., -0.3494, 0.2433, -0.7732]]], device='cuda:1')
tensor([[[-0.1506, 0.1654, -0.2740, ..., -0.0372, 0.0528, -0.4088],
         [-0.0466, 0.0702, 0.1521, ..., -0.3913, 0.2049, -0.2464],
         [-0.2765, -0.1203, -0.7273, ..., -0.2264, -0.0797, -0.4175],
         [-0.2209, -0.2705, -0.7297, ..., -0.3494, 0.2433, -0.7732]]], device='cuda:0')
[2023-04-11 02:37:39,011] [INFO] [launch.py:329:main] Process 12518 exits successfully.
[2023-04-11 02:37:39,012] [INFO] [launch.py:329:main] Process 12517 exits successfully.
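The log above warns that mp_size is deprecated in favor of tensor_parallel.tp_size. A minimal sketch of how the arguments could be built without the deprecated field, assuming the DeepSpeed version in use accepts a tensor_parallel dictionary in init_inference; the helper name inference_kwargs is hypothetical, and this is not claimed to fix the crash reported here:

```python
import os

def inference_kwargs(world_size):
    """Build keyword arguments for deepspeed.init_inference without mp_size."""
    # tensor_parallel.tp_size is the replacement named by the deprecation warning.
    return {"tensor_parallel": {"tp_size": world_size}}

world_size = int(os.getenv("WORLD_SIZE", "1"))
kwargs = inference_kwargs(world_size)
print(kwargs)

# In run.py this would replace the mp_size argument (not executed in this sketch):
# ds_engine = deepspeed.init_inference(model, dtype=torch.float, **kwargs)
```

This only silences the deprecation warning; the tensor-parallel degree passed to DeepSpeed stays the same as in the original script.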
(py38) root@quangthd-7c6fb44f48-qfdg7:/home/workspace/train_exp# deepspeed --num_gpus=3 run_1.py
[2023-04-11 02:38:38,952] [WARNING] [runner.py:181:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-11 02:38:39,050] [INFO] [runner.py:527:main] cmd = /root/miniconda3/envs/py38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_1.py
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NCCL_VERSION=2.13.4-1
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-04-11 02:38:41,915] [INFO] [launch.py:126:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-04-11 02:38:41,916] [INFO] [launch.py:133:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2023-04-11 02:38:41,916] [INFO] [launch.py:139:main] nnodes=1, num_local_procs=3, node_rank=0
[2023-04-11 02:38:41,916] [INFO] [launch.py:150:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2023-04-11 02:38:41,916] [INFO] [launch.py:151:main] dist_world_size=3
[2023-04-11 02:38:41,916] [INFO] [launch.py:153:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
[2023-04-11 02:39:07,402] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 13678
[2023-04-11 02:39:07,603] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.8.3+4d27225f, git-hash=4d27225f, git-branch=master
[2023-04-11 02:39:07,605] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-04-11 02:39:07,606] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-04-11 02:39:07,622] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 13679
[2023-04-11 02:39:07,840] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 13680
[2023-04-11 02:39:07,840] [ERROR] [launch.py:303:sigkill_handler] ['/root/miniconda3/envs/py38/bin/python', '-u', 'run_1.py', '--local_rank=2'] exits with return code = -9
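The failing run ends with "return code = -9". In Python's subprocess convention, a negative return code means the child was terminated by the signal with that number. A minimal sketch that decodes such codes on a POSIX system; describe_return_code is a hypothetical helper name:

```python
import signal

def describe_return_code(code):
    """Translate a subprocess-style return code into a readable description."""
    if code < 0:
        # A negative code means the process was terminated by signal number -code.
        return f"killed by {signal.Signals(-code).name}"
    return f"exited with status {code}"

print(describe_return_code(-9))
```

Signal 9 is SIGKILL. On Linux this is often, though not always, what the kernel's out-of-memory killer sends when host RAM runs out, so checking the kernel log (for example with dmesg) may help confirm or rule out the OOM suspicion raised in the comments.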
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/py38/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/root/miniconda3/envs/py38/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.3+4d27225f, 4d27225f, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7