import os
import torch
import transformers
import deepspeed
from deepspeed.runtime.utils import see_memory_usage
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "4"))
# create the model pipeline
pipe = transformers.pipeline(task="text-generation", model="meta-llama/Llama-2-13b-hf", device=local_rank)
print(f"[RANK{local_rank}] MEM ALLOCATED: {torch.cuda.memory_allocated(local_rank) / 1024 / 1024 / 1024:.2f} GiB")
# Initialize the DeepSpeed-Inference engine
pipe.model = deepspeed.init_inference(
    pipe.model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float,
)
see_memory_usage("after init_inference", True)
input_sentences = [
    "DeepSpeed is a machine learning framework",
    "He is working on",
    "He has a",
    "He got all",
    "Everyone is happy and I can",
    "The new movie that got Oscar this year",
    "In the far far distance from our galaxy,",
    "Peace is the only way",
]
output = pipe(input_sentences, num_return_sequences=1, max_length=100)
print(f"RANK: {local_rank}: {output[0][0]['generated_text']}")
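The print in the script reports only `torch.cuda.memory_allocated()`, which can differ from what `nvidia-smi` shows because PyTorch's caching allocator may keep freed blocks reserved. A minimal sketch for logging allocated, reserved, and peak memory per rank follows. The helper names `bytes_to_gib` and `report_cuda_memory` are hypothetical and are not part of DeepSpeed or the original script.

```python
def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


def report_cuda_memory(tag: str, device: int) -> str:
    """Return one line with allocated, reserved, and peak memory for `device`."""
    # Imported here so bytes_to_gib stays usable on machines without PyTorch.
    import torch

    allocated = bytes_to_gib(torch.cuda.memory_allocated(device))
    reserved = bytes_to_gib(torch.cuda.memory_reserved(device))
    peak = bytes_to_gib(torch.cuda.max_memory_allocated(device))
    return (f"[RANK{device}] {tag}: allocated={allocated:.2f} GiB, "
            f"reserved={reserved:.2f} GiB, peak={peak:.2f} GiB")
```

Calling `print(report_cuda_memory("before init_inference", local_rank))` right after the pipeline is created, and again after `deepspeed.init_inference`, could help tell apart memory that is still held by live tensors from memory that is only cached by the allocator.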
command
deepspeed --num_gpus 4 inference_test.py
I read other issue reports:
One said to use injection_policy, which I understand should no longer be needed in the latest version with AutoTP.
Another said to use ZeRO. I tried stage 3, but it didn't work.
ds_report output
[2024-05-30 17:33:23,951] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The default cache directory for DeepSpeed Triton autotune, /home/yunho/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.0), only 1.0.0 is known to be compatible
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fp_quantizer ........... [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch version .................... 2.3.0
deepspeed info ................... 0.14.3+2fc702ed, 2fc702ed, master
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 2.3, cuda 12.1
shared memory (/dev/shm) size .... 472.44 GB
System info (please complete the following information):
OS: ubuntu 22.04
GPU count and types: 1 node with 8 A100s.
transformers 4.41.2
Python 3.10
I expected DeepSpeed to allocate about 13 GB of model weights to each GPU. Is AutoTP duplicating the weights because the whole model fits on a single GPU?
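The 13 GB expectation follows from simple arithmetic, sketched below. It assumes roughly 13 billion parameters (an approximation of the model size), 4 bytes per parameter for `torch.float`, and an even split of the weights across the 4 tensor-parallel ranks. It ignores activations, the KV cache, and the CUDA context. The function name `expected_weight_gb` is hypothetical.

```python
def expected_weight_gb(n_params: float, bytes_per_param: int, tp_size: int) -> float:
    """Weights per GPU in GB (1e9 bytes) if evenly sharded over tp_size ranks."""
    total_bytes = n_params * bytes_per_param
    return total_bytes / tp_size / 1e9


# Assumed ~13e9 parameters, fp32 (4 bytes), 4 GPUs as in the launch command.
per_gpu = expected_weight_gb(13e9, 4, 4)
full_copy = expected_weight_gb(13e9, 4, 1)
print(f"sharded: {per_gpu:.1f} GB per GPU, unsharded: {full_copy:.1f} GB per GPU")
```

Under these assumptions, a reading near 52 GB on each GPU would point to a full copy of the weights rather than a shard.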
Describe the bug: DeepSpeed loads the whole model onto every GPU when running Llama2-13b in full precision.
To Reproduce: I followed the tutorial at https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/