[BUG] DeepSpeed is loads the whole model to every GPUs instead of partitioning

Describe the bug Deepspeed loads the whole model to every GPUs. When running Llama2-13b in full precision:

To Reproduce I followed the tutorial in https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/

import os
import torch
import transformers
import deepspeed
from deepspeed.runtime.utils import see_memory_usage

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "4"))
# create the model pipeline
pipe = transformers.pipeline(task="text-generation", model="meta-llama/Llama-2-13b-hf", device=local_rank)
print(f"[RANK{local_rank}] MEM ALLCOATED: {torch.cuda.memory_allocated() / 1024 / 1024 / 1024:.2f}")
# Initialize the DeepSpeed-Inference engine
pipe.model = deepspeed.init_inference(
    pipe.model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float,
)
see_memory_usage("after init_inference", True)
input_sentences = [
         "DeepSpeed is a machine learning framework",
         "He is working on",
         "He has a",
         "He got all",
         "Everyone is happy and I can",
         "The new movie that got Oscar this year",
         "In the far far distance from our galaxy,",
         "Peace is the only way"
]
output = pipe(input_sentences, num_return_sequences=1, max_length=100)
print(f"RANK: {local_rank}: {output[0][0]['generated_text']}")

command deepspeed --num_gpus 4 inference_test.py

I read other issue reports and

one said use injection_policy, which should be removed in the latest version with AutoTP
another one said to use zero, I tried stage 3 but didn't work.

ds_report output

[2024-05-30 17:33:23,951] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The default cache directory for DeepSpeed Triton autotune, /home/yunho/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fp_quantizer ........... [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch version .................... 2.3.0
deepspeed info ................... 0.14.3+2fc702ed, 2fc702ed, master
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 2.3, cuda 12.1
shared memory (/dev/shm) size .... 472.44 GB

System info (please complete the following information):

OS: ubuntu 22.04
GPU count and types: 1 node with 8 A100s.
transformers 4.41.2
Python 3.10

I imagined deepspeed to allocate 13GB of model weights to each GPU. Is AutoTP duplicating weights because the whole model can fit inside a GPU?

microsoft / DeepSpeed

[BUG] DeepSpeed is loads the whole model to every GPUs instead of partitioning #5592