huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Embedding size 0 when using TrainingArguments & DeepSpeed stage 3 with `model.get_input_embeddings()` #32021

Open ojipadeson opened 2 months ago

ojipadeson commented 2 months ago

System Info

Who can help?

@muellerzr @SunMarc

Information

Tasks

Reproduction

My python script train_temp.py:

import torch
import transformers
from transformers import AutoModelForCausalLM
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)

parser = transformers.HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()  # returns a tuple of dataclass instances

# Load model and tokenizer
config = transformers.AutoConfig.from_pretrained(
    "path/to/Qwen2",
)

llm_model = AutoModelForCausalLM.from_pretrained(
    "path/to/Qwen2",
    config=config,
)

pretrained_embed = llm_model.get_input_embeddings()
print(pretrained_embed)                # Embedding(152064, 3584)
print(pretrained_embed.weight.shape)   # torch.Size([0])

out = pretrained_embed(torch.ones((1, 1024), dtype=torch.int))
print(out.shape)

My running script:

DISTRIBUTED_ARGS="
    --nproc_per_node 4 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr localhost \
    --master_port 6001
"

DS_CONFIG_PATH="ds_config_zero3.json"

torchrun $DISTRIBUTED_ARGS train_temp.py \
    --output_dir output_20240712_1 \
    --deepspeed ${DS_CONFIG_PATH}

My ds_config_zero3.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Error:

[rank3]: Traceback (most recent call last):
[rank3]:   File "/mnt9/fangrui/qwen_bert/train_temp.py", line 28, in <module>
[rank3]:     out = pretrained_embed(torch.ones((1, 1024), dtype=torch.int))
[rank3]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/fr450273/miniconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/fr450273/miniconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/fr450273/miniconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 163, in forward
[rank3]:     return F.embedding(
[rank3]:            ^^^^^^^^^^^^
[rank3]:   File "/home/fr450273/miniconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/functional.py", line 2264, in embedding
[rank3]:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: RuntimeError: 'weight' must be 2-D

When I delete the TrainingArguments part, the embedding size returns to normal.

# Delete
@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)

parser = transformers.HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()
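
A quick way to confirm which loading path is taken is to check, right before from_pretrained, whether a ZeRO-3 config is registered with transformers. This is only a minimal sketch; it assumes is_deepspeed_zero3_enabled can be imported from transformers.integrations in the installed version:

from transformers.integrations import is_deepspeed_zero3_enabled  # assumed import path

# Place this right before AutoModelForCausalLM.from_pretrained(...):
# - False when the TrainingArguments block above is deleted
# - True when it is kept and the script is launched with --deepspeed ds_config_zero3.json,
#   in which case from_pretrained loads the weights already partitioned across ranks
print("ZeRO-3 active at load time:", is_deepspeed_zero3_enabled())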

Expected behavior

print(pretrained_embed.weight.shape)   # torch.Size([152064, 3584])
print(out.shape)    # torch.Size([1, 1024, 3584])
anferico commented 2 months ago

@ojipadeson I can help you here since I've faced pretty much the same problem recently. It turns out this behavior is normal when using DeepSpeed ZeRO-3, because ZeRO-3 partitions (shards) model parameters across the participating ranks and can additionally offload them to other devices (GPU or CPU). As described in DeepSpeed's documentation, if you try to access some model parameters (in this case the embedding layer) outside the model's forward() method, they may appear as empty (as in your case) because each rank only holds its own shard, or just a placeholder, of the full tensor.
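
You can see the partitioning on a single rank without gathering anything. This is only a sketch; ds_shape and ds_numel are DeepSpeed-internal attributes attached to partitioned parameters, so treat the exact names as version-dependent:

w = llm_model.get_input_embeddings().weight  # llm_model loaded under ZeRO-3 as in the script above

print(w.shape)                        # torch.Size([0]) -> local placeholder, not the real tensor
print(getattr(w, "ds_shape", None))   # full shape DeepSpeed will gather, e.g. (152064, 3584)
print(getattr(w, "ds_numel", None))   # total number of elements across all shards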

When you try to access the same parameters inside the model's forward() method, DeepSpeed automatically fetches them from whatever device they were offloaded to. On the other hand, if you want to access them outside the forward() method, then you have to manually gather them (that's the technical term) using deepspeed.zero.GatheredParameters. Try the following:

import deepspeed
import torch
import transformers
from transformers import AutoModelForCausalLM
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)

parser = transformers.HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()

# Load model and tokenizer
config = transformers.AutoConfig.from_pretrained(
    "path/to/Qwen2",
)

llm_model = AutoModelForCausalLM.from_pretrained(
    "path/to/Qwen2",
    config=config,
)

pretrained_embed = llm_model.get_input_embeddings()
with deepspeed.zero.GatheredParameters(pretrained_embed.weight, modifier_rank=0):
    # Inside this context the partitioned weight is gathered back to its full size
    print(pretrained_embed)                # Embedding(152064, 3584)
    print(pretrained_embed.weight.shape)   # torch.Size([152064, 3584])

    out = pretrained_embed(torch.ones((1, 1024), dtype=torch.int))
    print(out.shape)                       # torch.Size([1, 1024, 3584])
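
If the same script is sometimes run without a ZeRO-3 config (e.g. for quick local debugging), the gather can be made conditional so it becomes a no-op in that case. A small sketch along the same lines, again assuming is_deepspeed_zero3_enabled is importable from transformers.integrations:

import contextlib

import deepspeed
import torch
from transformers.integrations import is_deepspeed_zero3_enabled  # assumed import path

def gathered_if_zero3(params):
    # Gather ZeRO-3 partitioned parameters for read-only access; no-op otherwise
    if is_deepspeed_zero3_enabled():
        return deepspeed.zero.GatheredParameters(params, modifier_rank=None)
    return contextlib.nullcontext()

pretrained_embed = llm_model.get_input_embeddings()  # llm_model loaded as above
with gathered_if_zero3(pretrained_embed.weight):
    print(pretrained_embed.weight.shape)   # torch.Size([152064, 3584]) with or without ZeRO-3
    out = pretrained_embed(torch.ones((1, 1024), dtype=torch.int))
    print(out.shape)                       # torch.Size([1, 1024, 3584])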
github-actions[bot] commented 4 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.