huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Embedding size 0 when using TrainingArguments & DeepSpeed stage 3 with `model.get_input_embeddings()` #32021

Open ojipadeson opened 2 months ago

ojipadeson commented 2 months ago

System Info

Who can help?

@muellerzr @SunMarc

Information

Tasks

Reproduction

My python script train_temp.py:

import torch
import transformers
from transformers import AutoModelForCausalLM
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)

parser = transformers.HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()  # returns a tuple of dataclass instances

# Load model and tokenizer
config = transformers.AutoConfig.from_pretrained(
    "path/to/Qwen2",
)

llm_model = AutoModelForCausalLM.from_pretrained(
    "path/to/Qwen2",
    config=config,
)

pretrained_embed = llm_model.get_input_embeddings()
print(pretrained_embed)                # Embedding(152064, 3584)
print(pretrained_embed.weight.shape)   # torch.Size([0])

out = pretrained_embed(torch.ones((1, 1024), dtype=torch.int))
print(out.shape)

My running script:

DISTRIBUTED_ARGS="
    --nproc_per_node 4 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr localhost \
    --master_port 6001
"

DS_CONFIG_PATH="ds_config_zero3.json"

torchrun $DISTRIBUTED_ARGS train_temp.py \
    --output_dir output_20240712_1 \
    --deepspeed ${DS_CONFIG_PATH}

My ds_config_zero3.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Error:

[rank3]: Traceback (most recent call last):
[rank3]:   File "/mnt9/fangrui/qwen_bert/train_temp.py", line 28, in <module>
[rank3]:     out = pretrained_embed(torch.ones((1, 1024), dtype=torch.int))
[rank3]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/fr450273/miniconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/fr450273/miniconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/fr450273/miniconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 163, in forward
[rank3]:     return F.embedding(
[rank3]:            ^^^^^^^^^^^^
[rank3]:   File "/home/fr450273/miniconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/functional.py", line 2264, in embedding
[rank3]:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: RuntimeError: 'weight' must be 2-D

When I delete the TrainingArguments part, the embedding size returns to normal.

# Delete
@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)

parser = transformers.HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()
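
A quick way to confirm which loading path is taken is to check, right before from_pretrained, whether a ZeRO-3 config is registered with transformers. This is only a minimal sketch; it assumes is_deepspeed_zero3_enabled can be imported from transformers.integrations in the installed version:

from transformers.integrations import is_deepspeed_zero3_enabled  # assumed import path

# Place this right before AutoModelForCausalLM.from_pretrained(...):
# - False when the TrainingArguments block above is deleted
# - True when it is kept and the script is launched with --deepspeed ds_config_zero3.json,
#   in which case from_pretrained loads the weights already partitioned across ranks
print("ZeRO-3 active at load time:", is_deepspeed_zero3_enabled())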

Expected behavior

print(pretrained_embed.weight.shape)   # torch.Size([152064, 3584])
print(out.shape)    # torch.Size([1, 1024, 3584])
anferico commented 2 months ago

@ojipadeson I can help you here since I've faced pretty much the same problem recently. It turns out this behavior is normal when using DeepSpeed ZeRO-3, because ZeRO-3 partitions (shards) model parameters across the participating ranks and can additionally offload them to other devices (GPU or CPU). As described in DeepSpeed's documentation, if you try to access some model parameters (in this case the embedding layer) outside the model's forward() method, they may appear as empty (as in your case) because each rank only holds its own shard, or just a placeholder, of the full tensor.
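
You can see the partitioning on a single rank without gathering anything. This is only a sketch; ds_shape and ds_numel are DeepSpeed-internal attributes attached to partitioned parameters, so treat the exact names as version-dependent:

w = llm_model.get_input_embeddings().weight  # llm_model loaded under ZeRO-3 as in the script above

print(w.shape)                        # torch.Size([0]) -> local placeholder, not the real tensor
print(getattr(w, "ds_shape", None))   # full shape DeepSpeed will gather, e.g. (152064, 3584)
print(getattr(w, "ds_numel", None))   # total number of elements across all shards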

When you try to access the same parameters inside the model's forward() method, DeepSpeed automatically fetches them from whatever device they were offloaded to. On the other hand, if you want to access them outside the forward() method, then you have to manually gather them (that's the technical term) using deepspeed.zero.GatheredParameters. Try the following:

import deepspeed
import torch
import transformers
from transformers import AutoModelForCausalLM
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)

parser = transformers.HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()

# Load model and tokenizer
config = transformers.AutoConfig.from_pretrained(
    "path/to/Qwen2",
)

llm_model = AutoModelForCausalLM.from_pretrained(
    "path/to/Qwen2",
    config=config,
)

pretrained_embed = llm_model.get_input_embeddings()
with deepspeed.zero.GatheredParameters(pretrained_embed.weight, modifier_rank=0):
    # Inside this context the partitioned weight is gathered back to its full size
    print(pretrained_embed)                # Embedding(152064, 3584)
    print(pretrained_embed.weight.shape)   # torch.Size([152064, 3584])

    out = pretrained_embed(torch.ones((1, 1024), dtype=torch.int))
    print(out.shape)                       # torch.Size([1, 1024, 3584])
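
If the same script is sometimes run without a ZeRO-3 config (e.g. for quick local debugging), the gather can be made conditional so it becomes a no-op in that case. A small sketch along the same lines, again assuming is_deepspeed_zero3_enabled is importable from transformers.integrations:

import contextlib

import deepspeed
import torch
from transformers.integrations import is_deepspeed_zero3_enabled  # assumed import path

def gathered_if_zero3(params):
    # Gather ZeRO-3 partitioned parameters for read-only access; no-op otherwise
    if is_deepspeed_zero3_enabled():
        return deepspeed.zero.GatheredParameters(params, modifier_rank=None)
    return contextlib.nullcontext()

pretrained_embed = llm_model.get_input_embeddings()  # llm_model loaded as above
with gathered_if_zero3(pretrained_embed.weight):
    print(pretrained_embed.weight.shape)   # torch.Size([152064, 3584]) with or without ZeRO-3
    out = pretrained_embed(torch.ones((1, 1024), dtype=torch.int))
    print(out.shape)                       # torch.Size([1, 1024, 3584])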
github-actions[bot] commented 4 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.