huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Stuck on Initializing Transformers Model with FSDP (Fully Sharded Data Parallel) using meta device #31278

Open jiangjiadi opened 1 month ago

jiangjiadi commented 1 month ago

System Info

Who can help?

text model: @ArthurZucker and @younesbelkada

Reproduction

Run Command: torchrun --nproc_per_node 2 test_fsdp.py

import torch
import os
import torch.distributed as dist
import functools
import transformers
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy
from torch.distributed.fsdp.fully_sharded_data_parallel import CPUOffload
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.qwen2 import Qwen2Config, Qwen2ForCausalLM
from transformers.models.qwen2.modeling_qwen2 import Qwen2DecoderLayer

local_rank = int(os.environ["LOCAL_RANK"])
print("local_rank:", local_rank)
torch.cuda.set_device(local_rank)

dist.init_process_group("nccl", init_method="env://")

config = Qwen2Config(
    hidden_size=1024,
    intermediate_size=2816,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=16,
    max_window_layers=21,
    rope_theta=1000000.0,
    tie_word_embeddings=True,
)
if local_rank == 0:
    print(config)

config.use_cache = False

# Materialize real weights only on rank 0; the other ranks build the model on
# the meta device and rely on sync_module_states to receive rank 0's weights.
if local_rank != 0:
    with torch.device("meta"):
        model = Qwen2ForCausalLM._from_config(config)
else:
    model = Qwen2ForCausalLM._from_config(config)
print(f"rank {local_rank}: Model is defined.")
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={
            Qwen2DecoderLayer,
        },
    ),
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=CPUOffload(offload_params=True),
    device_id=torch.cuda.current_device(),
    limit_all_gathers=False,
    sync_module_states=True,  # broadcast rank 0's weights to the meta-initialized ranks
    param_init_fn=(
        (lambda module: module.to_empty(device=torch.device("cuda"), recurse=False))
        if local_rank != 0
        else None
    ),
    use_orig_params=True,
)

if local_rank != 0:
    print(">>>> Created Model.")
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f">>> The model has {trainable_params / 1e6} M  trainable parameters")

Expected behavior

When tie_word_embeddings=False is set, the code behaves normally. However, when I set tie_word_embeddings=True, rank 0 exits normally but rank 1 gets stuck. The point where it gets stuck is shown in the screenshot below. (The behavior is the same when using accelerate.)

[screenshot: traceback showing where rank 1 hangs]
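
For reference, here is a quick check separate from the reproducer above (tiny config, single process) showing that tie_word_embeddings=True makes the input embeddings and the lm_head share a single Parameter object:

from transformers.models.qwen2 import Qwen2Config, Qwen2ForCausalLM

tiny_config = Qwen2Config(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=4,
    tie_word_embeddings=True,
)
tiny_model = Qwen2ForCausalLM._from_config(tiny_config)
# Prints True: embed_tokens.weight and lm_head.weight are the same tensor object.
print(tiny_model.get_input_embeddings().weight is tiny_model.get_output_embeddings().weight)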

younesbelkada commented 1 month ago

Thanks for the reproducer, looking into it now

younesbelkada commented 1 month ago

Hi @jiangjiadi, I have spent some time looking into the issue and was able to reproduce it. Interestingly, the script works if you never initialize the model on the meta device.

Also note this from the official PyTorch docs:

As of PyTorch 1.12, FSDP only offers limited support for shared parameters (for example, setting one Linear layer’s weight to another’s). In particular, modules that share parameters must be wrapped as part of the same FSDP unit. If enhanced shared parameter support is needed for your use case, please ping https://github.com/pytorch/pytorch/issues/77724
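
In the meantime, here is a minimal sketch of the meta-free path that worked in my test, assuming every rank has enough host memory to materialize the full model before wrapping:

# Every rank builds the real model; no meta device and no param_init_fn needed.
model = Qwen2ForCausalLM._from_config(config)
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={Qwen2DecoderLayer},
    ),
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=CPUOffload(offload_params=True),
    device_id=torch.cuda.current_device(),
    sync_module_states=True,
    use_orig_params=True,
)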

I will keep investigating and let you know.

jiangjiadi commented 1 month ago

Hi @younesbelkada, thank you for looking into this issue. I appreciate your prompt response and look forward to any updates.

Additionally, I've noticed that when the from_config method is called with DeepSpeed's ZeRO-3 enabled, the model gets pre-partitioned. Could a similar approach be adopted for FSDP initialization? Pre-partitioning the model at definition time could help mitigate OOM issues when training large models.
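
For reference, a rough sketch of the ZeRO-3 behavior I'm referring to. This calls deepspeed.zero.Init directly rather than going through transformers' integration, the config dict is a minimal placeholder, and it assumes torch.distributed is already initialized as in the reproducer:

import deepspeed
from transformers.models.qwen2 import Qwen2Config, Qwen2ForCausalLM

# Placeholder ZeRO-3 config; a real run would use the full DeepSpeed config.
ds_config = {"train_micro_batch_size_per_gpu": 1, "zero_optimization": {"stage": 3}}

config = Qwen2Config(tie_word_embeddings=True)
# Parameters are partitioned across ranks as each submodule is constructed,
# so no single rank ever holds the full model in memory.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = Qwen2ForCausalLM._from_config(config)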

amyeroberts commented 1 week ago

cc @muellerzr @SunMarc