huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Activation checkpointing with FSDP incorrectly splits the attention mask? #3117

Open · gorjanradevski opened this issue 1 month ago

gorjanradevski commented 1 month ago

System Info

- `Accelerate` version: 0.34.2
- Platform: Linux-5.4.0-45-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/gradevski/miniconda3/envs/summary_explainer_package/bin/accelerate
- Python version: 3.10.13
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.1.2+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 251.58 GB
- GPU type: Tesla V100-PCIE-32GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - fsdp_config: {'fsdp_activation_checkpointing': True, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': True, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_use_orig_params': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
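
For reference, the same FSDP settings can also be passed programmatically instead of through `accelerate config`. Below is a rough sketch of that, assuming the remaining options still come from the saved config/environment; the exact keyword set of `FullyShardedDataParallelPlugin` can differ between accelerate versions, so treat it as illustrative only.

from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Sketch: toggle the option at issue (activation checkpointing) in code.
# All other FSDP options are assumed to come from the saved accelerate config / env vars.
fsdp_plugin = FullyShardedDataParallelPlugin(activation_checkpointing=True)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin, mixed_precision="bf16")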

Reproduction

import torch
from accelerate import Accelerator
from tqdm.auto import tqdm
from transformers import AutoModelForCausalLM

class DummyDataset(torch.utils.data.Dataset):
    def __init__(self, size: int):
        self.dataset = torch.randint(0, 100, (size, 10))

    def __len__(self):
        return self.dataset.shape[0]

    def __getitem__(self, idx):
        return {"input_ids": self.dataset[idx], "attention_mask": torch.ones(10), "labels": self.dataset[idx]}

def collate_fn(batch):
    # Convert a list of per-sample dicts into a dict of batched (stacked) tensors
    input_ids = torch.stack([x["input_ids"] for x in batch])
    attention_mask = torch.stack([x["attention_mask"] for x in batch])
    labels = torch.stack([x["labels"] for x in batch])

    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

def main():
    accelerator = Accelerator()
    dataset = DummyDataset(size=100)
    model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-mini-instruct")
    # Pass the collate_fn defined above; the default collate would produce the same
    # batches for this dict-of-tensors dataset, but this matches the intent of the script.
    loader = torch.utils.data.DataLoader(dataset, batch_size=2, collate_fn=collate_fn)
    optimizer = torch.optim.AdamW(model.parameters())
    # FSDP wrapping (and activation checkpointing, when enabled) is applied inside prepare()
    loader, model, optimizer = accelerator.prepare(loader, model, optimizer)
    for batch in tqdm(loader):
        optimizer.zero_grad()
        outputs = model(**batch)
        accelerator.backward(outputs.loss)
        optimizer.step()

if __name__ == '__main__':
    main()

Expected behavior

When I run the code snippet above with `CUDA_VISIBLE_DEVICES=0,1,6,7 accelerate launch --num_processes 4 scripts/test_fsdp.py`, I get:

    attn_weights += causal_mask
RuntimeError: The size of tensor a (20) must match the size of tensor b (10) at non-singleton dimension 3

However, if I set `fsdp_activation_checkpointing: False`, no such error occurs; I would expect training to run the same way with activation checkpointing enabled.
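
For additional context: if I read the FSDP integration correctly, enabling `fsdp_activation_checkpointing` makes `prepare()` wrap each transformer block in PyTorch's non-reentrant activation-checkpoint wrapper. A minimal sketch of that wrapping, applied by hand, is below; the `wrap_with_checkpointing` helper and the `Phi3DecoderLayer` class-name check are my own assumptions for illustration, not accelerate's exact code.

import functools

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

def wrap_with_checkpointing(model):
    # Wrap every decoder layer in a non-reentrant activation-checkpoint wrapper,
    # so its activations are recomputed during the backward pass.
    non_reentrant_wrapper = functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    )
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=non_reentrant_wrapper,
        # Assumption: Phi-3.5-mini's decoder layers are instances of "Phi3DecoderLayer".
        check_fn=lambda module: module.__class__.__name__ == "Phi3DecoderLayer",
    )

Since the mismatch is exactly twice the sequence length (20 vs. 10), my unconfirmed guess is that the recomputed forward pass sees a KV cache left over from the first pass; setting `model.config.use_cache = False` before `prepare()` might be worth trying as a diagnostic.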

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.