huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Activation checkpointing with FSDP incorrectly splits the attention mask? #3117

Open · gorjanradevski opened this issue 1 month ago

gorjanradevski commented 1 month ago

System Info

- `Accelerate` version: 0.34.2
- Platform: Linux-5.4.0-45-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/gradevski/miniconda3/envs/summary_explainer_package/bin/accelerate
- Python version: 3.10.13
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.1.2+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 251.58 GB
- GPU type: Tesla V100-PCIE-32GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - fsdp_config: {'fsdp_activation_checkpointing': True, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': True, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_use_orig_params': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
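
For reference, the same FSDP settings can also be passed programmatically instead of through `accelerate config`. Below is a rough sketch of that, assuming the remaining options still come from the saved config/environment; the exact keyword set of `FullyShardedDataParallelPlugin` can differ between accelerate versions, so treat it as illustrative only.

from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Sketch: toggle the option at issue (activation checkpointing) in code.
# All other FSDP options are assumed to come from the saved accelerate config / env vars.
fsdp_plugin = FullyShardedDataParallelPlugin(activation_checkpointing=True)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin, mixed_precision="bf16")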

Reproduction

import torch
from accelerate import Accelerator
from tqdm.auto import tqdm
from transformers import AutoModelForCausalLM

class DummyDataset(torch.utils.data.Dataset):
    def __init__(self, size: int):
        self.dataset = torch.randint(0, 100, (size, 10))

    def __len__(self):
        return self.dataset.shape[0]

    def __getitem__(self, idx):
        return {"input_ids": self.dataset[idx], "attention_mask": torch.ones(10), "labels": self.dataset[idx]}

def collate_fn(batch):
    # Convert a list of per-sample dicts into a dict of batched (stacked) tensors
    input_ids = torch.stack([x["input_ids"] for x in batch])
    attention_mask = torch.stack([x["attention_mask"] for x in batch])
    labels = torch.stack([x["labels"] for x in batch])

    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

def main():
    accelerator = Accelerator()
    dataset = DummyDataset(size=100)
    model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-mini-instruct")
    # Pass the collate_fn defined above; the default collate would produce the same
    # batches for this dict-of-tensors dataset, but this matches the intent of the script.
    loader = torch.utils.data.DataLoader(dataset, batch_size=2, collate_fn=collate_fn)
    optimizer = torch.optim.AdamW(model.parameters())
    # FSDP wrapping (and activation checkpointing, when enabled) is applied inside prepare()
    loader, model, optimizer = accelerator.prepare(loader, model, optimizer)
    for batch in tqdm(loader):
        optimizer.zero_grad()
        outputs = model(**batch)
        accelerator.backward(outputs.loss)
        optimizer.step()

if __name__ == '__main__':
    main()

Expected behavior

When I run the code snippet above with `CUDA_VISIBLE_DEVICES=0,1,6,7 accelerate launch --num_processes 4 scripts/test_fsdp.py`, I get:

    attn_weights += causal_mask
RuntimeError: The size of tensor a (20) must match the size of tensor b (10) at non-singleton dimension 3

However, if I set `fsdp_activation_checkpointing: False`, no such error occurs; I would expect training to run the same way with activation checkpointing enabled.
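
For additional context: if I read the FSDP integration correctly, enabling `fsdp_activation_checkpointing` makes `prepare()` wrap each transformer block in PyTorch's non-reentrant activation-checkpoint wrapper. A minimal sketch of that wrapping, applied by hand, is below; the `wrap_with_checkpointing` helper and the `Phi3DecoderLayer` class-name check are my own assumptions for illustration, not accelerate's exact code.

import functools

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

def wrap_with_checkpointing(model):
    # Wrap every decoder layer in a non-reentrant activation-checkpoint wrapper,
    # so its activations are recomputed during the backward pass.
    non_reentrant_wrapper = functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    )
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=non_reentrant_wrapper,
        # Assumption: Phi-3.5-mini's decoder layers are instances of "Phi3DecoderLayer".
        check_fn=lambda module: module.__class__.__name__ == "Phi3DecoderLayer",
    )

Since the mismatch is exactly twice the sequence length (20 vs. 10), my unconfirmed guess is that the recomputed forward pass sees a KV cache left over from the first pass; setting `model.config.use_cache = False` before `prepare()` might be worth trying as a diagnostic.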

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.