huggingface/transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Exception raised with trainer + `accelerate launch` FSDP + large gradient accumulation steps + small dataset #33413

Open tomtseng opened 1 month ago

tomtseng commented 1 month ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

This is a duplicate of #24098 and #25695, but I figured it'd still be useful to resubmit this issue since (1) I have a code example, and (2) I include a different error message that I get with mixed precision, which may increase visibility for other people who run into this problem and search existing GitHub issues.

When I do multi-GPU training (launched with `accelerate launch --num_processes=2`) using `Trainer` with a small dataset and `gradient_accumulation_steps > 2`, I often get the following error repeatedly:

Traceback (most recent call last):
  File "/workspace/program.py", line 34, in <module>
    trainer.train()
  File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2341, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 150, in step
    self.optimizer.step(closure)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 385, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 187, in step
    adamw(
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 339, in adamw
    func(
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 549, in _multi_tensor_adamw
    torch._foreach_lerp_(device_exp_avgs, device_grads, 1 - beta1)
RuntimeError: The size of tensor a (3219712) must match the size of tensor b (128) at non-singleton dimension 1

If FP16 mixed precision is enabled, the error looks like this instead:

Traceback (most recent call last):
  File "/workspace/program.py", line 34, in <module>
    trainer.train()
  File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2341, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 137, in step
    self.scaler.step(self.optimizer, closure)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 457, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 352, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 192, in patched_step
    return method(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 385, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 187, in step
    adamw(
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 339, in adamw
    func(
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 516, in _multi_tensor_adamw
    grouped_tensors = Optimizer._group_tensors_by_device_and_dtype([
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 409, in _group_tensors_by_device_and_dtype
    return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/utils/_foreach_utils.py", line 38, in _group_tensors_by_device_and_dtype
    torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices).items()
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding

Here's a minimal example. Run the following with `accelerate launch --config_file=accelerate_config.yaml --num_processes=2 program.py`:

# program.py
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = Dataset.from_dict(
    {"text": ["positive", "negative"], "label": [1, 0]}
)  # tiny dataset of 2 examples

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m")
tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"]), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "EleutherAI/pythia-14m", num_labels=2
)
model.config.pad_token_id = tokenizer.eos_token_id

training_args = TrainingArguments(
    output_dir="/tmp/results",
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()

# accelerate_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: "no"
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: "no"  # change this to "fp16" to get the other error
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
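
For context, a back-of-the-envelope calculation (assuming the effective batch size is `per_device_train_batch_size * gradient_accumulation_steps * num_processes`, as suggested in #24098) shows that this setup doesn't have enough data for even one full optimizer step:

# Rough arithmetic for the reproduction above; the effective-batch-size formula
# is my assumption based on the discussion in #24098, not something I verified
# in the Trainer source.
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
num_processes = 2  # from accelerate launch --num_processes=2
dataset_size = 2   # the toy dataset above has two examples

examples_per_optimizer_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_processes
)
print(examples_per_optimizer_step, dataset_size)  # 64 needed vs. 2 available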

My use case: I had a codebase where we had added some end-to-end tests. We used a very small dataset so the tests would still be reasonably fast, but then we ran into these exceptions and were confused.

Expected behavior

I'd expect this to just work without crashing, but maybe it's not really a sensible setup to have such a small training set. In #24098, commenters suggested that the training set size

has to be greater than `gradient_accumulation_steps * num_GPUs * per_device_train_batch_size`.

In that case, it would be nice to get an error message saying that this is the problem.
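
As a rough sketch of what such a check could look like (this is not an existing Trainer feature; the function name and the exact threshold are my assumptions based on the rule of thumb quoted above):

# Hypothetical pre-flight check; not part of transformers. It just turns the
# rule of thumb from #24098 into an explicit error message.
def check_dataset_large_enough(
    num_examples: int,
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_processes: int,
) -> None:
    examples_per_step = (
        per_device_train_batch_size * gradient_accumulation_steps * num_processes
    )
    if num_examples <= examples_per_step:
        raise ValueError(
            f"The training set has only {num_examples} examples, but one optimizer "
            f"step consumes {examples_per_step} "
            f"(per_device_train_batch_size x gradient_accumulation_steps x "
            f"num_processes). Lower gradient_accumulation_steps or use more data."
        )

# With the reproduction above, this raises immediately with a clear message
# instead of crashing deep inside the optimizer:
try:
    check_dataset_large_enough(
        num_examples=2,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        num_processes=2,
    )
except ValueError as err:
    print(err)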

github-actions[bot] commented 15 hours ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

SunMarc commented 12 hours ago

@MekkCyber is looking into that!