huggingface/transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Exception raised with trainer + `accelerate launch` FSDP + large gradient accumulation steps + small dataset #33413

Open tomtseng opened 1 month ago

tomtseng commented 1 month ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

This is a duplicate of #24098 and #25695, but I figured it'd still be useful to resubmit this issue since (1) I have a code example, and (2) I include a different error message that I get with mixed precision, which may increase visibility for other people who run into this problem and search existing GitHub issues.

When I do multi-GPU training (launched with `accelerate launch --num_processes=2`) using `Trainer` with a small dataset and `gradient_accumulation_steps > 2`, I often get the following error repeatedly:

Traceback (most recent call last):
  File "/workspace/program.py", line 34, in <module>
    trainer.train()
  File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2341, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 150, in step
    self.optimizer.step(closure)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 385, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 187, in step
    adamw(
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 339, in adamw
    func(
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 549, in _multi_tensor_adamw
    torch._foreach_lerp_(device_exp_avgs, device_grads, 1 - beta1)
RuntimeError: The size of tensor a (3219712) must match the size of tensor b (128) at non-singleton dimension 1

If FP16 mixed precision is enabled, the error looks like this instead:

Traceback (most recent call last):
  File "/workspace/program.py", line 34, in <module>
    trainer.train()
  File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/usr/local/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2341, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 137, in step
    self.scaler.step(self.optimizer, closure)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 457, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 352, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 192, in patched_step
    return method(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 385, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 187, in step
    adamw(
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 339, in adamw
    func(
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 516, in _multi_tensor_adamw
    grouped_tensors = Optimizer._group_tensors_by_device_and_dtype([
  File "/usr/local/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 409, in _group_tensors_by_device_and_dtype
    return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/venv/lib/python3.10/site-packages/torch/utils/_foreach_utils.py", line 38, in _group_tensors_by_device_and_dtype
    torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices).items()
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding

Here's a minimal example. Run the following with `accelerate launch --config_file=accelerate_config.yaml --num_processes=2 program.py`:

# program.py
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = Dataset.from_dict(
    {"text": ["positive", "negative"], "label": [1, 0]}
)  # tiny dataset of 2 examples

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m")
tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"]), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "EleutherAI/pythia-14m", num_labels=2
)
model.config.pad_token_id = tokenizer.eos_token_id

training_args = TrainingArguments(
    output_dir="/tmp/results",
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()

# accelerate_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: "no"
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: "no"  # change this to "fp16" to get the other error
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
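
For context, a back-of-the-envelope calculation (assuming the effective batch size is `per_device_train_batch_size * gradient_accumulation_steps * num_processes`, as suggested in #24098) shows that this setup doesn't have enough data for even one full optimizer step:

# Rough arithmetic for the reproduction above; the effective-batch-size formula
# is my assumption based on the discussion in #24098, not something I verified
# in the Trainer source.
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
num_processes = 2  # from accelerate launch --num_processes=2
dataset_size = 2   # the toy dataset above has two examples

examples_per_optimizer_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_processes
)
print(examples_per_optimizer_step, dataset_size)  # 64 needed vs. 2 available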

My use case: I had a codebase where we had added some end-to-end tests. We used a very small dataset so the tests would still be reasonably fast, but then we ran into these exceptions and were confused.

Expected behavior

I'd expect this to just work without crashing, but maybe it's not really a sensible setup to have such a small training set. In #24098, commenters suggested that the training set size

has to be greater than `gradient_accumulation_steps * num_GPUs * per_device_train_batch_size`.

In that case, it would be nice to get an error message saying that this is the problem.
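
As a rough sketch of what such a check could look like (this is not an existing Trainer feature; the function name and the exact threshold are my assumptions based on the rule of thumb quoted above):

# Hypothetical pre-flight check; not part of transformers. It just turns the
# rule of thumb from #24098 into an explicit error message.
def check_dataset_large_enough(
    num_examples: int,
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_processes: int,
) -> None:
    examples_per_step = (
        per_device_train_batch_size * gradient_accumulation_steps * num_processes
    )
    if num_examples <= examples_per_step:
        raise ValueError(
            f"The training set has only {num_examples} examples, but one optimizer "
            f"step consumes {examples_per_step} "
            f"(per_device_train_batch_size x gradient_accumulation_steps x "
            f"num_processes). Lower gradient_accumulation_steps or use more data."
        )

# With the reproduction above, this raises immediately with a clear message
# instead of crashing deep inside the optimizer:
try:
    check_dataset_large_enough(
        num_examples=2,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        num_processes=2,
    )
except ValueError as err:
    print(err)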

github-actions[bot] commented 15 hours ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

SunMarc commented 12 hours ago

@MekkCyber is looking into that!