huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

transformers seems to have recently been "bricked" #13798

Closed · quantitative-technologies closed this issue 2 years ago

quantitative-technologies commented 2 years ago

Environment info

Who can help

@sgugger

Information

The example script below was working fine until today. I believe it was working in version 4.11.0.dev0. If you can tell me how to check out the source for 4.11.0.dev0 from GitHub, I will confirm that it works.
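
In case it helps: my understanding is that dev versions like 4.11.0.dev0 are not tagged on GitHub, so the way to test one is to install from source at a commit from around that time. A sketch (the commit hash below is a placeholder, not a real one):

git clone https://github.com/huggingface/transformers.git
cd transformers
git log --oneline          # pick a commit from the period in question
git checkout <commit-sha>  # placeholder: substitute the commit you want to test
pip install -e .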

To reproduce

Steps to reproduce the behavior:

On a high-RAM TPU Colab instance, run:

CHECKPOINT=bert-large-uncased
DATASET=rte
EPOCHS=2
BATCH_SIZE=16
LEARNING_RATE=3e-5

python transformers/examples/pytorch/xla_spawn.py --num_cores 8 \
  transformers/examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path $CHECKPOINT \
  --task_name $DATASET \
  --seed 10000 \
  --output_dir results \
  --overwrite_output_dir \
  --num_train_epochs $EPOCHS \
  --evaluation_strategy no \
  --logging_strategy epoch \
  --save_strategy epoch \
  --per_device_train_batch_size $BATCH_SIZE \
  --per_device_eval_batch_size $BATCH_SIZE \
  --learning_rate $LEARNING_RATE \
  --do_train

Gives the error:

Exception in device=TPU:7: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:4: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:2: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:1: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:6: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:5: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:3: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:0: zero-dimensional tensor (at position 0) cannot be concatenated

  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data

uted/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
Traceback (most recent call last):
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated

  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
 50%|████████████▌             | 20/40 [08:22<08:22, 25.15s/it]
Traceback (most recent call last):
  File "transformers/examples/pytorch/xla_spawn.py", line 85, in <module>
    main()
  File "transformers/examples/pytorch/xla_spawn.py", line 81, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 144, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 17

Expected behavior

No error.
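
For what it's worth, the failing call looks easy to reproduce in isolation: xm.mesh_reduce reduces the gathered per-process losses with torch.cat, and torch.cat refuses zero-dimensional tensors, which is exactly what the scalar tr_loss is. A quick sanity check of my own (not taken from the Trainer code):

# torch.cat cannot concatenate 0-d tensors, matching the error above:
python -c "import torch; torch.cat([torch.tensor(1.0)])"
# RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated

# Reshaping the scalar to a 1-element tensor makes the same reduction work:
python -c "import torch; print(torch.cat([torch.tensor(1.0).reshape(1)]))"
# tensor([1.])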

sgugger commented 2 years ago

I see where the problem comes from. Will push a fix tonight or tomorrow morning, then we will do a patch release. In the meantime, you should see no error by staying on v4.10.
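
For example:

# Pin to the latest 4.10.x release until the patch release is out:
pip install "transformers==4.10.*"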

odellus commented 2 years ago

I run out of memory on transformers v4.X (X > 10) when training led-large-16384-arxiv with four gradient accumulation steps and a batch size of two, as in this notebook, on an A6000 with 48 GB of RAM. I had to bump gradient accumulation steps and batch size down to 1 each to fit the model + batch on the GPU. Wild. I don't really feel like opening a separate issue, but I thought I'd chirp in here and say that with v4.10.1 I can fit up to 8 samples per batch with four gradient accumulation steps on the A6000.
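
For concreteness, the knobs in question are the standard Trainer arguments, and the effective batch size is just their product (my numbers from above, restated):

# effective batch size = per_device_train_batch_size * gradient_accumulation_steps
--per_device_train_batch_size 2 --gradient_accumulation_steps 4   # notebook settings, effective 8: OOMs on >= 4.11
--per_device_train_batch_size 1 --gradient_accumulation_steps 1   # effective 1: what fits on >= 4.11
--per_device_train_batch_size 8 --gradient_accumulation_steps 4   # effective 32: fits on v4.10.1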

If you upgrade to 4.11.1 in the Colab notebook I shared, it fails; with 4.10.1 it works just fine.