huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Can't run QA fine-tune for bert/albert in distributed way #12159

Closed yl-to closed 3 years ago

yl-to commented 3 years ago

Environment info

Who can help

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

The tasks I am working on is:

To reproduce

Steps to reproduce the behavior: Just run:

python -m torch.distributed.launch --nproc_per_node=8 run_qa.py \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --learning_rate 3e-5 \
    --num_train_epochs 1 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./new_out \
    --max_steps 100 \
    --per_device_eval_batch_size=3   \
    --per_device_train_batch_size=3 \
    --cache_dir .

It fails with the error below:

[INFO|trainer.py:2115] 2021-06-14 19:01:08,718 >> ***** Running Evaluation *****
[INFO|trainer.py:2117] 2021-06-14 19:01:08,718 >>   Num examples = 10784
[INFO|trainer.py:2120] 2021-06-14 19:01:08,718 >>   Batch size = 3
Traceback (most recent call last):
  File "run_qa.py", line 622, in <module>
    main()
  File "run_qa.py", line 581, in main
    metrics = trainer.evaluate()
  File "/workspace/transformers/examples/pytorch/question-answering/trainer_qa.py", line 44, in evaluate
    output = eval_loop(
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2162, in evaluation_loop
    logits = self._nested_gather(logits)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2252, in _nested_gather
    tensors = distributed_concat(tensors)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 154, in distributed_concat
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 154, in <genexpr>
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 156, in distributed_concat
    dist.all_gather(output_tensors, tensor)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1862, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be non-overlapping and dense

(The same traceback is printed by each of the 8 worker processes, interleaved in the original output.)
Killing subprocess 22340
Killing subprocess 22341
Killing subprocess 22342
Killing subprocess 22343
Killing subprocess 22344
Killing subprocess 22345
Killing subprocess 22346
Killing subprocess 22347
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'run_qa.py', '--local_rank=7', '--model_name_or_path', 'bert-large-uncased-whole-word-masking', '--dataset_name', 'squad', '--do_train', '--do_eval', '--learning_rate', '3e-5', '--num_train_epochs', '1', '--max_seq_length', '384', '--doc_stride', '128', '--output_dir', './new_out', '--max_steps', '100', '--per_device_eval_batch_size=3', '--per_device_train_batch_size=3', '--cache_dir', '.']' returned non-zero exit status 1.
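For context: the failing call is `torch.distributed.all_gather`, and "Tensors must be non-overlapping and dense" is PyTorch's way of saying the tensor being gathered must be contiguous in memory. A minimal sketch of how a non-contiguous tensor arises and how `.contiguous()` restores a dense layout (this illustrates the general mechanism only, not necessarily the exact fix applied in the linked PR):

```python
import torch

# A transpose returns a view whose strides are not row-major, so the
# tensor is no longer "dense"; handing such a view to dist.all_gather
# raises "Tensors must be non-overlapping and dense".
x = torch.arange(6.0).reshape(2, 3)
view = x.t()                      # non-contiguous view of x
print(view.is_contiguous())       # False

dense = view.contiguous()         # copies into fresh row-major storage
print(dense.is_contiguous())      # True
print(torch.equal(dense, view))   # True: same values, dense layout
```

Calling `.contiguous()` before collective ops is a no-op for tensors that are already dense, so it is a cheap safeguard in gather paths.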

Expected behavior

yl-to commented 3 years ago

@sgugger @philschmid

sgugger commented 3 years ago

Could you confirm #11872 fixes it?

yl-to commented 3 years ago

Could you confirm #11872 fixes it?

yeah, confirmed, closing issue.