huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer's load_best_model_at_end argument results in error with DistributedDataParallel #10429

Closed abhishek0318 closed 3 years ago

abhishek0318 commented 3 years ago

Environment info

Who can help

@sgugger

Information

Model I am using (Bert, XLNet ...): T5

The problem arises when using:

    import os

    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir=os.path.join(output_dir, 'results'),
        overwrite_output_dir=True,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        warmup_steps=warmup_steps,
        weight_decay=weight_decay,
        logging_dir=os.path.join(output_dir, 'logs'),
        logging_steps=100,
        learning_rate=learning_rate,
        evaluation_strategy="epoch",
        max_grad_norm=max_grad_norm,
        metric_for_best_model="eval_loss",
        report_to=['tensorboard'],
        local_rank=local_rank)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset
    )

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. Set load_best_model_at_end=True when using DistributedDataParallel (python -m torch.distributed.launch ...). The following error appears after training is complete.
  2. If you don't use DistributedDataParallel, or don't set load_best_model_at_end to True, then this works as expected and there is no error.

    OSError: Can't load config for 'checkpoint-115'. Make sure that:
    - 'checkpoint-115' is a correct model identifier listed on 'https://huggingface.co/models'
    - or 'checkpoint-115' is the correct path to a directory containing a config.json file
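The error message names one local condition: the path must be a directory containing a config.json file. A minimal sketch for checking that condition from the launch directory follows. `describe_checkpoint` is a hypothetical helper, not part of transformers, and it assumes a bare name such as 'checkpoint-115' is resolved against the current working directory of the process. That assumption is not confirmed anywhere in this thread.

```python
import os


def describe_checkpoint(path):
    """Report whether `path` looks like a loadable checkpoint directory.

    Hypothetical helper for illustration only. It checks just the condition
    the error message names: a directory that contains config.json.
    """
    # Assumption: a relative path is resolved against the current working directory.
    abs_path = os.path.abspath(path)
    is_dir = os.path.isdir(abs_path)
    has_config = is_dir and os.path.isfile(os.path.join(abs_path, "config.json"))
    return {"abs_path": abs_path, "is_dir": is_dir, "has_config": has_config}


if __name__ == "__main__":
    # 'checkpoint-115' is the name from the error message; 'results' is the
    # sub-directory used for output_dir in the TrainingArguments shown earlier.
    for candidate in ("checkpoint-115", os.path.join("results", "checkpoint-115")):
        print(describe_checkpoint(candidate))
```

If the bare name reports `is_dir` as False while the path under the output directory reports `has_config` as True, the checkpoint was written correctly and only the path used to reload it is off.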

Expected behavior

No error.

sgugger commented 3 years ago

Could you explain a bit more about the code you are running, as well as the exact command you are using to launch it? We can't help if we can't reproduce your bug, and running:

python -m torch.distributed.launch --nproc_per_node 2 examples/text-classification/run_glue.py \
  --model_name_or_path bert-base-uncased \
  --task_name mrpc --output_dir test/mrpc \
  --load_best_model_at_end \
  --do_train \
  --do_eval \
  --evaluation_strategy epoch \
  --overwrite_output_dir  

for instance, does not reproduce it.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.