Trainer's load_best_model_at_end argument results in error with DistributedDataParallel

abhishek0318 commented 3 years ago

Environment info

transformers version: 4.3.0
Platform: Linux
Python version: 3.8.5
PyTorch version (GPU?): 1.7.1 (CUDA Version: 11.2)
Tensorflow version (GPU?): NA
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: Yes, DistributedDataParallel

Who can help

@sgugger

Information

Model I am using (Bert, XLNet ...): T5

The problem arises when using:

[ ] the official example scripts: (give details below)
[x] my own modified scripts: (give details below)

    training_args = TrainingArguments(
        output_dir=os.path.join(output_dir, 'results'),
        overwrite_output_dir=True,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        warmup_steps=warmup_steps,
        weight_decay=weight_decay,
        logging_dir=os.path.join(output_dir, 'logs'),
        logging_steps=100,
        learning_rate=learning_rate,
        evaluation_strategy="epoch",
        max_grad_norm=max_grad_norm,
        metric_for_best_model="eval_loss",
        report_to=['tensorboard'],
        local_rank=local_rank)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset
    )

The task I am working on is:

[ ] an official GLUE/SQUaD task: (give the name)
[x] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Set load_best_model_at_end=True when using DistributedDataParallel (python -m torch.distributed.launch ...). The following error appears after training is complete.
If you don't use DistributedDataParallel, or don't set load_best_model_at_end to True, then this works as expected and there is no error.

    OSError: Can't load config for 'checkpoint-115'. Make sure that:
    - 'checkpoint-115' is a correct model identifier listed on 'https://huggingface.co/models'
    - or 'checkpoint-115' is the correct path to a directory containing a config.json file
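
The message says that 'checkpoint-115' was neither a known model identifier nor a directory containing a config.json file. The thread does not establish the cause, so the following is only a minimal diagnostic sketch. It assumes the name is meant to be a checkpoint directory and checks where a relative name resolves and whether config.json is present there. The helper name describe_checkpoint and its base_dir parameter are hypothetical and are not part of transformers.

```python
import os


def describe_checkpoint(path, base_dir=None):
    """Report where `path` resolves and whether it holds a config.json.

    Hypothetical helper for debugging only; not part of transformers.
    """
    # A relative name such as 'checkpoint-115' is resolved against base_dir,
    # which defaults to the current working directory of the process.
    base_dir = base_dir or os.getcwd()
    resolved = path if os.path.isabs(path) else os.path.join(base_dir, path)
    return {
        "resolved_path": os.path.abspath(resolved),
        "is_dir": os.path.isdir(resolved),
        "has_config": os.path.isfile(os.path.join(resolved, "config.json")),
    }


if __name__ == "__main__":
    # Run once per process to see what each rank would find on disk.
    print(describe_checkpoint("checkpoint-115"))
```

Calling this from each launched process, once with the default base_dir and once with base_dir set to the output directory passed to TrainingArguments, may show whether the name is being resolved relative to an unexpected location.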

Expected behavior

No error.

sgugger commented 3 years ago

Could you explain a bit more about the code you are running and give the exact command you use to launch it? We can't help if we can't reproduce your bug, and running:

python -m torch.distributed.launch --nproc_per_node 2 examples/text-classification/run_glue.py \
  --model_name_or_path bert-base-uncased \
  --task_name mrpc --output_dir test/mrpc \
  --load_best_model_at_end \
  --do_train \
  --do_eval \
  --evaluation_strategy epoch \
  --overwrite_output_dir

for instance does not reproduce it.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
