huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

AttributeError: 'DataParallel' object has no attribute 'model' #20583

Closed huynhhoanghuy closed 1 year ago

huynhhoanghuy commented 1 year ago

System Info

I am training on multiple GPUs and not using pretrained weights. During training, I hit the following error, which breaks the run:

 15%|█▌        | 246/1617 [09:01<48:36,  2.13s/it]
 15%|█▌        | 247/1617 [09:03<48:21,  2.12s/it]
 15%|█▌        | 248/1617 [09:05<48:20,  2.12s/it]
 15%|█▌        | 249/1617 [09:07<48:17,  2.12s/it]
 15%|█▌        | 250/1617 [09:10<48:35,  2.13s/it]***** Running Evaluation *****
  Num examples = 1500
  Batch size = 512
{'loss': 1.2497, 'learning_rate': 4.941249226963513e-05, 'epoch': 0.04}
{'loss': 0.6803, 'learning_rate': 4.879406307977737e-05, 'epoch': 0.07}
{'loss': 0.6134, 'learning_rate': 4.817563388991961e-05, 'epoch': 0.11}
{'loss': 0.5777, 'learning_rate': 4.7557204700061845e-05, 'epoch': 0.15}
{'loss': 0.5626, 'learning_rate': 4.6938775510204086e-05, 'epoch': 0.19}
{'loss': 0.5413, 'learning_rate': 4.6320346320346326e-05, 'epoch': 0.22}
{'loss': 0.5249, 'learning_rate': 4.570191713048856e-05, 'epoch': 0.26}
{'loss': 0.5015, 'learning_rate': 4.50834879406308e-05, 'epoch': 0.3}
{'loss': 0.5017, 'learning_rate': 4.4465058750773034e-05, 'epoch': 0.33}
{'loss': 0.4924, 'learning_rate': 4.3846629560915274e-05, 'epoch': 0.37}
{'loss': 0.4831, 'learning_rate': 4.3228200371057515e-05, 'epoch': 0.41}
{'loss': 0.4695, 'learning_rate': 4.2609771181199755e-05, 'epoch': 0.45}
Traceback (most recent call last):
  File "examples/run_train.py", line 105, in <module>
    main()
  File "examples/run_train.py", line 99, in main
    train_result = trainer.train()
  File "/root/data/huyhuynh/clrcmd-master/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1340, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/root/data/huyhuynh/clrcmd-master/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1445, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/root/data/huyhuynh/clrcmd-master/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2051, in evaluate
    output = eval_loop(
  File "/root/data/huyhuynh/clrcmd-master/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2223, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/root/data/huyhuynh/clrcmd-master/src/clrcmd/trainer.py", line 29, in prediction_step
    score = model.model(inputs1, inputs2)
  File "/root/data/huyhuynh/clrcmd-master/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 947, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'model'

 15%|█▌        | 250/1617 [09:10<50:11,  2.20s/it]
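
For context: torch.nn.DataParallel keeps the wrapped network under its .module attribute and does not forward custom attributes such as .model, which is exactly what the __getattr__ in the traceback reports. A minimal standalone sketch (plain PyTorch; the Wrapper class here is hypothetical, not from clrcmd) reproduces the same AttributeError:

import torch

class Wrapper(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(4, 2)  # inner network stored under .model

wrapped = torch.nn.DataParallel(Wrapper())
print(wrapped.module.model)  # OK: DataParallel exposes the wrapped module as .module
print(wrapped.model)         # AttributeError: 'DataParallel' object has no attribute 'model'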

This is the training code:

import argparse
import logging
import os
import uuid

from transformers import TrainingArguments, set_seed

from clrcmd.data.dataset import (
    ContrastiveLearningCollator,
    NLIContrastiveLearningDataset,
    STSBenchmarkDataset,
)
from clrcmd.data.sts import load_stsb_dev
from clrcmd.models import create_contrastive_learning, create_tokenizer
from clrcmd.trainer import STSTrainer, compute_metrics
import torch

logger = logging.getLogger(__name__)

parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# fmt: off
parser.add_argument("--data-dir", type=str, help="Data directory", default="data")
parser.add_argument("--model", type=str, help="Model", default="bert-cls",
                    choices=["bert-cls", "bert-avg", "bert-rcmd", "roberta-cls", "roberta-avg", "roberta-rcmd"])
parser.add_argument("--output-dir", type=str, help="Output directory", default="ckpt")
parser.add_argument("--temp", type=float, help="Softmax temperature", default=0.05)
parser.add_argument("--seed", type=int, help="Seed", default=0)
# fmt: on

def main():
    args = parser.parse_args()

    experiment_name = f"{args.model}-{uuid.uuid4()}"
    training_args = TrainingArguments(
        os.path.join(args.output_dir, experiment_name),
        per_device_train_batch_size=128,
        per_device_eval_batch_size=128,
        learning_rate=5e-5,
        num_train_epochs=3,
        fp16=True,
        logging_strategy="steps",
        logging_steps=20,
        evaluation_strategy="steps",
        eval_steps=250,
        save_strategy="steps",
        save_steps=250,
        metric_for_best_model="eval_spearman",
        load_best_model_at_end=True,
        greater_is_better=True,
        save_total_limit=1,
        seed=args.seed,
    )
    if training_args.local_rank == -1 or training_args.local_rank == 0:
        logging.basicConfig(
            level=logging.INFO,
            format="%(asctime)s - %(message)s",
            filename=f"log/train-{experiment_name}.log",
        )
    logger.info("Hyperparameters")
    for k, v in vars(args).items():
        logger.info(f"{k} = {v}")

    # Log on each process the small summary:
    logger.warning(
        f"Process rank: {training_args.local_rank}, "
        f"device: {training_args.device}, "
        f"n_gpu: {training_args.n_gpu}, "
        f"distributed training: {bool(training_args.local_rank != -1)}, "
        f"16-bits training: {training_args.fp16} "
    )

    # Set seed before initializing model.
    set_seed(training_args.seed)

    # Load pretrained model and tokenizer
    tokenizer = create_tokenizer(args.model)
    model = create_contrastive_learning(args.model, args.temp)
    # model = torch.nn.DataParallel(model)  # tried wrapping manually, but it did not fix the issue
    model.train()

    train_dataset = NLIContrastiveLearningDataset(
        os.path.join(args.data_dir, "nli_for_simcse.csv"), tokenizer
    )
    eval_dataset = STSBenchmarkDataset(
        load_stsb_dev(os.path.join(args.data_dir, "STS", "STSBenchmark"))["dev"], tokenizer
    )

    trainer = STSTrainer(
        model=model,
        data_collator=ContrastiveLearningCollator(),
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    train_result = trainer.train()
    logger.info(train_result)
    # Trainer has no .module attribute; save_model lives on the trainer itself
    trainer.save_model(os.path.join(training_args.output_dir, "checkpoint-best"))

if __name__ == "__main__":
    main()

I searched for this problem but didn't find any solution. Could you help me?

Who can help?

No response

Information

Tasks

Reproduction

Train on multiple GPUs so that the Trainer wraps the model in DataParallel, then run train_result = trainer.train(); the error is raised at the first evaluation step.

Expected behavior

Training should run through evaluation without raising the AttributeError.

viettham1998 commented 1 year ago

Contact me, I fixed it.

atturaioe commented 1 year ago

Hi @huynhhoanghuy. I think the clrcmd trainer is trying to access model.model while your model is wrapped in DataParallel, so the wrapper has no .model attribute. See the addressed issue
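
A minimal sketch of a possible fix, assuming the prediction_step shown in the traceback (src/clrcmd/trainer.py, line 29); the unwrap_model helper is written out here for illustration, not imported from clrcmd:

import torch

def unwrap_model(model: torch.nn.Module) -> torch.nn.Module:
    # DataParallel and DistributedDataParallel both store the wrapped
    # network under .module; plain modules pass through unchanged.
    if isinstance(model, (torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# In prediction_step, replace
#     score = model.model(inputs1, inputs2)
# with
#     score = unwrap_model(model).model(inputs1, inputs2)

This keeps single-GPU runs working (no wrapper, so the model passes through unchanged) while unwrapping the DataParallel wrapper that Trainer adds when more than one GPU is visible.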

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.