huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Training Bert2Bert with EncoderDecoderModel and Seq2SeqTrainer results in CUDA OOM #9647

Closed: segef closed this issue 3 years ago

segef commented 3 years ago

Hi, I am trying to train a Bert2Bert model for text summarization. I followed the exact steps in BERT2BERT for CNN/Dailymail. The only things I changed are the training arguments and the metrics. I also tried replacing seq2seq_trainer with Seq2SeqTrainer from the package itself, with the same result. I am using the bert-base-uncased model for BERT and CNN/Dailymail as the dataset (just as introduced in the colab).
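
For completeness, the tokenizer, rouge metric, and bert2bert model referenced below are created as in the colab. Roughly (a minimal sketch of that setup; the checkpoint and config values follow the notebook and may not match my run exactly):

import datasets
from transformers import BertTokenizer, EncoderDecoderModel

# shared BERT tokenizer for encoder and decoder
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# rouge metric used in compute_metrics below
rouge = datasets.load_metric("rouge")

# tie two bert-base-uncased checkpoints into one seq2seq model
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# special-token config needed for generation, as set in the notebook
bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.config.eos_token_id = tokenizer.sep_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id

My training setup is then: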

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from transformers.trainer_utils import EvaluationStrategy

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # decode the predictions; replace the -100 ignore index in the labels with
    # the pad token id so the labels can be decoded as well
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge1", "rouge2"])
    rouge1 = rouge_output["rouge1"].mid
    rouge2 = rouge_output["rouge2"].mid

    return {
        "rouge1_precision": round(rouge1.precision, 4),
        "rouge1_recall": round(rouge1.recall, 4),
        "rouge1_fmeasure": round(rouge1.fmeasure, 4),
        "rouge2_precision": round(rouge2.precision, 4),
        "rouge2_recall": round(rouge2.recall, 4),
        "rouge2_fmeasure": round(rouge2.fmeasure, 4),
    }

training_args = Seq2SeqTrainingArguments(
    output_dir=output_folder,
    logging_dir=log_folder,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    evaluation_strategy=EvaluationStrategy.STEPS,
    do_train=True,
    do_eval=True,
    logging_steps=1000,  # set to 1000 for full training
    load_best_model_at_end=True,
    metric_for_best_model='rouge1_fmeasure',
    eval_steps=8000,  # set to 8000 for full training
    warmup_steps=2000,  # set to 2000 for full training
    overwrite_output_dir=True,
    save_total_limit=2,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=bert2bert,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=val_data,
)
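
Training is then launched with the usual call, and this is where the OOM shows up:

trainer.train()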

Even with batch_size=1, I am getting an OOM. It seems like CUDA does not free any memory at all.
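
For reference, this is one way to inspect what the allocator is holding (a small diagnostic sketch, independent of the training script):

import torch

# tensors currently allocated by this process vs. memory cached by the allocator
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")

# detailed breakdown from the caching allocator
print(torch.cuda.memory_summary())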

The versions of transformers and torch that I am using are as follows: transformers 4.2.0, torch 1.7.1+cu110.

Can you help me with this? What do you think the problem might be?

LysandreJik commented 3 years ago

Hello! What is your machine? When you run the script, at which point does it fail? Right off the bat, or after a few sequences have been processed?
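
For example, the output of nvidia-smi, or a quick check from Python (just a suggestion for gathering the details):

import torch

print(torch.cuda.get_device_name(0))                      # GPU model
print(torch.cuda.get_device_properties(0).total_memory)   # total VRAM in bytes
print(torch.__version__, torch.version.cuda)              # torch / CUDA versions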

segef commented 3 years ago

I have tried it on my local GTX 1650 and also on a 16 GB T100. Both fail while processing the first sequence. It is not always at the same line, but mostly during the forward pass of BERT's SelfAttention module. I also decreased the input sizes while processing the data with the tokenizer; it then manages to process one sequence but fails with OOM again on the second one. Additionally, I tried training directly on Colab, and it fails with an OOM there, too.
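
By "decreased the input sizes" I mean lowering the max_length values used when tokenizing the dataset, along the lines of the notebook's preprocessing function. A rough sketch with illustrative lengths (column names follow CNN/DailyMail):

encoder_max_length = 256
decoder_max_length = 64

def process_data_to_model_inputs(batch):
    # tokenize articles and summaries with hard truncation to cap sequence length
    inputs = tokenizer(batch["article"], padding="max_length",
                       truncation=True, max_length=encoder_max_length)
    outputs = tokenizer(batch["highlights"], padding="max_length",
                        truncation=True, max_length=decoder_max_length)

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    batch["labels"] = outputs.input_ids.copy()
    # replace pad token ids in the labels with -100 so they are ignored by the loss
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]
    return batch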

segef commented 3 years ago

Not sure how or why, but the training started working on the T100 even though I haven't really changed anything. The GPU might just have been overloaded back then. I will close this issue.