
RAM OOM with a large save_steps using trainer API for MLM training #16120

Closed: Sumsky21 closed this issue 2 years ago

Sumsky21 commented 2 years ago

Environment info

Who can help

Electra model: @LysandreJik
Trainer: @sgugger

Information

Model I am using (Bert, XLNet ...): Electra(ForMaskedLM)

The problem arises when using:

The tasks I am working on is:

Expected behavior

I don't know whether this is a bug, so I'll describe the situation first.

I'm using the Trainer API to train an ElectraForMaskedLM model from scratch. I wrote a simple script (shown below) and set the save_steps argument to 10000. After running the script, I observed that RAM usage (system memory, NOT CUDA memory) increases continuously, eventually causing an OOM error that terminated the Docker container. Meanwhile, GPU utilization, CUDA memory, and CPU usage all looked normal.

(Screenshot 2022-03-13 171751: RAM usage rising continuously)

To investigate, I tried adjusting several training arguments and found that with a smaller save_steps (e.g. 2000), RAM usage goes up and down periodically, and the cycle length is approximately equal to the interval between two saved checkpoints, as in the figure below:

(Screenshot 2022-03-13 172757: RAM usage oscillating with a period matching the checkpoint interval)

So is there a bug in the checkpoint-saving mechanism? Or is this expected behavior, meaning I should simply set a smaller save_steps?

The training script is below:

import torch
import numpy as np
from transformers import ElectraConfig, ElectraTokenizerFast, ElectraForMaskedLM
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = "cpu"
print(device)

# Tokenizer
print("Importing tokenizer and model...")
tokenizer = ElectraTokenizerFast(tokenizer_file='./models/bert-custom.json')
# Model
config = ElectraConfig(vocab_size=10000)
model = ElectraForMaskedLM(config=config).to(device)

# dataset
print("Importing dataset...")
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/in_neg_1e7.csv",
    block_size=80,
)
eval_set = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/in_eval_1e6.csv",
    block_size=80,
)

from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

from transformers import Trainer, TrainingArguments, EarlyStoppingCallback
print("Start Training...")
training_args = TrainingArguments(
    output_dir="./electra_save/",
    overwrite_output_dir=True,
    num_train_epochs=50,
    per_device_train_batch_size=512,
    per_device_eval_batch_size=512,
    eval_accumulation_steps=300,
    save_steps=10000,                             # the key argument
    save_total_limit=2,
    dataloader_num_workers=80,
    evaluation_strategy='steps', 
    eval_steps=10000,
    logging_steps=1000,
    metric_for_best_model='eval_loss',
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    tokenizer=tokenizer,
    train_dataset=dataset,
    eval_dataset=eval_set,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)

trainer.train()

trainer.save_model(output_dir='./electra_save/')

To reproduce

Steps to reproduce the behavior:

  1. Run the training script above with a large dataset (I used a dataset with 10M lines of text) and a large save_steps value.
  2. Wait for a while and monitor memory usage (for example with a callback like the sketch after this list).
  3. Hit the OOM.
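
A minimal sketch of how the process RSS can be logged from inside training, in case it helps reproduction. This assumes psutil is installed, and MemoryLoggerCallback is a name introduced here for illustration, not something from transformers:

import os
import psutil  # assumption: psutil is available in the environment
from transformers import TrainerCallback

class MemoryLoggerCallback(TrainerCallback):
    """Print the process RSS at every logging step and at every checkpoint save."""

    def __init__(self):
        self._proc = psutil.Process(os.getpid())

    def _rss_gib(self):
        return self._proc.memory_info().rss / 1024 ** 3

    def on_step_end(self, args, state, control, **kwargs):
        # log at the same cadence as the Trainer's own logging
        if args.logging_steps and state.global_step % args.logging_steps == 0:
            print(f"step {state.global_step}: RSS = {self._rss_gib():.2f} GiB")

    def on_save(self, args, state, control, **kwargs):
        # log right after each checkpoint is written
        print(f"save at step {state.global_step}: RSS = {self._rss_gib():.2f} GiB")

# usage: callbacks=[MemoryLoggerCallback(), EarlyStoppingCallback(early_stopping_patience=10)]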
Sumsky21 commented 2 years ago

Addition: Previously I used a similar script to train BertForMaskedLM, and no memory fluctuations were observed that time (Screenshot 2022-03-13 185104: RAM usage stayed flat during the Bert run). That's why I suspect the current behavior is abnormal.
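
The earlier Bert script is not reproduced verbatim here; it was essentially the script above with only the config and model classes swapped, roughly like this sketch (values assumed to match the Electra run):

import torch
from transformers import BertConfig, BertForMaskedLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# same custom tokenizer, datasets, collator and TrainingArguments as in the
# Electra script above; only these two lines differ
config = BertConfig(vocab_size=10000)
model = BertForMaskedLM(config=config).to(device)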

sgugger commented 2 years ago

So the issue only appears with Electra, right? Did you encounter it for any other model?

Sumsky21 commented 2 years ago

Yes. I've only tried Bert and Electra so far (and there was no problem during the Bert training, as the figure above shows).

If needed, I can try a similar script with RobertaForMaskedLM in the next few days.
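
For that comparison, the swap would look roughly like the sketch below. Reusing the same custom tokenizer file with RoBERTa is an assumption on my part, since RoBERTa normally uses a BPE vocabulary, so the tokenizer line may need adjusting:

import torch
from transformers import PreTrainedTokenizerFast, RobertaConfig, RobertaForMaskedLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# assumption: load the existing custom tokenizer file through the generic fast
# tokenizer class; the MLM data collator needs a mask token defined on it
tokenizer = PreTrainedTokenizerFast(tokenizer_file='./models/bert-custom.json')
config = RobertaConfig(vocab_size=10000)
model = RobertaForMaskedLM(config=config).to(device)
# the rest of the script (datasets, collator, Trainer) stays the same as above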

sgugger commented 2 years ago

That would be helpful if you could!

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.