huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Load large text file for LM pre-training resulting in OOM #633

Open leethu2012 opened 3 years ago

leethu2012 commented 3 years ago

I tried to pretrain Longformer using transformers and datasets, but I ran into OOM issues when loading a large text file. My script looks roughly like this:

from dataclasses import dataclass
from typing import Dict, List, Tuple

import torch

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer

@dataclass
class DataCollatorForDatasetsLanguageModeling(DataCollatorForLanguageModeling):
    """
    Data collator used for language modeling based on DataCollatorForLazyLanguageModeling
    - collates batches of tensors, honoring their tokenizer's pad_token
    - preprocesses batches for masked language modeling
    """

    block_size: int = 512

    def __call__(self, examples: List[dict]) -> Dict[str, torch.Tensor]:
        examples = [example['text'] for example in examples]
        batch, attention_mask = self._tensorize_batch(examples)
        if self.mlm:
            inputs, labels = self.mask_tokens(batch)
            return {"input_ids": inputs, "labels": labels}
        else:
            labels = batch.clone().detach()
            if self.tokenizer.pad_token_id is not None:
                labels[labels == self.tokenizer.pad_token_id] = -100
            return {"input_ids": batch, "labels": labels}

    def _tensorize_batch(self, examples: List[str]) -> Tuple[torch.Tensor, torch.Tensor]:

        if self.tokenizer._pad_token is None:
            raise ValueError(
                "You are attempting to pad samples but the tokenizer you are using"
                f" ({self.tokenizer.__class__.__name__}) does not have one."
            )

        tensor_examples = self.tokenizer.batch_encode_plus(
            [ex for ex in examples if ex],
            max_length=self.block_size,
            return_tensors="pt",
            pad_to_max_length=True,
            return_attention_mask=True,
            truncation=True,
        )

        input_ids, attention_mask = tensor_examples["input_ids"], tensor_examples["attention_mask"]
        return input_ids, attention_mask

train_dataset = load_dataset('text', data_files='train.txt', cache_dir="./", split='train')
data_collator = DataCollatorForDatasetsLanguageModeling(tokenizer=tokenizer, mlm=True, 
                      mlm_probability=0.15, block_size=tokenizer.max_len)
trainer = Trainer(model=model, args=args, data_collator=data_collator,
                      train_dataset=train_dataset, prediction_loss_only=True, )
trainer.train(model_path=model_path)

This train.txt is about 1.1GB and has 90k lines, where each line is a sequence of 4k words. During training, memory usage increased rapidly, as shown in the graph below, and resulted in OOM before training finished.

[figure: memory usage climbing steadily during training until OOM]

Could you please give me any suggestions on why this happened and how to fix it? Thanks.

lhoestq commented 3 years ago

Not sure what could cause that on the datasets side. Could this be a Trainer issue ? cc @julien-c @sgugger ?
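
In the meantime, one way to check whether the memory comes from the dataset itself is to read examples outside of the Trainer and watch the process RSS. Since a Dataset is backed by a memory-mapped Arrow file on disk, memory usage should stay roughly flat. A minimal sketch, assuming psutil is installed:

import os

import psutil
from datasets import load_dataset

# Same text file as above; the rows live in a memory-mapped Arrow file on disk
dataset = load_dataset("text", data_files="train.txt", split="train")

process = psutil.Process(os.getpid())
print(f"RSS after load: {process.memory_info().rss / 1024 ** 2:.0f} MB")

# Read examples directly, without the Trainer or collator, to isolate the datasets side
for i in range(0, len(dataset), 10_000):
    _ = dataset[i]["text"]

print(f"RSS after reading: {process.memory_info().rss / 1024 ** 2:.0f} MB")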

sgugger commented 3 years ago

There was a memory leak issue fixed recently in master. You should install from source and see if it fixes your problem.

leethu2012 commented 3 years ago

@lhoestq @sgugger Thanks for your comments. I have installed from source as you suggested, but the problem is still there. To reproduce the issue, just replace the lines above with these (load_dataset and DataCollatorForDatasetsLanguageModeling as mentioned above):

    dataset = load_dataset("bookcorpus")
    dataset = dataset.train_test_split(test_size=0.1)
    train_dataset = dataset['train']
    eval_dataset = dataset['test'] if training_args.do_eval else None

    data_collator = DataCollatorForDatasetsLanguageModeling(
        tokenizer=tokenizer,
        mlm=data_args.mlm,
        mlm_probability=data_args.mlm_probability,
        block_size=data_args.block_size
    )

and run by:

python run_language_modeling.py \
    --output_dir=output \
    --model_type=bert \
    --model_name_or_path=bert-base-uncased \
    --do_train \
    --do_eval \
    --mlm

gaceladri commented 3 years ago

Same here. I'm pre-training on wikitext-103 to run some tests. At the end of training it takes 32GB of RAM + ~30GB of swap. I installed datasets==1.1.0, not built from source. I will try uninstalling it and building from source when it finishes.

thomwolf commented 3 years ago

This seems to be on the transformers library side.

If you have more information (pip env) or, even better, a colab reproducing the error, we can investigate.

gaceladri commented 3 years ago

It seems to be solved with fresh versions of transformers. I tried to replicate the error after a fresh pip install of transformers & datasets on Colab, and the error no longer occurs. On Colab it stays stable at 5GB! (Y)

Edit: Thanks for your great work. Have a good day.

leethu2012 commented 3 years ago

@gaceladri which versions of transformers and datasets are you using now? I want to try again. Thanks.

gaceladri commented 3 years ago

transformers==3.3.1 datasets==1.1.0 tokenizers==0.8.1rc2

gaceladri commented 3 years ago

I'm doing some modifications to MobileBERT: https://colab.research.google.com/drive/1ba09ZOpyHGAOQLcsxiQAHRXl10qnMU5o?usp=sharing

gaceladri commented 3 years ago

It does not happen to me anymore. Can we close? @leethu2012

gaceladri commented 3 years ago

It's happening to me again. After 4 hours of pre-training, my RAM gets full and the kernel dies. I am using the latest transformers version as of today (4.4.0) and the latest version of datasets (1.2.1), both installed from master. The memory consumption keeps increasing.

gaceladri commented 3 years ago

It looks like it is something from pytorch/python itself :face_with_head_bandage: https://github.com/pytorch/pytorch/issues/13246

lhoestq commented 3 years ago

Thanks for the investigation @gaceladri

Apparently this happens when num_workers>0 and has to do with objects being copied-on-write. Did you try setting num_workers to 0 @gaceladri ? If the issue doesn't happen with num_workers=0 then this would confirm that it's indeed related to this python/pytorch issue.

Since a Dataset object is a wrapper of a pyarrow Table, we should investigate if the data being copied comes from the Table itself or from metadata in the Dataset object. If it comes from the metadata in the Dataset object, we should be able to implement a workaround. But if it comes from the Table, we'll need to see with the pyarrow team what we can do...
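
A rough way to test the copy-on-write hypothesis outside of the Trainer is to iterate the dataset through a plain DataLoader with num_workers=0 and then num_workers>0, and compare the memory of the whole process tree. A sketch, assuming psutil is installed and that `dataset` is already tokenized and formatted for torch:

import os

import psutil
from torch.utils.data import DataLoader

def tree_rss_mb():
    # RSS of the main process plus any DataLoader worker processes, in MB
    main = psutil.Process(os.getpid())
    procs = [main] + main.children(recursive=True)
    return sum(p.memory_info().rss for p in procs) / 1024 ** 2

# `dataset` is assumed to be an Arrow-backed datasets.Dataset already formatted
# with set_format(type="torch", columns=["input_ids", "attention_mask"])
for workers in (0, 2):
    loader = DataLoader(dataset, batch_size=1, num_workers=workers)
    peak = tree_rss_mb()
    for step, _ in enumerate(loader):
        if step % 1000 == 0:
            peak = max(peak, tree_rss_mb())
        if step >= 20_000:
            break
    print(f"num_workers={workers}: peak RSS of the process tree ~{peak:.0f} MB")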

gaceladri commented 3 years ago

@lhoestq I have tried it, and memory keeps increasing even with dataloader_num_workers=0.

lhoestq commented 3 years ago

Hmmm so this might come from another issue... Since it doesn't seem to be related to multiprocessing it should be easier to investigate though. Do you have some ideas @gaceladri ?

gaceladri commented 3 years ago

@lhoestq I took a quick look at a previously spotted bug in my env, in wandb/sdk/interface/interface.py, because sometimes when I load the dataset I get a multiprocessing error at line 510 of that file.

This bug is reported here https://github.com/huggingface/datasets/issues/847

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<timed eval> in <module>

~/anaconda3/envs/tfm/lib/python3.6/site-packages/transformers/trainer.py in train(self, model_path, trial)
    877             print(len(epoch_iterator))
    878 
--> 879             for step, inputs in enumerate(epoch_iterator):
    880 
    881                 start_step = time.time()

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _next_data(self)
   1083             else:
   1084                 del self._task_info[idx]
-> 1085                 return self._process_data(data)
   1086 
   1087     def _try_put_index(self):

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_data(self, data)
   1109         self._try_put_index()
   1110         if isinstance(data, ExceptionWrapper):
-> 1111             data.reraise()
   1112         return data
   1113 

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/_utils.py in reraise(self)
    426             # have message field
    427             raise self.exc_type(message=msg)
--> 428         raise self.exc_type(msg)
    429 
    430 

AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1083, in __getitem__
    format_kwargs=self._format_kwargs,
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1070, in _getitem
    format_kwargs=format_kwargs,
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 886, in _convert_outputs
    v = map_nested(command, v, **map_nested_kwargs)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/utils/py_utils.py", line 216, in map_nested
    return function(data_struct)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 847, in command
    return torch.tensor(x, **format_kwargs)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/warnings.py", line 101, in _showwarnmsg
    _showwarnmsg_impl(msg)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/warnings.py", line 30, in _showwarnmsg_impl
    file.write(text)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/wandb_run.py", line 729, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 186, in publish_output
    self._publish_output(o)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 191, in _publish_output
    self._publish(rec)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 510, in _publish
    if self._process and not self._process.is_alive():
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

My workaround was to just comment out those lines without looking too much into the consequences:

def _publish(self, record: pb.Record, local: bool = None) -> None:
    # if self._process and not self._process.is_alive():
    #     raise Exception("The wandb backend process has shutdown")
    ...  # rest of the method unchanged

It has worked so far... I need to try running without wandb to see if it could be causing something wrong with multiprocessing. I am going to launch the training with wandb set to false and I will let you know.

gaceladri commented 3 years ago

@lhoestq But beyond this, I got lost in the Dataset() class that reads the pyarrow files.

Edit: but you should be right that it does not have to be related to multiprocessing, since it keeps happening when num_workers=0.

lhoestq commented 3 years ago

Or maybe wandb uses multiprocessing ? One process for wandb logging and one for actual training ? If this is the case then even setting num_workers=0 would cause the process to be forked for wandb and therefore cause the memory issue.
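
To rule that out, one option is to disable the wandb integration entirely before the Trainer is created, e.g. via the WANDB_DISABLED environment variable. A minimal sketch (report_to is only available in recent transformers versions):

import os

# Disable the wandb integration before the Trainer is created
os.environ["WANDB_DISABLED"] = "true"

# Depending on the transformers version, the integration can also be turned off explicitly:
# training_args = TrainingArguments(..., report_to=[])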

gaceladri commented 3 years ago

@lhoestq could be, but if we set wandb to false this should not happen. I am going to try.

gaceladri commented 3 years ago

@lhoestq It keeps happening. I have uninstalled wandb from my env, set %env WANDB_DISABLED=true in my notebook, and commented out the wandb check in this function:

def get_available_reporting_integrations():
    integrations = []
    if is_azureml_available():
        integrations.append("azure_ml")
    if is_comet_available():
        integrations.append("comet_ml")
    if is_mlflow_available():
        integrations.append("mlflow")
    if is_tensorboard_available():
        integrations.append("tensorboard")
    # if is_wandb_available():
    #     integrations.append("wandb")
    return integrations

This was just a quick test, and the RAM usage keeps increasing. wandb doesn't seem to be the culprit here.

lhoestq commented 3 years ago

Thanks for checking @gaceladri . Let's investigate the single process setting then. If you have some sort of colab notebook with a minimal code example that shows this behavior feel free to share it @gaceladri so that we can play around with it to find what causes this. Otherwise I'll probably try to reproduce on my side at one point

gaceladri commented 3 years ago

@lhoestq Sure. Here you go: https://colab.research.google.com/drive/1ba09ZOpyHGAOQLcsxiQAHRXl10qnMU5o?usp=sharing. Let me know if the link works and whether it reproduces the issue for you. On my side it does: once training starts, the RAM keeps increasing.

Let me know. Thanks!

gaceladri commented 3 years ago

Could the bug be coming from tokenizers?

I got this warning in the terminal from my Jupyter notebook:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
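
Setting that variable before the tokenizer is first used silences the warning; a minimal sketch for the top of the notebook:

import os

# Must be set before the fast tokenizer is used for the first time (i.e. before the fork)
os.environ["TOKENIZERS_PARALLELISM"] = "false"
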
lhoestq commented 3 years ago

I've never experienced memory issues with tokenizers, so I don't know. Cc @n1t0: are you aware of any issue that would cause memory to keep increasing when the tokenizer is used in the data collator for language modeling?

gaceladri commented 3 years ago

@lhoestq Thanks for pinging n1t0. Just to clarify: that warning appeared while fine-tuning, without a collator:


from datasets import load_dataset, load_metric
import numpy as np

GLUE_TASKS = [
    "cola",
    "mnli",
    "mnli-mm",
    "mrpc",
    "qnli",
    "qqp",
    "rte",
    "sst2",
    "stsb",
    "wnli",
]
task = "mnli"
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric("glue", actual_task)
batch_size = 16
attention_type = "linear"

from transformers.models.mobilebert_mod import (
    MobileBertForSequenceClassification,
    MobileBertTokenizerFast,
)
from transformers.models.mobilebert_mod.configuration_mobilebert import (
    MobileBertConfigMod,
)
from transformers import TrainingArguments, Trainer

num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
tokenizer = MobileBertTokenizerFast.from_pretrained(
    "/media/ad/00b5422b-9d54-4449-8b5d-08eab5cdac8c/training_trfm/big_linear_layerdrop_shared/checkpoint-23000/",
    max_len=512,
)
model = MobileBertForSequenceClassification.from_pretrained(
    "/media/ad/00b5422b-9d54-4449-8b5d-08eab5cdac8c/training_trfm/big_linear_layerdrop_shared/checkpoint-23000/",
    num_labels=num_labels,
)
print(model.num_parameters())

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

encoded_dataset = dataset.map(preprocess_function, batched=True)
metric_name = (
    "pearson"
    if task == "stsb"
    else "matthews_correlation"
    if task == "cola"
    else "accuracy"
)

args = TrainingArguments(
    f"test-glue/{task}_{attention_type}",
    evaluation_strategy="steps",
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_steps=200,
    num_train_epochs=5,
    gradient_accumulation_steps=1,
    warmup_steps=10000,
    fp16=True,
    dataloader_num_workers=10,
    weight_decay=0.1,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)

validation_key = (
    "validation_mismatched"
    if task == "mnli-mm"
    else "validation_matched"
    if task == "mnli"
    else "validation"
)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

Now I have come back to pre-training. The changes I think I have made are: not formatting the dataset to torch (big_dataset.set_format(type='torch', columns=["text", "input_ids", "attention_mask", "token_type_ids"])), so maybe some column is dropped and no longer kept in memory, and this time I have not set any validation dataset in the trainer.
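
For reference, a variant of that set_format call that keeps only the tensor columns (a sketch; whether dropping the "text" column matters here is just a guess on my part):

# Keep only the columns the model consumes; other columns (e.g. "text") stay on disk
big_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "token_type_ids"]
)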

My validation dataset before:

book_corpus_eval = load_dataset(
    "bookcorpus",
    "plain_text",
    cache_dir="/home/ad/Desktop/bookcorpus",
    split="train[98:99%]",
)
book_corpus_eval = book_corpus_eval.map(encode, batched=True)
book_corpus_eval.set_format(
    type="torch", columns=["text", "input_ids", "attention_mask", "token_type_ids"]
)
book_corpus_eval = book_corpus_eval.select([i for i in range(1500)])

Maybe selecting or indexing the dataset before feeding it to the trainer does something strange.

My trainer now:


big_dataset = load_from_disk("/home/ad/Desktop/35percent_data.arrow/")

from transformers import DataCollatorForWholeWordMask

data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./big_linear_layerdrop_shared_silu_secondtry",
    overwrite_output_dir=True,
    per_device_train_batch_size=60,
    per_device_eval_batch_size=60,
    save_steps=500,
    save_total_limit=10,
    logging_first_step=True,
    logging_steps=100,
#     evaluation_strategy='steps',
#     eval_steps=250,
    gradient_accumulation_steps=8,
    fp16=True,
    dataloader_num_workers=10,
    warmup_steps=15000,
    learning_rate=6e-4,
    adam_epsilon=1e-6,
    adam_beta2=0.98,
    weight_decay=0.01,
    max_grad_norm=1.0,
    max_steps=500000, 
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=big_dataset,
#     eval_dataset=book_corpus_eval,
    tokenizer=tokenizer)

import wandb
wandb.login()

trainer.train()

And surprisingly, the RAM now keeps going up and down. The training has been running for 12h now without filling up the RAM. I don't know what could be causing the leak. :mag:

Edit: I hadn't looked at the swap memory, which keeps increasing. So the problem persists.
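
A quick way to keep an eye on RAM and swap together while it runs (a sketch, assuming psutil):

import psutil

# Overall RAM and swap usage, in GB
vm, sm = psutil.virtual_memory(), psutil.swap_memory()
print(f"RAM used: {vm.used / 1024 ** 3:.1f} GB ({vm.percent}%), "
      f"swap used: {sm.used / 1024 ** 3:.1f} GB ({sm.percent}%)")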

lhoestq commented 3 years ago

Thanks for sharing your results. So you still had the issue when fine-tuning? And the issue still appears with a bare-bones dataset loaded from an arrow file...

gaceladri commented 3 years ago

Yes, in both cases: fine-tuning a pre-trained model and pre-training from scratch with a local, already pre-processed arrow file.