Open leethu2012 opened 3 years ago
Not sure what could cause that on the datasets side. Could this be a Trainer issue? cc @julien-c @sgugger
There was a memory leak issue fixed recently in master. You should install from source and see if it fixes your problem.
@lhoestq @sgugger Thanks for your comments. I have installed from source as you suggested, but the problem is still there. To reproduce the issue, just replace these lines with the following (load_dataset and DataCollatorForDatasetsLanguageModeling as mentioned above):
dataset = load_dataset("bookcorpus")
dataset = dataset.train_test_split(test_size=0.1)
train_dataset = dataset['train']
eval_dataset = dataset['test'] if training_args.do_eval else None
data_collator = DataCollatorForDatasetsLanguageModeling(
    tokenizer=tokenizer,
    mlm=data_args.mlm,
    mlm_probability=data_args.mlm_probability,
    block_size=data_args.block_size
)
and run by:
python run_language_modeling.py \
--output_dir=output \
--model_type=bert \
--model_name_or_path=bert-base-uncased \
--do_train \
--do_eval \
--mlm
Same here. I am pre-training on wikitext-103 as a test. At the end of training it takes 32GB of RAM + ~30GB of swap. I installed datasets==1.1.0, not built from source. I will try uninstalling and building from source when it finishes.
This seems to be on the transformers library side.
If you have more information (pip env) or, even better, a colab reproducing the error, we can investigate.
It seems to be solved with fresh versions of transformers. I tried to replicate the error with a fresh pip install of transformers and datasets on colab, and the error no longer occurs. On colab, memory usage stays stable at 5GB! (Y)
Edit: Thanks for your great work. Have a good day.
@gaceladri which versions of transformers and datasets are you using now? I want to try again. Thanks.
transformers==3.3.1 datasets==1.1.0 tokenizers==0.8.1rc2
doing some modifications to mobilebert https://colab.research.google.com/drive/1ba09ZOpyHGAOQLcsxiQAHRXl10qnMU5o?usp=sharing
It does not happen to me anymore. Can we close? @leethu2012
It's happening to me again. After 4 hours of pre-training, my RAM fills up and the kernel dies. I am using the latest transformers version as of today, 4.4.0, and the latest version of datasets, 1.2.1, both installed from master. The memory consumption keeps increasing.
It looks like it is something from pytorch/python itself :face_with_head_bandage: https://github.com/pytorch/pytorch/issues/13246
Thanks for the investigation @gaceladri
Apparently this happens when num_workers>0 and has to do with objects being copied-on-write.
Did you try setting num_workers to 0 @gaceladri ?
If the issue doesn't happen with num_workers=0 then this would confirm that it's indeed related to this python/pytorch issue.
Since a Dataset object is a wrapper of a pyarrow Table, we should investigate if the data being copied comes from the Table itself or from metadata in the Dataset object. If it comes from the metadata in the Dataset object, we should be able to implement a workaround. But if it comes from the Table, we'll need to see with the pyarrow team what we can do...
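To make the copy-on-write point concrete: CPython keeps each object's reference count inside the object itself, so merely taking a new reference to an object writes to its memory. In a forked worker, a write like that is what forces the OS to copy the page, which appears to be the mechanism discussed in the linked PyTorch issue. A minimal standard-library sketch of the refcount side only (it does not fork or measure pages, and the variable names are purely illustrative):

```python
import sys

# A plain Python object, standing in for one item of an in-memory dataset.
sample = ["some", "tokens"]

before = sys.getrefcount(sample)

# Holding another reference (as a worker does when it reads an item) bumps
# the count stored in the object's header, i.e. it writes to that memory.
alias = sample
after = sys.getrefcount(sample)

print(before, after)  # `after` is one higher than `before`
```

Whether this is what happens with a Dataset depends on how much of its state lives in Python objects rather than in the Arrow table, which is exactly the open question above.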
@lhoestq I have tried, and it also keeps increasing with dataloader_num_workers=0
Hmmm so this might come from another issue... Since it doesn't seem to be related to multiprocessing it should be easier to investigate though. Do you have some ideas @gaceladri ?
@lhoestq I had a quick look at a bug I had previously spotted in my env, in wandb/sdk/interface/interface.py, because sometimes when I load the dataset I get a multiprocessing error at line 510 in wandb...interface.py
This bug is reported here https://github.com/huggingface/datasets/issues/847
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<timed eval> in <module>
~/anaconda3/envs/tfm/lib/python3.6/site-packages/transformers/trainer.py in train(self, model_path, trial)
877 print(len(epoch_iterator))
878
--> 879 for step, inputs in enumerate(epoch_iterator):
880
881 start_step = time.time()
~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
433 if self._sampler_iter is None:
434 self._reset()
--> 435 data = self._next_data()
436 self._num_yielded += 1
437 if self._dataset_kind == _DatasetKind.Iterable and \
~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _next_data(self)
1083 else:
1084 del self._task_info[idx]
-> 1085 return self._process_data(data)
1086
1087 def _try_put_index(self):
~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_data(self, data)
1109 self._try_put_index()
1110 if isinstance(data, ExceptionWrapper):
-> 1111 data.reraise()
1112 return data
1113
~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/_utils.py in reraise(self)
426 # have message field
427 raise self.exc_type(message=msg)
--> 428 raise self.exc_type(msg)
429
430
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1083, in __getitem__
format_kwargs=self._format_kwargs,
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1070, in _getitem
format_kwargs=format_kwargs,
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 886, in _convert_outputs
v = map_nested(command, v, **map_nested_kwargs)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/utils/py_utils.py", line 216, in map_nested
return function(data_struct)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 847, in command
return torch.tensor(x, **format_kwargs)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/warnings.py", line 101, in _showwarnmsg
_showwarnmsg_impl(msg)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/warnings.py", line 30, in _showwarnmsg_impl
file.write(text)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
cb(name, data)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/wandb_run.py", line 729, in _console_callback
self._backend.interface.publish_output(name, data)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 186, in publish_output
self._publish_output(o)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 191, in _publish_output
self._publish(rec)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 510, in _publish
if self._process and not self._process.is_alive():
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
My workaround was to just comment out those lines without looking too much into the consequences:
def _publish(self, record: pb.Record, local: bool = None) -> None:
    # if self._process and not self._process.is_alive():
    #     raise Exception("The wandb backend process has shutdown")
It worked so far... I need to try running without wandb and see if it could be causing something wrong with multiprocessing. I am going to try to launch the training setting wandb to false and I will let you know again.
@lhoestq But despite this, I got lost into the class Dataset() reading the pyarrow files.
Edit: but you are probably right that it is not related to multiprocessing, since it keeps happening when num_workers=0
Or maybe wandb uses multiprocessing ? One process for wandb logging and one for actual training ? If this is the case then even setting num_workers=0 would cause the process to be forked for wandb and therefore cause the memory issue.
@lhoestq could be, but if we set wandb to false this should not happen. I am going to try.
@lhoestq It keeps happening. I have uninstalled wandb from my env, set %env WANDB_DISABLED=true in my notebook, and commented out the wandb part of this function:
def get_available_reporting_integrations():
    integrations = []
    if is_azureml_available():
        integrations.append("azure_ml")
    if is_comet_available():
        integrations.append("comet_ml")
    if is_mlflow_available():
        integrations.append("mlflow")
    if is_tensorboard_available():
        integrations.append("tensorboard")
    # if is_wandb_available():
    #     integrations.append("wandb")
    return integrations
This was just a quick test, and the RAM usage keeps increasing. So wandb does not seem to be to blame here.
Thanks for checking @gaceladri . Let's investigate the single process setting then. If you have some sort of colab notebook with a minimal code example that shows this behavior feel free to share it @gaceladri so that we can play around with it to find what causes this. Otherwise I'll probably try to reproduce on my side at one point
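For the single-process case, one way to narrow things down is the standard-library tracemalloc module, which reports how much memory the Python allocator currently holds. The sketch below is only an illustration under stated assumptions: `traced_growth`, `steady_batches` and `leaky_batches` are hypothetical names, the generators stand in for a real dataloader, and tracemalloc only sees Python-level allocations, so growth inside native code (Arrow buffers, torch tensors) would not show up here.

```python
import tracemalloc


def traced_growth(make_batches, steps=200, every=50):
    """Iterate `steps` batches; return Python-heap sizes (bytes) sampled every `every` steps."""
    tracemalloc.start()
    samples = []
    try:
        for step, _batch in enumerate(make_batches()):
            if step % every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                samples.append(current)
            if step + 1 >= steps:
                break
    finally:
        tracemalloc.stop()
    return samples


def steady_batches():
    # Each batch is dropped after use, so traced memory should stay flat.
    while True:
        yield [0] * 1000


_kept = []


def leaky_batches():
    # Hypothetical bug: every batch is also kept alive in a global list.
    while True:
        batch = [0] * 1000
        _kept.append(batch)
        yield batch


steady = traced_growth(steady_batches)
leaky = traced_growth(leaky_batches)
print("steady growth (bytes):", steady[-1] - steady[0])
print("leaky growth (bytes):", leaky[-1] - leaky[0])
```

If a real training loop shows flat traced memory while the process RSS still climbs, that would point away from Python objects and towards native allocations.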
@lhoestq sure. Here you go: https://colab.research.google.com/drive/1ba09ZOpyHGAOQLcsxiQAHRXl10qnMU5o?usp=sharing Let me know if the link works and whether it reproduces the issue. For me it does: once training starts, the RAM usage keeps increasing.
Let me know. Thanks!
Could the bug be coming from tokenizers?
I got this warning in the terminal from my jupyter notebook:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
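The second option in the warning can be applied from Python before the tokenizer is first used. A minimal sketch; choosing "false" here is just one of the two values the message allows, and it only addresses the warning, not necessarily the memory growth discussed in this thread:

```python
import os

# Set this before `tokenizers` / `transformers` is first used in the process,
# so that any later fork (e.g. DataLoader workers) sees it.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```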
I've never experienced memory issues with tokenizers, so I don't know. Cc @n1t0, are you aware of any issue that would cause memory to keep increasing when the tokenizer is used in the Data Collator for language modeling?
@lhoestq Thanks for pointing to n1t0. Just to clarify: that warning appeared while fine-tuning, without a collator:
from datasets import load_dataset, load_metric
import numpy as np
GLUE_TASKS = [
    "cola",
    "mnli",
    "mnli-mm",
    "mrpc",
    "qnli",
    "qqp",
    "rte",
    "sst2",
    "stsb",
    "wnli",
]
task = "mnli"
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric("glue", actual_task)
batch_size = 16
attention_type = "linear"
from transformers.models.mobilebert_mod import (
    MobileBertForSequenceClassification,
    MobileBertTokenizerFast,
)
from transformers.models.mobilebert_mod.configuration_mobilebert import (
    MobileBertConfigMod,
)
from transformers import TrainingArguments, Trainer
num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
tokenizer = MobileBertTokenizerFast.from_pretrained(
    "/media/ad/00b5422b-9d54-4449-8b5d-08eab5cdac8c/training_trfm/big_linear_layerdrop_shared/checkpoint-23000/",
    max_len=512,
)
model = MobileBertForSequenceClassification.from_pretrained(
    "/media/ad/00b5422b-9d54-4449-8b5d-08eab5cdac8c/training_trfm/big_linear_layerdrop_shared/checkpoint-23000/",
    num_labels=num_labels,
)
print(model.num_parameters())
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)
encoded_dataset = dataset.map(preprocess_function, batched=True)
metric_name = (
    "pearson"
    if task == "stsb"
    else "matthews_correlation"
    if task == "cola"
    else "accuracy"
)
args = TrainingArguments(
    f"test-glue/{task}_{attention_type}",
    evaluation_strategy="steps",
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_steps=200,
    num_train_epochs=5,
    gradient_accumulation_steps=1,
    warmup_steps=10000,
    fp16=True,
    dataloader_num_workers=10,
    weight_decay=0.1,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
validation_key = (
    "validation_mismatched"
    if task == "mnli-mm"
    else "validation_matched"
    if task == "mnli"
    else "validation"
)
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
Now I have gone back to pre-training. The changes I think I have made are: I am no longer formatting the dataset to torch (so maybe some column is dropped and not frozen in memory), and I have not set any validation dataset in the trainer. The formatting call in question:
big_dataset.set_format(type='torch', columns=["text", "input_ids", "attention_mask", "token_type_ids"])
My validation dataset before:
book_corpus_eval = load_dataset(
    "bookcorpus",
    "plain_text",
    cache_dir="/home/ad/Desktop/bookcorpus",
    split="train[98:99%]",
)
book_corpus_eval = book_corpus_eval.map(encode, batched=True)
book_corpus_eval.set_format(
    type="torch", columns=["text", "input_ids", "attention_mask", "token_type_ids"]
)
**book_corpus_eval = book_corpus_eval.select([i for i in range(1500)])**
Maybe selecting or indexing the dataset before feeding it to the trainer does something strange.
My trainer now:
big_dataset = load_from_disk("/home/ad/Desktop/35percent_data.arrow/")
from transformers import DataCollatorForWholeWordMask
data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./big_linear_layerdrop_shared_silu_secondtry",
    overwrite_output_dir=True,
    per_device_train_batch_size=60,
    per_device_eval_batch_size=60,
    save_steps=500,
    save_total_limit=10,
    logging_first_step=True,
    logging_steps=100,
    # evaluation_strategy='steps',
    # eval_steps=250,
    gradient_accumulation_steps=8,
    fp16=True,
    dataloader_num_workers=10,
    warmup_steps=15000,
    learning_rate=6e-4,
    adam_epsilon=1e-6,
    adam_beta2=0.98,
    weight_decay=0.01,
    max_grad_norm=1.0,
    max_steps=500000,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=big_dataset,
    # eval_dataset=book_corpus_eval,
    tokenizer=tokenizer)
import wandb
wandb.login()
trainer.train()
Surprisingly, the RAM usage now keeps going up and down. The training has been running for 12h without exhausting the RAM. I don't know what could cause the leak. :mag:
Edit: I had not looked at the swap memory, which keeps increasing. So the problem persists.
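Since RAM alone turned out to be misleading here, it may help to log resident memory and swap together. The sketch below is a Linux-only illustration that reads the VmRSS and VmSwap fields of /proc/self/status; `parse_proc_status` and `read_self_memory` are hypothetical helper names, and it reports only the current process, so DataLoader worker processes would need to be checked separately.

```python
import os


def parse_proc_status(text):
    """Return the VmRSS / VmSwap values (in kB) found in a /proc/<pid>/status dump."""
    wanted = {"VmRSS", "VmSwap"}
    usage = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            usage[key] = int(rest.split()[0])  # the kernel reports these in kB
    return usage


def read_self_memory(path="/proc/self/status"):
    """Read this process's RSS and swap in kB; empty dict where /proc is unavailable."""
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        return parse_proc_status(handle.read())


# Made-up sample in the /proc status layout, so the parser can be tried anywhere.
sample = "Name:\tpython\nVmRSS:\t  123456 kB\nVmSwap:\t    2048 kB\n"
print(parse_proc_status(sample))
print(read_self_memory())
```

Calling `read_self_memory()` every few hundred training steps and logging both numbers would show whether RSS is flat only because pages are being pushed out to swap.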
Thanks for sharing your results. So you still had the issue for fine-tuning ? And the issue still appears with a bare-bone dataset from an arrow file...
Yes, in both cases: fine-tuning a pre-trained model, and pre-training from scratch with a local arrow file that was already pre-processed.
I tried to pretrain Longformer using transformers and datasets. But I got OOM issues with loading a large text file. My script is almost like this:
This train.txt is about 1.1GB and has 90k lines, where each line is a sequence of 4k words. During training, the memory usage increased quickly, as shown in the following graph, and resulted in OOM before training finished.
Could you please give me any suggestions on why this happened and how to fix it? Thanks.
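The script itself is not shown above, so the following is only a guess at one common cause: holding every line (or every tokenized line) of the file in Python lists. A hedged alternative is to keep just the byte offset of each line and read a single line on demand. `LazyLineDataset` below is a hypothetical class written for illustration, not something provided by transformers or datasets, and tokenization is assumed to happen later, per item or in the collator.

```python
import os
import tempfile


class LazyLineDataset:
    """Hypothetical map-style dataset: stores byte offsets, reads one line per access."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        with open(path, "rb") as handle:
            while True:
                position = handle.tell()
                if not handle.readline():
                    break  # end of file
                self.offsets.append(position)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Reopen per access so the object holds no file handle between calls.
        with open(self.path, "rb") as handle:
            handle.seek(self.offsets[index])
            return handle.readline().decode("utf-8").rstrip("\n")


# Tiny stand-in for train.txt so the sketch runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first long sequence\nsecond long sequence\nthird\n")

lines = LazyLineDataset(tmp.name)
count = len(lines)
second = lines[1]
print(count, second)
os.remove(tmp.name)
```

For a file with 90k lines, the offset list is only 90k integers, so the resident cost of the dataset object stays small regardless of how long each line is.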