huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer runs out of memory when computing eval score #8476

Closed soufianeelalami closed 4 years ago

soufianeelalami commented 4 years ago

Environment info

Who can help

Trainer: @sgugger

Information

Model I am using (Bert, XLNet ...): Camembert

The problem arises when using:

The tasks I am working on is:

To reproduce

I am trying to fine-tune a CamemBERT model for an MLM task. This is the configuration I am using:

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    seed=92,
    output_dir='./results',          # output directory
    disable_tqdm=False,
    prediction_loss_only=False,
    num_train_epochs=3,              # total number of training epochs
    learning_rate=1e-4,
    evaluation_strategy='steps',
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    eval_steps = 25,
    logging_dir='./logs',            # directory for storing logs
    logging_steps=5,
)

data_collator = DataCollatorForLanguageModeling(tokenizer=TOKENIZER, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=MODEL,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics = compute_metrics
)

Steps to reproduce the behavior:

  1. Load a train and validation dataset.
  2. Define a compute_metrics function for evaluation.
  3. Evaluation works at first, but it then raises RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 57680691200 bytes. Error code 12 (Cannot allocate memory) when the nested_concat function is called inside the prediction_loop:
/usr/local/lib/python3.6/dist-packages/transformers/trainer.py in prediction_loop(self, dataloader, description, prediction_loss_only)
   1420                 losses_host = losses if losses_host is None else torch.cat((losses_host, losses), dim=0)
   1421             if logits is not None:
-> 1422                 preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100)
   1423             if labels is not None:
   1424                 labels_host = labels if labels_host is None else nested_concat(labels_host, labels, padding_index=-100)

/usr/local/lib/python3.6/dist-packages/transformers/trainer_pt_utils.py in nested_concat(tensors, new_tensors, padding_index)
     84         return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
     85     elif isinstance(tensors, torch.Tensor):
---> 86         return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)
     87     elif isinstance(tensors, np.ndarray):
     88         return numpy_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)

/usr/local/lib/python3.6/dist-packages/transformers/trainer_pt_utils.py in torch_pad_and_concatenate(tensor1, tensor2, padding_index)
     52 
     53     # Now let's fill the result tensor
---> 54     result = tensor1.new_full(new_shape, padding_index)
     55     result[: tensor1.shape[0], : tensor1.shape[1]] = tensor1
     56     result[tensor1.shape[0] :, : tensor2.shape[1]] = tensor2

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 57680691200 bytes. Error code 12 (Cannot allocate memory)

The machine I am using has 120 GB of RAM.

The data contains 20355 sentences, each with fewer than 200 words, so the dataset fits easily in RAM. The subset used for evaluation contains 4057 examples with the same structure as the training dataset.

Expected behavior

It seems that setting prediction_loss_only=True avoids the problem, since evaluation then computes only the loss and no metrics, which requires far less RAM. The downside, obviously, is that you don't get any evaluation metrics.
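For illustration, that workaround is just the flag from the configuration above flipped to True (other arguments omitted):

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='steps',
    eval_steps=25,
    prediction_loss_only=True,   # only the eval loss is computed, so logits are never accumulated
)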

The Trainer should be able to handle the workload as evaluation progresses. Perhaps clearing heavy intermediate variables during the evaluation loop would help avoid blowing up RAM by storing values that are too large.

sgugger commented 4 years ago

I'm not sure what the bug is: by requiring the complete predictions for your compute_metrics function, you are asking for an array of 4,057 by 200 by vocab_size (which for the base CamemBERT model is 30,522 I believe). This does not fit easily in RAM.
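As a rough back-of-the-envelope check (assuming float32 logits and the 30,522-token vocabulary mentioned above):

# Approximate size of the full predictions array requested by compute_metrics
n_examples, seq_len, vocab_size = 4057, 200, 30522
bytes_per_float32 = 4
total_bytes = n_examples * seq_len * vocab_size * bytes_per_float32
print(f"{total_bytes / 1e9:.1f} GB")  # ~99.1 GB, before any temporary copies made during concatenation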

soufianeelalami commented 4 years ago

Is there another way to compute the metrics (or an estimate of them) without having to build such a huge array?

sgugger commented 4 years ago

You haven't shared what metric you are using so I have no idea.

soufianeelalami commented 4 years ago

This is the function I'm using:

import numpy as np
from typing import Dict
from sklearn.metrics import precision_recall_fscore_support
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction) -> Dict:
    # p.predictions holds the raw logits, p.label_ids the target token ids
    preds = np.argmax(p.predictions, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        p.label_ids.flatten(), preds.flatten(), average='weighted', zero_division=0
    )
    return {
        'accuracy': (preds == p.label_ids).mean(),
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

sgugger commented 4 years ago

I guess you could write your own custom loop that stores the predictions after the argmax; that won't blow up memory the same way.

soufianeelalami commented 4 years ago

Great, thanks a lot for the tip!

I'll mark the issue as closed.

gphillips-ema commented 3 years ago

@soufianeelalami Did you come up with a solution for this issue? Our team has run into the same issue with nested_concat while evaluating on a fairly large dataset.

selalami commented 3 years ago

@gphillips-ema Hello, basically what you need to do is create your own trainer class (which inherits from the Trainer class) and override the prediction_loop method to change one particular behavior:

# inside the overridden prediction_loop, replacing the original line
# preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100)
if logits is not None:
    logits_reduced = np.argmax(logits, axis=-1)
    preds_host = logits_reduced if preds_host is None else nested_concat(preds_host, logits_reduced, padding_index=-100)

You need to apply np.argmax(logits, axis=-1) to reduce the dimension of the output logits tensor before it is accumulated.

If you are using accumulation, then you need to make the same change in that part of the code (still in the prediction_loop method).

Please let me know if this solves your problem or if you need any help.
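A minimal sketch of the same idea, assuming a recent transformers version and a model that returns a single logits tensor; the class name ArgmaxTrainer is made up here. Instead of copying the whole prediction_loop, it reduces the logits right after each batch is predicted by overriding prediction_step:

import torch
from transformers import Trainer

class ArgmaxTrainer(Trainer):
    # Hypothetical subclass: keep only the predicted token ids per batch,
    # so tensors of shape (batch_size, seq_len) are accumulated instead of
    # (batch_size, seq_len, vocab_size).
    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        loss, logits, labels = super().prediction_step(
            model, inputs, prediction_loss_only, ignore_keys=ignore_keys
        )
        if logits is not None:
            logits = torch.argmax(logits, dim=-1)  # drop the vocab_size dimension
        return loss, logits, labels

With this, compute_metrics receives predictions that are already argmaxed, so the np.argmax call inside it should be dropped.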

malteos commented 3 years ago

I was facing a related issue with nested_concat that caused GPU memory errors. Using the Seq2SeqTrainer instead of the default Trainer solved the issue for me, since it does not rely on concatenating the logits over the vocabulary.
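For context, a rough sketch of that setup (assuming a seq2seq model, tokenizer, and datasets already loaded under those names; predict_with_generate makes evaluation collect generated token ids rather than full-vocabulary logits):

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir='./results',
    evaluation_strategy='steps',
    per_device_eval_batch_size=16,
    predict_with_generate=True,  # store generated ids during evaluation, not logits
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)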

Leonezz commented 2 years ago

Same issue. I have an A5000 GPU for training, but I can't even run evaluation with batch_size=8.

SudoerAli commented 6 months ago

Just reduce the per_device_eval_batch_size argument and set it to a lower value, for example per_device_eval_batch_size=2; that should prevent the issue.