huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

NLLB-200 Accelerate-based multi-GPU finetuning leads to 3x VRAM consumption as compared to single-GPU finetuning #26004

Closed molokanov50 closed 11 months ago

molokanov50 commented 1 year ago

System Info

Who can help?

@SunMarc

Information

Tasks

Reproduction

I run multi-GPU and, for comparison, single-GPU finetuning of NLLB-200-distilled-600M and NLLB-200-1.3B. For multi-GPU finetuning I always use 2x 24 GB GPUs (48 GB VRAM in total). I successfully finetuned NLLB-200-distilled-600M on a single 12 GB GPU, as well as NLLB-200-1.3B on a 40 GB GPU. Thus, the VRAM available in my multi-GPU configuration is obviously greater than in either single-GPU scenario. To my surprise, finetuning NLLB-200-distilled-600M on 2 GPUs occupied 30 GB of VRAM, which is 3 times more than the memory required for single-GPU finetuning. Moreover, finetuning NLLB-200-1.3B on 2 GPUs resulted in CUDA OOM, i.e., 48 GB of VRAM is insufficient for this finetuning, even though a single 40 GB GPU is sufficient. This seems very strange: with model parallelism, only part of the model resides on each GPU, so the memory used on each GPU should be less than in the single-GPU scenario.

My multi-GPU finetuning code:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.utils.data
from transformers import DataCollatorForSeq2Seq
import evaluate
import numpy as np
from argparse import ArgumentParser

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

modelPath = "facebook/nllb-200-distilled-600M"

tokenizer = AutoTokenizer.from_pretrained(modelPath)
model = AutoModelForSeq2SeqLM.from_pretrained(modelPath, device_map="auto")

parser = ArgumentParser()
parser.add_argument('--source-lang', type=str, default='eng_Latn')
parser.add_argument('--target-lang', type=str, default='rus_Cyrl')
parser.add_argument('--delimiter', type=str, default=';')
args = parser.parse_args()

dff = pd.read_csv('dataset/data.csv', sep=args.delimiter)

source = dff[args.source_lang].values.tolist()
target = dff[args.target_lang].values.tolist()

max_length = 512
X_train, X_val, y_train, y_val = train_test_split(source, target, test_size=0.2)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
y_train_tokenized = tokenizer(y_train, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
y_val_tokenized = tokenizer(y_val, padding=True, truncation=True, max_length=max_length, return_tensors="pt")

class ForDataset(torch.utils.data.Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, index):
        # the tokenizer already returned tensors (return_tensors="pt"), so index them directly
        input_ids = self.inputs["input_ids"][index].squeeze()
        target_ids = self.targets["input_ids"][index].squeeze()

        return {"input_ids": input_ids, "labels": target_ids}

train_dataset = ForDataset(X_train_tokenized, y_train_tokenized)
test_dataset = ForDataset(X_val_tokenized, y_val_tokenized)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="pt")

metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

training_args = Seq2SeqTrainingArguments(
    output_dir="mymodel",
    evaluation_strategy="epoch",
    save_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=20,
    predict_with_generate=True,
    load_best_model_at_end=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model('finalmodel')

The shell command used to run my code: python3 finetune.py --source-lang eng_Latn --target-lang rus_Cyrl --delimiter ';' data.csv
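For reference, a minimal sketch of how the per-GPU peak memory can be checked, assuming it is called right after trainer.train() in the script above (it only counts memory allocated through PyTorch's caching allocator, not the full nvidia-smi footprint):

import torch

# Print the peak VRAM allocated on each visible GPU during training.
def report_peak_memory():
    for device_id in range(torch.cuda.device_count()):
        peak_gib = torch.cuda.max_memory_allocated(device_id) / 1024 ** 3
        print(f"GPU {device_id}: peak allocated {peak_gib:.1f} GiB")

report_peak_memory()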

Expected behavior

Comparable (approximately equal) total VRAM consumption in the multi-GPU and single-GPU finetuning scenarios.

amyeroberts commented 1 year ago

cc @muellerzr @pacman100

SunMarc commented 1 year ago

Hi @molokanov50, thanks for reporting. I found out that the problem is specific to this model (loading it with a device_map consumes more VRAM than expected). Other models such as t5-small have comparable VRAM consumption in the multi-GPU and single-GPU fine-tuning scenarios. I'll try to fix that. If you find the issue, feel free to open a PR!
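A minimal sketch for narrowing this down, assuming two visible GPUs: load each model with device_map="auto", print the resulting hf_device_map, and compare the VRAM allocated per GPU right after loading (model names as used in this thread):

import torch
from transformers import AutoModelForSeq2SeqLM

# Load a model sharded across the available GPUs and report where each block
# was placed, plus the VRAM allocated on each GPU right after loading.
def inspect_device_map(model_name):
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="auto")
    print(model_name, model.hf_device_map)
    for d in range(torch.cuda.device_count()):
        print(f"  GPU {d}: {torch.cuda.memory_allocated(d) / 1024 ** 3:.2f} GiB allocated")
    del model
    torch.cuda.empty_cache()

inspect_device_map("facebook/nllb-200-distilled-600M")
inspect_device_map("t5-small")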

pacman100 commented 1 year ago

Hello @molokanov50, if the model fits on a single GPU, I would advise you to use DDP without the device_map for faster training, as it will use both GPUs all the time instead of the naive pipelining that device_map provides.
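For concreteness, a sketch of that change, assuming the script above is saved as finetune.py: load the model without device_map and launch one process per GPU; the Trainer then wraps the model in DistributedDataParallel automatically. Note that per_device_train_batch_size is per process, so the effective batch size doubles with two GPUs.

# In finetune.py, load the model on a single device instead of sharding it:
model = AutoModelForSeq2SeqLM.from_pretrained(modelPath)

# Launch one process per GPU; Seq2SeqTrainer detects the distributed environment
# set up by torchrun and handles DDP itself:
# torchrun --nproc_per_node=2 finetune.py --source-lang eng_Latn --target-lang rus_Cyrl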

molokanov50 commented 1 year ago

Hello @pacman100, DDP unfortunately doesn't suit my case, because my overall goal is to finetune an NLLB-200 model as large as NLLB-200-3.3B. I know from my experiments (see above) that single-GPU finetuning of NLLB-200-1.3B requires 35...40 GB of VRAM. This lets me estimate that finetuning NLLB-200-3.3B (3x the parameters) would need a single 105...120 GB GPU. We have no such GPU at the moment, so NLLB-200-3.3B cannot fit on any of the available ones; that is exactly the case where the model does not fit on a single GPU. Parallelizing a smaller model such as NLLB-200-1.3B over two GPUs that are each too small to hold it is therefore necessary and informative: it models the aforementioned case. Without this experiment, assembling a multi-GPU node with 120 GB of total VRAM for NLLB-200-3.3B makes no sense. We need to make sure that pipeline-parallelized NLLB-200 training can eventually consume about the same total VRAM as in the single-GPU case (maybe after some fixes).
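As a rough lower bound supporting this, assuming standard fp32 Adam finetuning (about 16 bytes per parameter for weights, gradients, and the two optimizer moments, activations excluded), the model states alone already rule out a single 40...48 GB GPU for the 3.3B model:

# Back-of-envelope: fp32 weights (4 B) + gradients (4 B) + Adam moments (8 B)
# = 16 bytes per parameter; activations and buffers come on top of this.
BYTES_PER_PARAM = 16
for name, n_params in [("NLLB-200-1.3B", 1.3e9), ("NLLB-200-3.3B", 3.3e9)]:
    gib = n_params * BYTES_PER_PARAM / 1024 ** 3
    print(f"{name}: ~{gib:.0f} GiB for model states alone")
# -> NLLB-200-1.3B: ~19 GiB, NLLB-200-3.3B: ~49 GiB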

molokanov50 commented 1 year ago

Hi @SunMarc, has there been any progress on fixing this problem?

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.