huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Significant Performance Discrepancy Between Single-GPU and Multi-GPU Training with BERT #32614

Open Mr-KenLee opened 1 month ago

Mr-KenLee commented 1 month ago

System Info

transformers >= 4.43.2
accelerate == 0.33.0
GPUs: 2 x A100

Who can help?

@ArthurZucker @muellerzr @SunMarc

Information

Tasks

Reproduction

I am currently using the transformers library (version >= 4.43.2) for training BERT models. I have observed a significant performance discrepancy when training with a single GPU compared to training with multiple GPUs. Specifically, the performance difference can reach up to 10 percentage points.

Additionally, I have noticed that this issue also occurs with older versions of the library, such as 4.42.1.

Expected behavior

The same performance between single-GPU and multi-GPU training.

Mr-KenLee commented 1 month ago

There is a UserWarning:

/root/miniconda3/lib/python3.11/site-packages/torch/autograd/graph.py:768: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [312, 312], strides() = [312, 1] bucket_view.sizes() = [312, 312], strides() = [1, 312] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:327.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

Could it be causing the problem above?
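
If the layout mismatch does matter, one workaround I have seen suggested (just a sketch, I have not verified it on this model) is to force every parameter to a contiguous layout before the Trainer/DDP wraps the model:

# Hypothetical workaround (not verified): make all parameters contiguous before
# DDP builds its gradient buckets, so the gradients follow the expected layout.
for name, param in model.named_parameters():
    if not param.data.is_contiguous():
        param.data = param.data.contiguous()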

amyeroberts commented 1 month ago

Hi @Mr-KenLee, thanks for raising an issue!

Could you share a minimal reproducible snippet that can replicate this issue?

Regarding the warning - a full traceback would be needed to try and diagnose the root of the issue.

> Specifically, the performance difference can reach up to 10 percentage points.

Is the difference always one-way, i.e. is performance always greater on the multi-GPU setup? What's the variance in performance across repeated runs on the same setup?

Mr-KenLee commented 3 weeks ago

Thank you for your response @amyeroberts. The performance with multiple GPUs is actually worse than with a single GPU. I suspect it might be due to the reasons mentioned in the warning, such as some parameters not being aligned. When I increase the number of epochs and the learning rate, the difference between multi-GPU and single-GPU runs decreases significantly. Here is my shell script for the multi-GPU run:

accelerate launch --config_file config.yaml train.py \
            --model_name_or_path "" \
            --trainset_path "" \
            --testset_path "" \
            --cache_dir "cache" \
            --model_type "bert" \
            --use_focalloss True \
            --max_seq_length 512 \
            --seed 42 \
            --learning_rate 1e-5 \
            --weight_decay 1e-3 \
            --warmup_ratio 0.1 \
            --max_grad_norm 1.0 \
            --gradient_accumulation_steps 1 \
            --gradient_checkpointing False \
            --num_train_epochs 5 \
            --logging_steps 100 \
            --logging_strategy "steps" \
            --logging_first_step \
            --per_device_train_batch_size 256 \
            --per_device_eval_batch_size 512 \
            --evaluation_strategy "steps" \
            --eval_steps 500 \
            --save_strategy "steps" \
            --save_steps 500 \
            --report_to "none" \
            --output_dir "output/" \
            --metric_for_best_model f1_score \
            --greater_is_better True \
            --save_total_limit 2 \
            --load_best_model_at_end True \
            --ddp_find_unused_parameters False

Mr-KenLee commented 3 weeks ago

And the accelerate config file is:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

amyeroberts commented 3 weeks ago

Thanks for sharing! Could you also share train.py? This might point to some reasons why there are differences in the multi-GPU case.

Good to know that increasing the number of epochs reduces this.

cc @muellerzr @SunMarc

muellerzr commented 3 weeks ago

What arguments is the single-GPU script being launched with? Generally you want to make sure the total batch size is the same: e.g. if we have bs=16 on 2x GPUs, that's an effective batch size of 32, which is what we need to try on the single GPU (and is also why more epochs would reduce this discrepancy; I'd imagine 2x the epochs would).
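
To make that concrete, a rough sketch of the bookkeeping (the 256 / 2 / 1 numbers come from the script above; the rest is illustrative arithmetic):

# effective (global) batch size under DDP:
#   per_device_train_batch_size * num_gpus * gradient_accumulation_steps
effective_bs_multi  = 256 * 2 * 1   # = 512 with the posted multi-GPU arguments
effective_bs_single = 256 * 1 * 1   # = 256 if the same arguments are reused on one GPU
# to match on a single GPU: per_device_train_batch_size=512, or
# per_device_train_batch_size=256 with gradient_accumulation_steps=2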

Mr-KenLee commented 3 weeks ago

@amyeroberts here is my train.py:

from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer,
                          EvalPrediction,
                          DataCollatorWithPadding,
                          Trainer,
                          HfArgumentParser,
                          TrainingArguments)

from utils import preprocess, DataCollatorForBERT, ModelArguments, DataArguments, LABELS, CustomTrainer
from sklearn.metrics import f1_score, classification_report

from datasets import load_dataset, DatasetDict
from accelerate import PartialState

import numpy as np
import os
import logging

# logging.basicConfig(level=logging.ERROR)

def main(model_args, data_args, training_args):

    model = AutoModelForSequenceClassification.from_pretrained(
        pretrained_model_name_or_path=model_args.model_name_or_path,
        num_labels=len(LABELS),
        problem_type="multi_label_classification",
        trust_remote_code=True
    )

    tokenizer = AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path=model_args.model_name_or_path,
        trust_remote_code=True
    )

    with PartialState().main_process_first():
        dataset = DatasetDict()
        dataset["train"] = load_dataset("json", data_files=data_args.trainset_path, cache_dir=data_args.cache_dir, split="train")
        dataset["test"] = load_dataset("json", data_files=data_args.testset_path, cache_dir=data_args.cache_dir, split="train")

        dataset = dataset.map(preprocess,
                              fn_kwargs={
                                  "tokenizer": tokenizer,
                                  "data_args": data_args
                              },
                              num_proc=4)

    def compute_metrics(eval_prediction: EvalPrediction):
        def sigmoid(x):
            return 1 / (1 + np.exp(-x))

        logits, labels = eval_prediction
        logits = sigmoid(logits)

        predictions = np.where(logits>=0.5, 1, 0)

        fine_report = classification_report(labels, predictions, zero_division=0, target_names=LABELS)
        fine_f1_score = f1_score(labels, predictions, average='macro', zero_division=0)

        labels = np.max(labels, axis=1)
        predictions = np.max(predictions, axis=1)
        coarse_report = classification_report(labels, predictions)

        if PartialState().is_local_main_process:
            print(fine_report, '\n')
            print(coarse_report, '\n\n')
            print()

        return {
           'f1_score': fine_f1_score
        }

    collator = DataCollatorForBERT(
        tokenizer=tokenizer,
        max_length=data_args.max_seq_length
    )

    trainer = CustomTrainer(
        model=model,
        args=training_args,
        data_collator=collator,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        use_focalloss=model_args.use_focalloss
    )

    trainer.train()
    trainer.save_model(os.path.join(training_args.output_dir, "best_model"))

if __name__ == "__main__":

    parser = HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    main(model_args, data_args, training_args)
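
As a side note, here is a quick sanity check (just a sketch, using the objects already defined above) that I could drop into main() to confirm the effective global batch size each setup actually trains with:

state = PartialState()
if state.is_local_main_process:
    # effective batch size = per-device batch * number of processes * grad accumulation
    effective_bs = (training_args.per_device_train_batch_size
                    * state.num_processes
                    * training_args.gradient_accumulation_steps)
    print(f"num_processes={state.num_processes}, effective train batch size={effective_bs}")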
Mr-KenLee commented 3 weeks ago

And here is utils.py:

import torch
from torch import nn, Tensor, LongTensor

from transformers import Trainer, BertTokenizer
from dataclasses import dataclass, field
from typing import Dict, List, Optional

from pypinyin import lazy_pinyin

LABELS = [""]

label2ids = {label: i for i, label in enumerate(LABELS)}
id2labels = {i: label for i, label in enumerate(LABELS)}

@dataclass
class DataArguments:
    trainset_path: Optional[str] = field(
        default=None,
        metadata={"help": "The preference trainset to use."},
    )
    testset_path: Optional[str] = field(
        default=None,
        metadata={"help": "The preference dataset to use."},
    )
    predict_file_name: Optional[str] = field(
        default="results.jsonl",
        metadata={"help": "The file name of the results."}
    )
    cache_dir: Optional[str] = field(
        default="cache",
        metadata={"help": "The file name of the cache."}
    )
    max_seq_length: Optional[int] = field(default=512)

@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    model_type: str = field(
        default="bert"
    )
    use_focalloss: bool = field(
        default=True
    )

class FocalLossWithBCE(nn.Module):
    def __init__(self, 
                 alpha: float = 0.25, 
                 gamma: float = 2,
                 reduce: str = "mean",
                 eps: float = 1e-7):
        """
        Focal Loss with Binary Cross Entropy (BCE) Implementation

        Args:
            alpha (float): Weighting factor to balance class frequencies. Defaults to 0.25.
            gamma (float): Focusing parameter to adjust the rate at which easy examples are down-weighted. Defaults to 2.
            reduce (str): Specifies the reduction to apply to the output: 'mean' | 'sum'. Defaults to 'mean'.
            eps (float): A small value to prevent division by zero. Defaults to 1e-7.
        """
        super().__init__()

        # Class weights
        self.class_weights = torch.Tensor([alpha, 1 - alpha])
        self.gamma = gamma
        self.reduce = reduce
        self.eps = eps

    def forward(self, input: Tensor, target: Tensor) -> Tensor:
        """
        Forward pass of the Focal Loss with BCE

        Args:
            input (Tensor): Input tensor from the model (raw scores/logits).
            target (Tensor): Ground truth labels.

        Returns:
            Tensor: Computed loss value.
        """
        if not isinstance(target, LongTensor):
            target = target.long()  # Convert to CELoss Format, (B, S)

        # Compute positive and negative class probabilities
        positive_input = torch.sigmoid(input)   # (B, S, 1)
        negative_input = 1 - positive_input     # (B, S, 1)

        input_probabilities = torch.stack((negative_input, positive_input), dim=-1)   # (B, S, 2)

        # Compute log probabilities
        log_input_probabilities = torch.log(input_probabilities + self.eps)

        # Compute class-specific probabilities and weights
        preds_prob = torch.gather(input_probabilities.view(-1, 2), dim=1, index=target.view(-1, 1))             # (BxS, 1)
        preds_log_prob = torch.gather(log_input_probabilities.view(-1, 2), dim=1, index=target.view(-1, 1))     # (BxS, 1)

        class_weights = self.class_weights.to(target.device).gather(0, target.view(-1))     # (BxS)

        # Compute focal loss
        loss = -torch.mul(torch.pow((1 - preds_prob), self.gamma), preds_log_prob) 
        loss = torch.mul(class_weights, loss.t())

        # Reduce loss according to specified method
        if self.reduce == "mean":
            loss = loss.mean()
        elif self.reduce == "sum":
            loss = loss.sum()

        return loss

class CustomTrainer(Trainer):
    def __init__(self, *args, use_focalloss=False, **kwargs):
        super().__init__(*args, **kwargs)
        self.focal_loss = FocalLossWithBCE()
        self.use_focalloss = use_focalloss

    def compute_loss(self, model, inputs, return_outputs=False):
        if self.use_focalloss and "labels" in inputs:
            labels = inputs.pop("labels")
        else:
            labels = None

        outputs = model(**inputs)
        if self.args.past_index >= 0:
            self._past = outputs[self.args.past_index]

        if labels is not None:
            logits = outputs.get("logits") if isinstance(outputs, dict) else outputs[1]
            loss = self.focal_loss(logits, labels)
        else:
            if isinstance(outputs, dict) and "loss" not in outputs:
                raise ValueError(
                    "The model did not return a loss from the inputs, only the following keys: "
                    f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
                )
            loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

        return (loss, outputs) if return_outputs else loss

@dataclass
class DataCollatorForBERT:
    tokenizer: BertTokenizer
    max_length: int

    def __call__(self, features, *args, **kwargs):

        batch = self.tokenizer.pad(
            features,
            padding=True,
            max_length=self.max_length,
            return_tensors="pt"
        )

        return batch

def preprocess(example, tokenizer:BertTokenizer, data_args:DataArguments):

    if isinstance(example, str):
        example = {
            "text": example
        }

    batch_encoded = tokenizer.encode_plus(text=example["text"],
                                          add_special_tokens=True,
                                          max_length=data_args.max_seq_length,
                                          truncation=True)

    labels = [0.0] * len(LABELS)

    if "labels" in example:

        if isinstance(example["labels"], list):
            targets = example["labels"]
        else:
            targets = example["labels"].replace("，", ",").split(",")  # normalize full-width commas before splitting

        for label in targets:
            if label in LABELS:
                label_id = label2ids[label]
                labels[label_id] = 1.0

    return {
        **batch_encoded,
        "labels": labels
    }
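
For reference, a minimal standalone check of FocalLossWithBCE (just a sketch; the shapes are made up, with 8 standing in for the real number of labels):

import torch
from utils import FocalLossWithBCE

loss_fn = FocalLossWithBCE()
logits = torch.randn(4, 8)                    # (batch_size, num_labels), random logits
labels = torch.randint(0, 2, (4, 8)).float()  # multi-hot float labels, as produced by preprocess()
print(loss_fn(logits, labels))                # scalar loss tensor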
Mr-KenLee commented 3 weeks ago

@muellerzr I understand, and I do the same. However, even when I forget to adjust the parameters, the effective batch size on 2x GPUs is simply twice that of a single GPU, yet in practice the single GPU still performs better than multiple GPUs. That seems unreasonable to me, so I don't think it's a batch-size issue.

Mr-KenLee commented 3 weeks ago

Regarding the earlier UserWarning ("Grad strides do not match bucket view strides ...") and whether it could cause the problem above: it's been a while since I last saw it. In the meantime I made many parameter changes and data adjustments, and when I tried to reproduce the issue the results looked normal and the warning no longer appeared. Thank you for your patience and support.