Open Mr-KenLee opened 1 month ago
There is a UserWarning:
/root/miniconda3/lib/python3.11/site-packages/torch/autograd/graph.py:768: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [312, 312], strides() = [312, 1]
bucket_view.sizes() = [312, 312], strides() = [1, 312] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:327.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Could it be the cause of the problem above?
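For context on what the warning reports, here is a small sketch in plain Python (no PyTorch calls; `row_major_strides` is a helper name invented for this illustration). It computes the strides a contiguous, row-major tensor of a given shape would have and compares them with the two layouts quoted in the message. Read this way, the gradient is row-major while the DDP bucket view has the transposed layout, which would be consistent with a weight that was a non-contiguous (for example, transposed) view when DDP was constructed. This is an interpretation of the message, not a confirmed diagnosis.

```python
def row_major_strides(sizes):
    """Strides (in elements) of a contiguous, row-major tensor with this shape."""
    strides = []
    step = 1
    for dim in reversed(sizes):
        strides.append(step)
        step *= dim
    return list(reversed(strides))


# Values copied from the warning text.
sizes = [312, 312]
grad_strides = [312, 1]
bucket_view_strides = [1, 312]

# Layout of a plain contiguous tensor with this shape.
contiguous = row_major_strides(sizes)
# Layout of a transposed view over a contiguous tensor of the reversed shape.
transposed = list(reversed(row_major_strides(list(reversed(sizes)))))

grad_is_contiguous = grad_strides == contiguous
bucket_is_transposed = bucket_view_strides == transposed

print("grad is row-major contiguous:", grad_is_contiguous)
print("bucket view is a transposed layout:", bucket_is_transposed)
```

The warning itself says this is not an error and may only impair performance (speed), so on its own it would not obviously explain a drop in F1.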
Hi @Mr-KenLee, thanks for raising an issue!
Could you share a minimal reproducible snippet that can replicate this issue?
Regarding the warning - a full traceback would be needed to try and diagnose the root of the issue.
> Specifically, the performance difference can reach up to 10 percentage points.
Is the difference always one-way, i.e. is performance always greater on the multi-GPU setup? What's the variance in performance across repeated runs on the same setup?
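One way to answer the variance question is to repeat each setup with several seeds and compare the gap between the mean scores with the run-to-run spread. The sketch below uses only the standard library; `gap_vs_noise` is a hypothetical helper, and the scores are made-up placeholders, not measurements from this issue.

```python
from statistics import mean, stdev


def gap_vs_noise(single_gpu_scores, multi_gpu_scores):
    """Return (difference in mean score, larger of the two per-setup std devs)."""
    gap = mean(single_gpu_scores) - mean(multi_gpu_scores)
    noise = max(stdev(single_gpu_scores), stdev(multi_gpu_scores))
    return gap, noise


# Placeholder F1 scores from three seeds per setup (NOT real results).
single = [0.80, 0.82, 0.81]
multi = [0.71, 0.70, 0.72]

gap, noise = gap_vs_noise(single, multi)
print(f"gap between means: {gap:.3f}, run-to-run std: {noise:.3f}")
```

If the gap is several times larger than the spread, the difference is unlikely to be seed noise alone.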
Thank you for your response @amyeroberts. The multi-GPU performance is actually worse than the single-GPU performance. I suspect it might be due to the cause mentioned in this warning, such as some parameters not being aligned. When I increase the number of epochs and the learning rate, the difference between multiple GPUs and a single GPU decreases significantly. Here is my shell script for the multi-GPU run:
accelerate launch --config_file config.yaml train.py \
--model_name_or_path "" \
--trainset_path "" \
--testset_path "" \
--cache_dir "cache" \
--model_type "bert" \
--use_focalloss True \
--max_seq_length 512 \
--seed 42 \
--learning_rate 1e-5 \
--weight_decay 1e-3 \
--warmup_ratio 0.1 \
--max_grad_norm 1.0 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing False \
--num_train_epochs 5 \
--logging_steps 100 \
--logging_strategy "steps" \
--logging_first_step \
--per_device_train_batch_size 256 \
--per_device_eval_batch_size 512 \
--evaluation_strategy "steps" \
--eval_steps 500 \
--save_strategy "steps" \
--save_steps 500 \
--report_to "none" \
--output_dir "output/" \
--metric_for_best_model f1_score \
--greater_is_better True \
--save_total_limit 2 \
--load_best_model_at_end True \
--ddp_find_unused_parameters False
And the accelerate config file is:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Thanks for sharing! Could you also share train.py? It might indicate why there are differences in the multi-GPU case.
Good to know that increasing the number of epochs reduces the gap.
cc @muellerzr @SunMarc
What arguments is the single-GPU script launched with? Generally you want to make sure the total batch size is the same: e.g. if we have bs=16 on 2x GPU, that's an effective BS of 32, which is what we need to try on the single GPU (this is also why more epochs would reduce the discrepancy; I'd imagine 2x the epochs).
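The batch-size arithmetic above can be sketched with the values from the launch script and accelerate config in this thread (`per_device_train_batch_size 256`, `num_processes: 2`, `gradient_accumulation_steps 1`). The helper names and the dataset size are hypothetical, and the step count is only a rough estimate of what the Trainer would actually run.

```python
import math


def effective_batch_size(per_device_batch_size, num_processes, gradient_accumulation_steps=1):
    """Number of examples consumed per optimizer step across all processes."""
    return per_device_batch_size * num_processes * gradient_accumulation_steps


def optimizer_steps_per_epoch(num_examples, effective_bs):
    """Approximate optimizer steps needed to see every example once."""
    return math.ceil(num_examples / effective_bs)


# Values taken from the shell script and accelerate config in this thread.
multi_gpu_bs = effective_batch_size(256, num_processes=2, gradient_accumulation_steps=1)
# Same per-device batch size on a single GPU.
single_gpu_bs = effective_batch_size(256, num_processes=1)

NUM_EXAMPLES = 100_000  # hypothetical dataset size, not stated in the issue

multi_gpu_steps = optimizer_steps_per_epoch(NUM_EXAMPLES, multi_gpu_bs)
single_gpu_steps = optimizer_steps_per_epoch(NUM_EXAMPLES, single_gpu_bs)

print(f"2x GPU: effective batch size {multi_gpu_bs}, ~{multi_gpu_steps} steps/epoch")
print(f"1x GPU: effective batch size {single_gpu_bs}, ~{single_gpu_steps} steps/epoch")
```

With an unchanged per-device batch size and learning rate, the 2x GPU run takes about half as many optimizer steps per epoch, which is one reason more epochs or a higher learning rate could narrow the gap.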
@amyeroberts here is my train.py
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer,
                          EvalPrediction,
                          DataCollatorWithPadding,
                          Trainer,
                          HfArgumentParser,
                          TrainingArguments)
from utils import preprocess, DataCollatorForBERT, ModelArguments, DataArguments, LABELS, CustomTrainer
from sklearn.metrics import f1_score, classification_report
from datasets import load_dataset, DatasetDict
from accelerate import PartialState
import numpy as np
import os
import logging

# logging.basicConfig(level=logging.ERROR)


def main(model_args, data_args, training_args):
    model = AutoModelForSequenceClassification.from_pretrained(
        pretrained_model_name_or_path=model_args.model_name_or_path,
        num_labels=len(LABELS),
        problem_type="multi_label_classification",
        trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path=model_args.model_name_or_path,
        trust_remote_code=True
    )

    with PartialState().main_process_first():
        dataset = DatasetDict()
        dataset["train"] = load_dataset("json", data_files=data_args.trainset_path, cache_dir=data_args.cache_dir, split="train")
        dataset["test"] = load_dataset("json", data_files=data_args.testset_path, cache_dir=data_args.cache_dir, split="train")
        dataset = dataset.map(preprocess,
                              fn_kwargs={
                                  "tokenizer": tokenizer,
                                  "data_args": data_args
                              },
                              num_proc=4)

    def compute_metrics(eval_prediction: EvalPrediction):
        def sigmoid(x):
            return 1 / (1 + np.exp(-x))

        logits, labels = eval_prediction
        logits = sigmoid(logits)
        predictions = np.where(logits >= 0.5, 1, 0)
        fine_report = classification_report(labels, predictions, zero_division=0, target_names=LABELS)
        fine_f1_score = f1_score(labels, predictions, average='macro', zero_division=0)
        labels = np.max(labels, axis=1)
        predictions = np.max(predictions, axis=1)
        coarse_report = classification_report(labels, predictions)
        if PartialState().is_local_main_process:
            print(fine_report, '\n')
            print(coarse_report, '\n\n')
            print()
        return {
            'f1_score': fine_f1_score
        }

    collator = DataCollatorForBERT(
        tokenizer=tokenizer,
        max_length=data_args.max_seq_length
    )
    trainer = CustomTrainer(
        model=model,
        args=training_args,
        data_collator=collator,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        use_focalloss=model_args.use_focalloss
    )
    trainer.train()
    trainer.save_model(os.path.join(training_args.output_dir, "best_model"))


if __name__ == "__main__":
    parser = HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    main(model_args, data_args, training_args)
and the utils.py is here:
import torch
from torch import nn, Tensor, LongTensor
from transformers import Trainer
from transformers import BertTokenizer
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from pypinyin import lazy_pinyin

LABELS = [""]
label2ids = {label: i for i, label in enumerate(LABELS)}
id2labels = {i: label for i, label in enumerate(LABELS)}


@dataclass
class DataArguments:
    trainset_path: Optional[str] = field(
        default=None,
        metadata={"help": "The preference trainset to use."},
    )
    testset_path: Optional[str] = field(
        default=None,
        metadata={"help": "The preference dataset to use."},
    )
    predict_file_name: Optional[str] = field(
        default="results.jsonl",
        metadata={"help": "The file name of the results."}
    )
    cache_dir: Optional[str] = field(
        default="cache",
        metadata={"help": "The file name of the cache."}
    )
    max_seq_length: Optional[int] = field(default=512)


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """
    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    model_type: str = field(
        default="bert"
    )
    use_focalloss: bool = field(
        default=True
    )


class FocalLossWithBCE(nn.Module):
    def __init__(self,
                 alpha: float = 0.25,
                 gamma: float = 2,
                 reduce: str = "mean",
                 eps: float = 1e-7):
        """
        Focal Loss with Binary Cross Entropy (BCE) Implementation

        Args:
            alpha (float): Weighting factor to balance class frequencies. Defaults to 0.25.
            gamma (float): Focusing parameter to adjust the rate at which easy examples are down-weighted. Defaults to 2.
            reduce (str): Specifies the reduction to apply to the output: 'mean' | 'sum'. Defaults to 'mean'.
            eps (float): A small value to prevent division by zero. Defaults to 1e-7.
        """
        super().__init__()
        # Class weights
        self.class_weights = torch.tensor([alpha, 1 - alpha])
        self.gamma = gamma
        self.reduce = reduce
        self.eps = eps

    def forward(self, input: Tensor, target: Tensor) -> Tensor:
        """
        Forward pass of the Focal Loss with BCE

        Args:
            input (Tensor): Input tensor from the model (raw scores/logits).
            target (Tensor): Ground truth labels.

        Returns:
            Tensor: Computed loss value.
        """
        if target.dtype != torch.long:
            target = target.long()  # Convert to CELoss Format, (B, S)
        # Compute positive and negative class probabilities
        positive_input = torch.sigmoid(input)  # (B, S, 1)
        negative_input = 1 - positive_input  # (B, S, 1)
        input_probabilities = torch.stack((negative_input, positive_input), dim=-1)  # (B, S, 2)
        # Compute log probabilities
        log_input_probabilities = torch.log(input_probabilities + self.eps)
        # Compute class-specific probabilities and weights
        preds_prob = torch.gather(input_probabilities.view(-1, 2), dim=1, index=target.view(-1, 1))  # (BxS, 1)
        preds_log_prob = torch.gather(log_input_probabilities.view(-1, 2), dim=1, index=target.view(-1, 1))  # (BxS, 1)
        class_weights = self.class_weights.to(target.device).gather(0, target.view(-1))  # (BxS)
        # Compute focal loss
        loss = -torch.mul(torch.pow((1 - preds_prob), self.gamma), preds_log_prob)
        loss = torch.mul(class_weights, loss.t())
        # Reduce loss according to specified method
        if self.reduce == "mean":
            loss = loss.mean()
        elif self.reduce == "sum":
            loss = loss.sum()
        return loss


class CustomTrainer(Trainer):
    def __init__(self, *args, use_focalloss=False, **kwargs):
        super().__init__(*args, **kwargs)
        self.focal_loss = FocalLossWithBCE()
        self.use_focalloss = use_focalloss

    def compute_loss(self, model, inputs, return_outputs=False):
        # Pop the labels so the model does not compute its own loss when focal loss is used.
        if self.use_focalloss and "labels" in inputs:
            labels = inputs.pop("labels")
        else:
            labels = None
        outputs = model(**inputs)
        if self.args.past_index >= 0:
            self._past = outputs[self.args.past_index]
        if labels is not None:
            # Without labels the model returns no loss, so in tuple outputs the logits come first.
            logits = outputs.get("logits") if isinstance(outputs, dict) else outputs[0]
            loss = self.focal_loss(logits, labels)
        else:
            if isinstance(outputs, dict) and "loss" not in outputs:
                raise ValueError(
                    "The model did not return a loss from the inputs, only the following keys: "
                    f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
                )
            loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
        return (loss, outputs) if return_outputs else loss


@dataclass
class DataCollatorForBERT:
    tokenizer: BertTokenizer
    max_length: int

    def __call__(self, features, *args, **kwargs):
        batch = self.tokenizer.pad(
            features,
            padding=True,
            max_length=self.max_length,
            return_tensors="pt"
        )
        return batch


def preprocess(example, tokenizer: BertTokenizer, data_args: DataArguments):
    if isinstance(example, str):
        example = {
            "text": example
        }
    batch_encoded = tokenizer.encode_plus(text=example["text"],
                                          add_special_tokens=True,
                                          max_length=data_args.max_seq_length,
                                          truncation=True)
    labels = [0.0] * len(LABELS)
    if "labels" in example:
        if isinstance(example["labels"], list):
            targets = example["labels"]
        else:
            targets = example["labels"].replace(",", ",").split(",")
        for label in targets:
            if label in LABELS:
                label_id = label2ids[label]
                labels[label_id] = 1.0
    return {
        **batch_encoded,
        "labels": labels
    }
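To make the loss in `FocalLossWithBCE` easier to check by hand, here is a scalar sketch in plain Python that mirrors the computation above for a single logit and a single binary target. `focal_bce_single` is a hypothetical helper written for this illustration, not part of utils.py. As in the class above, the weight is `alpha` for target 0 and `1 - alpha` for target 1.

```python
import math


def focal_bce_single(logit, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for one logit and one binary target (0 or 1)."""
    p_pos = 1.0 / (1.0 + math.exp(-logit))           # sigmoid(logit)
    p_t = p_pos if target == 1 else 1.0 - p_pos      # probability assigned to the true class
    weight = (1.0 - alpha) if target == 1 else alpha  # class_weights = [alpha, 1 - alpha]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t + eps)


uncertain_positive = focal_bce_single(0.0, 1)   # p_t = 0.5
uncertain_negative = focal_bce_single(0.0, 0)   # p_t = 0.5, smaller class weight
confident_positive = focal_bce_single(5.0, 1)   # p_t close to 1

print(uncertain_positive, uncertain_negative, confident_positive)
```

With `gamma=2`, a confidently correct prediction contributes far less than an uncertain one, so the mean loss and its gradients can be quite small; that may interact with the learning rate and the number of optimizer steps.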
@muellerzr Understood, and I do keep them the same. However, even when I forget to change the parameters, so that the effective batch size on 2x GPU is twice that of a single GPU, the single-GPU run still performs better than the multi-GPU run. That seems unreasonable to me, so I don't think it's a batch size issue.
It's been a while since I encountered this error. During this period, I made many parameter changes and data adjustments. When I tried to reproduce the issue, it seemed that the results were normal. Additionally, the previous error did not occur. Thank you for your patience and support.
System Info
transformers>=4.43.2
accelerate==0.33.0
GPUs: 2 x A100
Who can help?
@ArthurZucker @muellerzr @SunMarc
Reproduction
I am currently using the transformers library (version >= 4.43.2) for training BERT models. I have observed a significant performance discrepancy when training with a single GPU compared to training with multiple GPUs. Specifically, the performance difference can reach up to 10 percentage points.
Additionally, I have noticed that this issue persists in lower versions of the library, such as 4.42.1.
Expected behavior
The same performance on single-GPU and multi-GPU setups.