huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

How to use multiple GPUs #1903

Open Lihwnlp opened 5 days ago

Lihwnlp commented 5 days ago

System Info

peft=0.11.1 python=3.10

Who can help?

When I run this script on a single GPU, there is no problem. When I try to run it on 2 GPUs, the system monitor shows that each GPU is only about half utilized. When I try to increase per_device_train_batch_size and gradient_accumulation_steps, I run out of GPU memory. What should I do?

Reproduction

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    logging,
)
from peft import LoraConfig, peft_model, TaskType
from trl import SFTTrainer, SFTConfig
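# NOTE: seed, model_id, and n_epochs are defined in the full script
# (argparse and constants) posted later in this thread.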

# fix random sequence
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    # use_fast=False,
    add_eos_token=True,
    #trust_remote_code=True,
)
#tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "right"

# Generate Llama 3 instruction
def generate_supervised_chat(row):
    chat = [
        {   'role': 'system',
            'content': '你是一位优秀的翻译专家。请把给定的中文文本翻译为日语,只回复翻译后的文本。'},
        {   'role': 'user',
            'content': f'''请把下面的中文文本翻译为日语文本。
中文文本: {row["Ch"]}''' },
        {   'role': 'assistant',
            'content': f'''此文本翻译后的结果如下。
日语翻译文本: {row["Ja"]}
以上。'''},
     ]
    instruction = tokenizer.apply_chat_template(chat, tokenize=False)
    # instruction = instruction + "<|end_of_text|>"
    return instruction

def add_text(row):
    row['text'] = generate_supervised_chat(row)
    return row

# load dataset
jjs_dataset_dir = "wccjc-dataset"
dataset = load_dataset(
    jjs_dataset_dir,
    data_files={'train': 'train.tsv', 'test': 'test.tsv', 'valid': 'valid.tsv'},
    sep='\t',
    names=['Ch', 'Ja']
)

dataset = dataset["train"]
dataset = dataset.map(add_text)
print(dataset)
print(dataset[0]["text"])

# Quantization Config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16, # or float16
    bnb_4bit_use_double_quant=True,
)

import datetime

# Load pretrained model
now = datetime.datetime.now()
print('Loading base model:', model_id, now)
print('Train epochs:', n_epochs)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto", #{"": 0},
)
now = datetime.datetime.now()
print('Loading ended', now)
model.config.use_cache = False
model.config.pretraining_tp = 1

# LoRA Config
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM, # "CAUSUAL_LM",
    target_modules=["q_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "k_proj", "v_proj"],
)

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
print("per_device_train_batch_size:", per_device_train_batch_size)
print("gradient_accumulation_steps:", gradient_accumulation_steps)
# Training arguments
sft_config = SFTConfig(
    output_dir="./train_logs",
    fp16=True,
    seed=42,
    # max_steps=13200, # 300,
    num_train_epochs=n_epochs,
    per_device_train_batch_size=per_device_train_batch_size, #4,
    gradient_accumulation_steps=gradient_accumulation_steps, # 1,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    weight_decay=0.001,
    save_steps=1000, #25,
    logging_steps=25,
    group_by_length=True,
    report_to="tensorboard",
    max_seq_length=512, #None
    dataset_text_field="text",
)

# SFT arguments
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=lora_config,
    # args=training_arguments,
    args=sft_config,
    packing=False,
)

Expected behavior

Training runs on 2 GPUs, with both GPUs fully utilized.

tomekrut commented 5 days ago

Perhaps this is not the solution to your question, but some additional information. I posted my own issue yesterday on a slightly different topic, but I was evaluating a scenario similar to yours as well.

When I run my self-contained script on one or multiple GPUs, the memory utilization for the same model is as follows.

  • Single GPU - 32466MiB
  • Two GPUs - 26286MiB + 14288MiB = 40574MiB
  • So the ratio is roughly 25% overhead with 2 GPUs, since two copies of the optimizer data/gradients etc. are used.
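For context, a minimal sketch of how such per-GPU numbers can be logged from inside a training script with plain PyTorch (this is not code from the thread; note that nvidia-smi totals will be higher, since they also include the CUDA context and the allocator cache):

import torch

# peak memory PyTorch has allocated on each visible GPU, in MiB
for i in range(torch.cuda.device_count()):
    peak_mib = torch.cuda.max_memory_allocated(i) / 1024**2
    print(f"GPU {i}: peak allocated {peak_mib:.0f} MiB")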

Lihwnlp commented 5 days ago

Perhaps this is not the solution to your question, but some additional information. I posted my own issue yesterday on a slightly different topic, but I was evaluating a scenario similar to yours as well.

When I run my self-contained script on one or multiple GPUs, the memory utilization for the same model is as follows.

  • Single GPU - 32466MiB
  • Two GPUs - 26286MiB + 14288MiB = 40574MiB
  • So the ratio is roughly 25% overhead with 2 GPUs, since two copies of the optimizer data/gradients etc. are used.

In my case, when I use one GPU, training takes 3 hours. However, using two GPUs actually increases the time, to 12 hours. I am confused about this 😵

BenjaminBossan commented 5 days ago

You did not write what form of parallelism you are using: FSDP, DeepSpeed, DDP? Are you using accelerate? What are the configs, and how do you launch the scripts?

Lihwnlp commented 5 days ago

You did not write what form of parallelism you are using: FSDP, DeepSpeed, DDP? Are you using accelerate? What are the configs, and how do you launch the scripts?

import argparse
parser = argparse.ArgumentParser()
parser.add_argument(
    '-e', '--n_epochs',
    type=int,
    help="number of epochs",
    default=1,
)
args = parser.parse_args()
n_epochs = int(args.n_epochs)

# random seed
seed = 42

# model
model_id = "elyza/ELYZA-japanese-Llama-2-7b-fast-instruct"
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_id = "rinna/llama-3-youko-8b"
model_id = "nk2t/Llama-3-8B-Instruct-japanese-nk2t-v0.3"

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    logging,
)
from peft import LoraConfig, peft_model, TaskType
from trl import SFTTrainer, SFTConfig

# fix random sequence
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    # use_fast=False,
    add_eos_token=True,
    #trust_remote_code=True,
)
#tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "right"

# Generate Llama 3 instruction
def generate_supervised_chat(row):
    chat = [
        {   'role': 'system',
            'content': '你是一位优秀的翻译专家。请把给定的中文文本翻译为日语,只回复翻译后的文本。'},
        {   'role': 'user',
            'content': f'''请把下面的中文文本翻译为日语文本。
中文文本: {row["Ch"]}''' },
        {   'role': 'assistant',
            'content': f'''此文本翻译后的结果如下。
日语翻译文本: {row["Ja"]}
以上。'''},
    ]
    instruction = tokenizer.apply_chat_template(chat, tokenize=False)
    # instruction = instruction + "<|end_of_text|>"
    return instruction

def add_text(row):
    row['text'] = generate_supervised_chat(row)
    return row

# load dataset
jjs_dataset_dir = "wccjc-dataset"
dataset = load_dataset(
    jjs_dataset_dir,
    data_files={'train': 'trainall.tsv', 'test': 'test.tsv', 'valid': 'valid.tsv'},
    sep='\t',
    names=['Ch', 'Ja']
)

dataset = dataset["train"]
dataset = dataset.map(add_text)
print(dataset)
print(dataset[0]["text"])

# Quantization Config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16, # or float16
    bnb_4bit_use_double_quant=True,
)

import datetime

# Load pretrained model
now = datetime.datetime.now()
print('Loading base model:', model_id, now)
print('Train epochs:', n_epochs)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto", #{"": 0},
)
now = datetime.datetime.now()
print('Loading ended', now)
model.config.use_cache = False
model.config.pretraining_tp = 1

# LoRA Config
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM, # "CAUSUAL_LM",
    target_modules=["q_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "k_proj", "v_proj"],
)

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
print("per_device_train_batch_size:", per_device_train_batch_size)
print("gradient_accumulation_steps:", gradient_accumulation_steps)

# Training arguments
sft_config = SFTConfig(
    output_dir="./train_logs",
    fp16=True,
    seed=42,
    # max_steps=13200, # 300,
    num_train_epochs=n_epochs,
    per_device_train_batch_size=per_device_train_batch_size, #4,
    gradient_accumulation_steps=gradient_accumulation_steps, # 1,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    weight_decay=0.001,
    save_steps=1000, #25,
    logging_steps=25,
    group_by_length=True,
    report_to="tensorboard",
    max_seq_length=512, #None
    dataset_text_field="text",
)

# SFT arguments
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=lora_config,
    # args=training_arguments,
    args=sft_config,
    packing=False,
)

import datetime
now = datetime.datetime.now()

# Start training
print('training...', now)
trainer.train()
now = datetime.datetime.now()
print('training ended', now)
print('saving model')
trainer.save_model(f'./jjs_llama3_lora_model-2x3-ep{n_epochs}')

This is all my code. With device_map="auto", can't I enable multiple GPUs?

BenjaminBossan commented 5 days ago

So are you running your script just with python train.py? That is not sufficient for parallelism. I would recommend using accelerate and checking the options there. Note that you don't need to explicitly create the accelerator instance etc., since SFTTrainer already takes care of that. But you still need to choose your parallelism strategy (DDP, FSDP, DeepSpeed), configure accelerate accordingly, and then run accelerate launch train.py.
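For reference, a minimal sketch of such a launch with plain DDP on 2 GPUs (train.py and the --n_epochs flag come from the script above; the exact answers given during accelerate config are only an assumption, not a recommendation specific to this setup):

# one-time interactive setup: choose multi-GPU, 2 processes, no DeepSpeed/FSDP for plain DDP
accelerate config

# then launch one process per GPU instead of plain "python train.py"
accelerate launch --num_processes 2 train.py --n_epochs 1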

Lihwnlp commented 4 days ago

So are you running your script just with python train.py? That is not sufficient for parallelism. I would recommend using accelerate and checking the options there. Note that you don't need to explicitly create the accelerator instance etc., since SFTTrainer already takes care of that. But you still need to choose your parallelism strategy (DDP, FSDP, DeepSpeed), configure accelerate accordingly, and then run accelerate launch train.py.

Thank you.