Lihwnlp opened 5 days ago
Perhaps this is not a solution to your question, but some additional information. I posted my own issue yesterday on a slightly different topic, but I was evaluating a scenario similar to yours.
When I run my self-contained script on one or multiple GPUs, the memory utilization for the same model is as follows.
- Single GPU: 32466 MiB
- Two GPUs: 26286 MiB + 14288 MiB = 40574 MiB
- So that is roughly 25% overhead with 2 GPUs, because two copies of the optimizer state, gradients, etc. are kept.
And when I use one GPU, training takes 3 hours. However, using two GPUs actually increases the time, to 12 hours. I am confused about this 😵
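For reference, here is a minimal sketch of how per-GPU memory numbers like those above can be collected from inside the script (an illustration, not code from this thread; it assumes PyTorch is installed, and `torch.cuda.memory_allocated` only counts tensors allocated by the current process):

```python
import torch

def gpu_memory_report():
    """Return currently allocated memory (in MiB) per visible CUDA device.

    Returns an empty dict on a CPU-only machine.
    """
    report = {}
    for i in range(torch.cuda.device_count()):
        mib = torch.cuda.memory_allocated(i) / (1024 ** 2)
        report[f"cuda:{i}"] = round(mib, 1)
    return report

print(gpu_memory_report())
```

Tools such as `nvidia-smi` report the total per-process footprint (including CUDA context and caches), so its numbers will be higher than this function's.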
You did not write what form of parallelism you are using: FSDP, DeepSpeed, DDP? Are you using accelerate? What are the configs, and how do you launch the scripts?
```python
import argparse
import datetime

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, TaskType
from trl import SFTTrainer, SFTConfig

parser = argparse.ArgumentParser()
parser.add_argument(
    '-e', '--n_epochs',
    type=int,
    help="number of epochs",
    default=1,
)
args = parser.parse_args()
n_epochs = int(args.n_epochs)

seed = 42
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Make runs reproducible
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    add_eos_token=True,
    # trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "right"

def generate_supervised_chat(row):
    # Build a zh->ja translation chat sample and render it with the
    # model's chat template
    chat = [
        {'role': 'system', 'content': '你是一位优秀的翻译专家。请把给定的中文文本翻译为日语,只回复翻译后的文本。'},
        {'role': 'user', 'content': f'请把下面的中文文本翻译为日语文本。 中文文本: {row["Ch"]}'},
        {'role': 'assistant', 'content': f'此文本翻译后的结果如下。 日语翻译文本: {row["Ja"]} 以上。'},
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)

def add_text(row):
    row['text'] = generate_supervised_chat(row)
    return row

jjs_dataset_dir = "wccjc-dataset"
dataset = load_dataset(
    jjs_dataset_dir,
    data_files={'train': 'trainall.tsv', 'test': 'test.tsv', 'valid': 'valid.tsv'},
    sep='\t',
    names=['Ch', 'Ja'],
)
dataset = dataset["train"]
dataset = dataset.map(add_text)
print(dataset)
print(dataset[0]["text"])

# 4-bit NF4 quantization for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # or float16
    bnb_4bit_use_double_quant=True,
)

now = datetime.datetime.now()
print('Loading base model:', model_id, now)
print('Train epochs:', n_epochs)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # {"": 0},
)
now = datetime.datetime.now()
print('Loading ended', now)
model.config.use_cache = False
model.config.pretraining_tp = 1

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "o_proj", "gate_proj", "up_proj",
                    "down_proj", "k_proj", "v_proj"],
)

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
print("per_device_train_batch_size:", per_device_train_batch_size)
print("gradient_accumulation_steps:", gradient_accumulation_steps)

sft_config = SFTConfig(
    output_dir="./train_logs",
    fp16=True,
    seed=42,
    num_train_epochs=n_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    weight_decay=0.001,
    save_steps=1000,
    logging_steps=25,
    group_by_length=True,
    report_to="tensorboard",
    max_seq_length=512,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=lora_config,
    args=sft_config,
    packing=False,
)

now = datetime.datetime.now()
print('training...', now)
trainer.train()
now = datetime.datetime.now()
print('training ended', now)
print('saving model')
trainer.save_model(f'./jjs_llama3_lora_model-2x3-ep{n_epochs}')
```
This is all my code. With `device_map="auto"`, can't I enable multiple GPUs?
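One detail worth checking: `device_map="auto"` shards the model's layers across all visible GPUs (model sharding for big-model loading), which is consistent with each GPU appearing only partially utilized. Under DDP, each process should instead hold a full copy of the model on its own GPU. A common pattern for that (an assumption for illustration, not code from this thread) is to map the whole model to the process-local device:

```python
import os

# Under DDP, each worker process is given its GPU index via LOCAL_RANK.
# Mapping the empty-string key sends the entire model to that one device,
# instead of sharding it across GPUs as device_map="auto" does.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
device_map = {"": local_rank}
print(device_map)
```

This `device_map` would then be passed to `AutoModelForCausalLM.from_pretrained` in place of `"auto"`.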
So are you running your script just with `python train.py`? This will not be sufficient for parallelism. I would recommend using accelerate and checking the options there. Note that you don't need to explicitly create the `accelerator` instance etc., since `SFTTrainer` already takes care of that. But you still need to choose your parallelism strategy (DDP, FSDP, DS), configure accelerate accordingly, and then run `accelerate launch train.py`.
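Concretely, the workflow described above might look like this (a sketch that assumes accelerate is installed and that the script above is saved as `train.py`):

```shell
# Answer the interactive prompts to pick a strategy (DDP, FSDP, DeepSpeed)
# and the number of GPUs; this writes a default config file.
accelerate config

# Launch one process per GPU, e.g. for 2 GPUs:
accelerate launch --num_processes 2 train.py --n_epochs 1
```

`accelerate launch` reads the saved config, so flags like `--num_processes` are only needed to override it.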
Thank you.
System Info
peft=0.11.1, python=3.10
Who can help?
When I run this script on a single GPU, there is no problem. When I try to run it on 2 GPUs, system monitoring shows that the utilization of each GPU is only about half. When I try to increase `per_device_train_batch_size` and `gradient_accumulation_steps`, I run out of GPU memory. What should I do?
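For what it's worth, under data parallelism the effective global batch size grows with the per-device batch size, the accumulation steps, and the number of GPUs together, so raising the first two on 2 GPUs increases the per-GPU footprint just as it would on one. A quick sanity check (illustrative arithmetic only):

```python
def effective_batch_size(per_device, grad_accum, num_gpus):
    # Global examples consumed per optimizer step under data parallelism
    return per_device * grad_accum * num_gpus

print(effective_batch_size(4, 4, 1))  # 16 on one GPU
print(effective_batch_size(4, 4, 2))  # 32 on two GPUs
```

To keep the global batch size (and learning dynamics) comparable when going from 1 to 2 GPUs, one of the other two factors would be halved rather than increased.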
Reproduction
Expected behavior
Run on 2 GPUs with both fully utilized.