Ok! I've fixed it for now and the fine-tuning is running.
If you have the same issue, don't forget to add some args while loading the model, like:
low_cpu_mem_usage=True, torch_dtype=torch.float16
It should fix the memory error.
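For anyone hitting the same memory error, here is a minimal sketch of what that loading call can look like (the model name and device_map are just placeholders, adapt them to your setup):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",  # placeholder: use your own base model
    low_cpu_mem_usage=True,          # stream weights instead of building a full fp32 copy in RAM first
    torch_dtype=torch.float16,       # half precision: roughly halves the memory needed for the weights
    device_map="auto",               # assumption: let accelerate place the layers on the available GPU(s)
)
```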
Also be careful about how you're loading your dataset!
If you have any questions, let me know.
Hi, unfortunately the DPO fine-tuning didn't go well. After many tries, at around 50% it throws this error:
Traceback (most recent call last):
File "/home/user/test/FT/dpo_finetuning.py", line 159, in <module>
File "/home/user/test/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1556, in train
return inner_training_loop(
File "/home/user/test/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1934, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/user/test/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2283, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/home/user/test/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3026, in evaluate
output = eval_loop(
File "/home/user/test/venv/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 616, in evaluation_loop
policy_output_decoded, ref_output_decoded = self.get_batch_samples(self.model, random_batch)
File "/home/user/test/venv/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 513, in get_batch_samples
policy_output = model.generate(
File "/home/user/test/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/test/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1800, in generate
return self.sample(
File "/home/user/test/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2933, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
I tried to decrease the learning rate but it doesn't change anything...
Any suggestion?
Looks indeed like a divergence issue. Have you tried with the default dataset before using your own custom one?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
@mathis-lambert Did you solve this? I'm getting exactly the same error at iteration 500 with the default dataset.
Sure, thanks a lot, enjoy your holiday!
On Tue, Dec 26, 2023 at 2:48 PM Mathis wrote:
Hi @guyadamSC,
Yes, if I remember well, I managed to handle this issue, but I don't remember how. I'm currently on vacation, so I might be back with an answer in a few days.
Merry Christmas and Happy holidays 🎄
I have the same problem! It also threw the same error at 50%. Did you solve this?
Hi @guyadamSC @AIR-hl,
I'm back to try to help you...
Can you explain in a bit more detail what dataset and model you are using, and on what hardware? And of course, what's your trainer config (batch_size, gradient_accumulation, ...)?
Maybe I'll be able to help you with this info.
Regards
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Hello! I have the same problem. I guess my data is too long; I use an A100 40G and my max_prompt_length is 512 and max_length is 1024, but I can't change the training data. What should I do to solve the CUDA out of memory error? Thanks!!!
Hi! What model are you using?
Hello, my model is codellama-7b-python and I do SFT before the DPO, and here is my dataset format:
Now I am using batch_size 1, per_device_train(eval)_batch_size 1, and gradient_accumulation_steps 4. It starts training and the output is like:
I don't know if the data is normal.
Here is the code:
```python
import os
import json
from dataclasses import dataclass, field
from typing import Dict, Optional

import torch
from accelerate import Accelerator
from datasets import Dataset, load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, TrainingArguments, set_seed
from trl import DPOTrainer

os.environ["WANDB_DISABLED"] = "true"


class ScriptArguments:
    beta = 0.1
    model_name_or_path = "/public/home/zhouxiabing/data/xinj/RL_Model/model"
    learning_rate = 1.4e-6
    lr_scheduler_type = "cosine"
    warmup_steps = 100
    optimizer_type = "adamw_torch"
    per_device_train_batch_size = 1
    per_device_eval_batch_size = 1
    batch_size = 1
    gradient_accumulation_steps = 4
    lora_alpha = 16
    lora_dropout = 0.05
    lora_r = 64
    max_prompt_length = 256
    max_length = 512
    max_steps = 1000
    logging_steps = 10
    save_steps = 100
    eval_steps = 100
    output_dir = "./results"
    load_best_model_at_end = True
    save_total_limit = 3


def get_dataset(data_dir: str):
    dataset = load_dataset("json", data_files=data_dir, split="train")

    def map_data(samples):
        instruct = "Given a piece of code, output the corresponding implementation idea.\n"
        return {
            "prompt": instruct + samples["text"],
            "chosen": samples["chosen"],
            "rejected": samples["rejected"],
        }

    return dataset.map(map_data)


if __name__ == "__main__":
    # args = parser.parse_args_into_dataclasses()[0]
    args = ScriptArguments()

    # 1. Load the pretrained model
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        load_in_8bit=True,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map="cuda:0",
    )
    model.config.use_cache = True

    tokenizer = AutoTokenizer.from_pretrained("/public/home/zhouxiabing/data/xinj/codellama-7b-python")
    tokenizer.add_eos_token = True
    tokenizer.pad_token_id = 0
    tokenizer.padding_side = "left"

    # 2. Load the datasets
    train_dataset = get_dataset("new_train.jsonl")
    # train_dataset = train_dataset.filter(
    #     lambda x: len(x["prompt"]) + len(x["chosen"]) <= args.max_length
    #     and len(x["prompt"]) + len(x["rejected"]) <= args.max_length
    # )
    eval_dataset = get_dataset("new_test.jsonl")
    # eval_dataset = eval_dataset.filter(
    #     lambda x: len(x["prompt"]) + len(x["chosen"]) <= args.max_length
    #     and len(x["prompt"]) + len(x["rejected"]) <= args.max_length
    # )

    # 3. Prepare the training args
    training_args = TrainingArguments(
        per_device_eval_batch_size=args.per_device_eval_batch_size,
        per_device_train_batch_size=args.per_device_train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        logging_steps=args.logging_steps,
        eval_steps=100,
        save_steps=100,
        num_train_epochs=2,
        optim="adamw_torch",
        evaluation_strategy="steps",  # if val_set_size > 0 else "no",
        save_strategy="steps",
        save_total_limit=3,
        fp16=True,
        output_dir=args.output_dir,
        load_best_model_at_end=True,
        learning_rate=args.learning_rate,
        warmup_steps=50,
    )

    config = LoraConfig(
        r=64,
        lora_alpha=16,
        target_modules=[
            "v_proj",
            "up_proj",
            "down_proj",
            "k_proj",
            "gate_proj",
            "q_proj",
            "o_proj",
        ],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    # 4. Prepare the DPO trainer
    dpo_trainer = DPOTrainer(
        model,
        ref_model=None,
        args=training_args,
        beta=0,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        peft_config=config,
        max_prompt_length=256,
        max_length=512,
    )

    # 5. Train
    dpo_trainer.train()
    dpo_trainer.save_model(args.output_dir)

    output_dir = os.path.join(args.output_dir, "final_checkpoint")
    # dpo_trainer.model.save_pretrained(output_dir)
```
@virt9 What exactly is the type of the model you're loading?
Hi @mathis-lambert,
the model is LoRA safetensors after SFT,
and the config is:
{ "alpha_pattern": {}, "auto_mapping": null, "base_model_name_or_path": "/public/home/zhouxiabing/data/xinj/codellama-7b-python", "bias": "none", "fan_in_fan_out": false, "inference_mode": true, "init_lora_weights": true, "layer_replication": null, "layers_pattern": null, "layers_to_transform": null, "loftq_config": {}, "lora_alpha": 16, "lora_dropout": 0.05, "megatron_config": null, "megatron_core": "megatron.core", "modules_to_save": null, "peft_type": "LORA", "r": 64, "rank_pattern": {}, "revision": null, "target_modules": [ "v_proj", "down_proj", "q_proj", "up_proj", "gate_proj", "k_proj", "o_proj" ], "task_type": "CAUSAL_LM", "use_dora": false, "use_rslora": false }
@virt9 Okay,
@mathis-lambert Hello, now I am following the code I mentioned above to train the model and it keeps training with no CUDA error, but the loss is NaN and the rewards are NaN. Now I am going to tackle the new problems. Thanks a lot for your help!!
OK, nice, you don't have the CUDA OUT OF MEMORY error anymore. But you can stop the training, it has failed. I already experienced this issue; did the loss and rewards still have a value up to epoch 50?
Oh yes, to prevent the CUDA out of memory I set all the batch sizes to 1, and it works for me. So I have 1770+ steps, and for the first few hundred steps the values were right, but now they are 0. Why is that?
@virt9 Damn, I've had this same behavior but I can't remember how I solved it... As for the batch_size, yup, it's normal that you can't go above 1. You only have 40 GB of VRAM, and a LoRA fine-tuning loads the weights in FP32, so it takes about 4 GB per billion parameters: a 7B model in FP32 should weigh around 28 GB, plus the dataset size. That's why.
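A quick back-of-the-envelope check of that estimate (parameter count is approximate, and this ignores gradients, optimizer states and activations):

```python
# Rough memory needed just to hold a 7B model's weights at different precisions.
params = 7e9  # ~7 billion parameters (CodeLlama-7B)
for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
# fp32: ~28 GB, fp16/bf16: ~14 GB, int8: ~7 GB
```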
If the loss is 0, maybe fp16 is overflowing; please try bf16.
With this:
`torch_dtype=torch.bfloat16`
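Applied to your script above, the change would look roughly like this (a sketch, untested; the bf16=True training argument is an extra suggestion on top of what was mentioned):

```python
# Load the base model in bfloat16 instead of float16: bf16 keeps the fp32
# exponent range, so it is much less prone to the overflow/underflow that
# can zero out the loss.
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)

# And, correspondingly, in TrainingArguments (assumption, not stated above):
# bf16=True instead of fp16=True
```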
@mathis-lambert Thanks a lot, I will try it later.
@mathis-lambert Sorry to bother you again. I tried bf16 and after 1200 steps the loss is 0 again. I also tried to raise my batch_size from 1 to 2, but then the CUDA out of memory error comes back. Did you run into this? Thanks!
Sorry, I have no idea. Maybe you should open a new issue.
OK, thanks a lot
Hi, I'm working on fine-tuning Zephyr-7B with my own dataset in the DPO format.
I have used the code you provide in your examples:
dpo.py
I ran it on an H100 instance with 80 GB of VRAM, but I get this error:
I've changed the batch size, but it doesn't make it work.
Any hint?
Thanks a lot