huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Cuda OutOfMemory error for `DPO Finetuning` #1024

Closed mathis-lambert closed 7 months ago

mathis-lambert commented 10 months ago

Hi, I'm working on fine-tuning Zephyr-7B with my own dataset in the DPO format:

{
   "prompt": list[str],
   "chosen": list[str],
   "rejected": list[str]
}
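For reference, a toy dataset in this shape can be built with the `datasets` library like so (a minimal sketch; the example rows are invented):

```python
from datasets import Dataset

# Invented toy preference pairs, just to illustrate the expected columns.
train_dataset = Dataset.from_dict({
    "prompt":   ["What is the capital of France?"],
    "chosen":   ["The capital of France is Paris."],
    "rejected": ["France does not have a capital city."],
})
```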

I used the code you provide in your examples (dpo.py) and ran it on an H100 instance with 80 GB of VRAM, but I get this error:

(venv) user@instance-h100-1:~/test/FT$ python dpo_finetuning.py --max_length=4096
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  6.27it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  6.41it/s]
/home/user/test/venv/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py:176: UserWarning: When using DPODataCollatorWithPadding, you should set `max_prompt_length` in the DPOTrainer's init it will be set to `128` by default, but you should do it yourself in the future.
  warnings.warn(
[codecarbon INFO @ 15:57:46] [setup] RAM Tracking...
[codecarbon INFO @ 15:57:46] [setup] GPU Tracking...
[codecarbon INFO @ 15:57:46] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 15:57:46] [setup] CPU Tracking...
[codecarbon WARNING @ 15:57:46] No CPU tracking mode found. Falling back on CPU constant mode.
[codecarbon WARNING @ 15:57:47] We saw that you have a AMD EPYC 9334 32-Core Processor but we don't know it. Please contact us.
[codecarbon INFO @ 15:57:47] CPU Model on constant consumption mode: AMD EPYC 9334 32-Core Processor
[codecarbon INFO @ 15:57:47] >>> Tracker's metadata:
[codecarbon INFO @ 15:57:47]   Platform system: Linux-5.15.0-89-generic-x86_64-with-glibc2.35
[codecarbon INFO @ 15:57:47]   Python version: 3.10.12
[codecarbon INFO @ 15:57:47]   CodeCarbon version: 2.2.3
[codecarbon INFO @ 15:57:47]   Available RAM : 235.941 GB
[codecarbon INFO @ 15:57:47]   CPU count: 24
[codecarbon INFO @ 15:57:47]   CPU model: AMD EPYC 9334 32-Core Processor
[codecarbon INFO @ 15:57:47]   GPU count: 1
[codecarbon INFO @ 15:57:47]   GPU model: 1 x NVIDIA H100 PCIe
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.16.0
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
  0%|                                                                                                                          | 0/1000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/user/test/FT/dpo_finetuning.py", line 163, in <module>
    dpo_trainer.train()
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1892, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2776, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/user/test/venv/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 500, in compute_loss
    loss, metrics = self.get_batch_metrics(model, inputs, train_eval="train")
  File "/home/user/test/venv/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 451, in get_batch_metrics
    ) = self.concatenated_forward(model, batch)
  File "/home/user/test/venv/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 417, in concatenated_forward
    all_logits = model(
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 659, in forward
    return model_forward(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 647, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1045, in forward
    outputs = self.model(
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 932, in forward
    layer_outputs = decoder_layer(
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 621, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 262, in forward
    key_states = self.k_proj(hidden_states)
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 79.11 GiB of which 6.62 MiB is free. Including non-PyTorch memory, this process has 79.09 GiB memory in use. Of the allocated memory 77.89 GiB is allocated by PyTorch, and 641.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I've tried changing the batch size, but it doesn't fix the problem.

Any hints?

Thanks a lot

mathis-lambert commented 10 months ago

OK! I've fixed it for now and the fine-tuning is running. If you have the same issue, don't forget to pass a few arguments when loading the model, such as `low_cpu_mem_usage=True` and `torch_dtype=torch.float16`; that should fix the memory error. Also be careful about how you load your dataset!
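Concretely, the loading call looks something like this (a sketch; the Zephyr checkpoint id is an assumption, and the reference model should be loaded the same way):

```python
import torch
from transformers import AutoModelForCausalLM

# Half-precision weights plus low_cpu_mem_usage avoid materialising a full fp32 copy.
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
```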

If you have any questions, let me know

mathis-lambert commented 10 months ago

Hi, unfortunately the DPO fine-tuning didn't go well. After several tries it throws this error at around 50% progress:

Traceback (most recent call last):
  File "/home/user/test/FT/dpo_finetuning.py", line 159, in <module>
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1934, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2283, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3026, in evaluate
    output = eval_loop(
  File "/home/user/test/venv/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 616, in evaluation_loop
    policy_output_decoded, ref_output_decoded = self.get_batch_samples(self.model, random_batch)
  File "/home/user/test/venv/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 513, in get_batch_samples
    policy_output = model.generate(
  File "/home/user/test/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1800, in generate
    return self.sample(
  File "/home/user/test/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2933, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

I tried decreasing the learning rate, but it doesn't change anything...

Any suggestions?

lvwerra commented 10 months ago

Looks indeed like a divergence issue. Have you tried with the default dataset before using your own custom one?
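For a quick sanity check, a run on a public preference dataset can be set up roughly like this (a sketch; it splits Anthropic/hh-rlhf on the last assistant turn, similar to what the example script does):

```python
from datasets import load_dataset

def split_prompt_and_responses(sample):
    # Everything up to (and including) the final "\n\nAssistant:" turn is the shared prompt.
    marker = "\n\nAssistant:"
    cut = sample["chosen"].rfind(marker) + len(marker)
    return {
        "prompt": sample["chosen"][:cut],
        "chosen": sample["chosen"][cut:],
        "rejected": sample["rejected"][cut:],
    }

train_dataset = load_dataset("Anthropic/hh-rlhf", split="train").map(split_prompt_and_responses)
```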

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

guyadamSC commented 8 months ago

> Hi, unfortunately the DPO fine-tuning didn't go well. After several tries it throws this error at around 50% progress: [...]
>
> RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
>
> I tried decreasing the learning rate, but it doesn't change anything... Any suggestions?

@mathis-lambert Did you solve this? I'm getting exactly the same error at iteration 500 with the default dataset.

guyadamSC commented 8 months ago

Sure, thanks a lot, enjoy your holiday!

On Tue, Dec 26, 2023 at 2:48 PM Mathis wrote:

> Hi @guyadamSC
>
> Yes, if I remember well, I managed to handle this issue, but I don't remember how. I'm currently on vacation, so I should be back with an answer in a few days.
>
> Merry Christmas and happy holidays 🎄

AIR-hl commented 8 months ago

I have the same problem! It also threw the same error at 50%. Did you solve it?

mathis-lambert commented 8 months ago

Hi @guyadamSC @AIR-hl,

I'm back to try to help you.

Could you explain in a bit more detail which dataset and model you are using, and on what hardware? And of course, what your trainer config is (batch_size, gradient_accumulation, ...)?

With that information I may be able to help you.

Regards

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

virt9 commented 4 months ago

Hello! I'm hitting the same problem. I guess my data is too long. I'm using an A100 40 GB with max_prompt_length 512 and max_length 1024, but I can't change the training data. What should I do to solve the CUDA out-of-memory error? Thanks!
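For reference, the memory-saving settings that are usually tried first in this situation look roughly like this (a sketch; the values are illustrative, not taken from this thread):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./dpo-out",
    per_device_train_batch_size=1,   # smallest possible micro-batch
    gradient_accumulation_steps=8,   # keep a usable effective batch size
    gradient_checkpointing=True,     # trade extra compute for much lower activation memory
    bf16=True,                       # half precision; the A100 supports bfloat16
)
# Loading the policy in 8-bit/4-bit (bitsandbytes) and training LoRA adapters
# instead of the full weights also reduces memory substantially.
```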

mathis-lambert commented 4 months ago

> Hello! I'm hitting the same problem. I guess my data is too long. I'm using an A100 40 GB with max_prompt_length 512 and max_length 1024, but I can't change the training data. What should I do to solve the CUDA out-of-memory error? Thanks!

Hi! What model are you using?

virt9 commented 4 months ago

Hello, my model is codellama-7b-python and I ran SFT before DPO. Here is my dataset format (screenshot). I'm now using batch_size 1, per_device_train(eval)_batch_size 1 and gradient_accumulation_steps 4. It starts training and the output looks like the attached screenshot; I don't know whether those values are normal.

virt9 commented 4 months ago

Here is the code:

```python
import os
from dataclasses import dataclass, field
from typing import Dict, Optional
import json

import torch
from accelerate import Accelerator
from datasets import Dataset, load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, TrainingArguments, set_seed
from trl import DPOTrainer

os.environ["WANDB_DISABLED"] = "true"


@dataclass
class ScriptArguments:
    # note: these are plain class attributes (no type annotations),
    # so HfArgumentParser will not expose them as CLI flags
    beta = 0.1
    model_name_or_path = "/public/home/zhouxiabing/data/xinj/RL_Model/model"
    learning_rate = 1.4e-6
    lr_scheduler_type = "cosine"

    warmup_steps = 100
    optimizer_type = "adamw_torch"
    per_device_train_batch_size = 1
    per_device_eval_batch_size = 1
    batch_size = 1
    gradient_accumulation_steps = 4

    lora_alpha = 16
    lora_dropout = 0.05
    lora_r = 64

    max_prompt_length = 256
    max_length = 512
    max_steps = 1000
    logging_steps = 10
    save_steps = 100
    eval_steps = 100

    output_dir = "./results"
    load_best_model_at_end = True
    save_total_limit = 3


def get_dataset(data_dir: str):
    dataset = load_dataset("json", data_files=data_dir, split="train")

    def map_data(samples):
        instruct = "Given a piece of code, output the corresponding implementation idea.\n"
        return {
            "prompt": instruct + samples["text"],
            "chosen": samples["chosen"],
            "rejected": samples["rejected"],
        }

    return dataset.map(map_data)


if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    # args = parser.parse_args_into_dataclasses()[0]
    args = ScriptArguments()

    # 1. Load the pretrained model
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        load_in_8bit=True,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map="cuda:0",
    )
    model.config.use_cache = True

    tokenizer = AutoTokenizer.from_pretrained("/public/home/zhouxiabing/data/xinj/codellama-7b-python")
    tokenizer.add_eos_token = True
    tokenizer.pad_token_id = 0
    tokenizer.padding_side = "left"

    # 2. Load the datasets
    train_dataset = get_dataset("new_train.jsonl")
    # train_dataset = train_dataset.filter(
    #     lambda x: len(x["prompt"]) + len(x["chosen"]) <= args.max_length
    #     and len(x["prompt"]) + len(x["rejected"]) <= args.max_length
    # )

    eval_dataset = get_dataset("new_test.jsonl")
    # eval_dataset = eval_dataset.filter(
    #     lambda x: len(x["prompt"]) + len(x["chosen"]) <= args.max_length
    #     and len(x["prompt"]) + len(x["rejected"]) <= args.max_length
    # )

    # 3. Prepare the training arguments
    training_args = TrainingArguments(
        per_device_eval_batch_size=args.per_device_eval_batch_size,
        per_device_train_batch_size=args.per_device_train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        logging_steps=args.logging_steps,
        eval_steps=100,
        save_steps=100,
        num_train_epochs=2,
        optim="adamw_torch",
        evaluation_strategy="steps",  # if val_set_size > 0 else "no",
        save_strategy="steps",
        save_total_limit=3,
        fp16=True,
        output_dir=args.output_dir,
        load_best_model_at_end=True,
        learning_rate=args.learning_rate,
        warmup_steps=50,
    )

    config = LoraConfig(
        r=64,
        lora_alpha=16,
        target_modules=[
            "v_proj",
            "up_proj",
            "down_proj",
            "k_proj",
            "gate_proj",
            "q_proj",
            "o_proj",
        ],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    # 4. Prepare the DPO trainer
    dpo_trainer = DPOTrainer(
        model,
        ref_model=None,
        args=training_args,
        beta=0,  # note: this overrides the beta=0.1 defined in ScriptArguments
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        peft_config=config,
        max_prompt_length=256,
        max_length=512,
    )

    # 5. Train
    dpo_trainer.train()
    dpo_trainer.save_model(args.output_dir)

    output_dir = os.path.join(args.output_dir, "final_checkpoint")
    # dpo_trainer.model.save_pretrained(output_dir)
```

mathis-lambert commented 4 months ago

@virt9 What exactly is the type of the model you're loading?

virt9 commented 4 months ago

Hi @mathis-lambert, the model is the LoRA safetensors checkpoint saved after SFT (see screenshot), and the adapter config is:

```json
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "/public/home/zhouxiabing/data/xinj/codellama-7b-python",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 64,
  "rank_pattern": {},
  "revision": null,
  "target_modules": ["v_proj", "down_proj", "q_proj", "up_proj", "gate_proj", "k_proj", "o_proj"],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
```
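For context, loading such an adapter on top of its base model typically looks like this (a sketch; the paths come from the config above, and merging the adapter is optional):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "/public/home/zhouxiabing/data/xinj/codellama-7b-python",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
# Attach the SFT LoRA adapter, then (optionally) merge it into the base weights.
model = PeftModel.from_pretrained(base, "/public/home/zhouxiabing/data/xinj/RL_Model/model")
model = model.merge_and_unload()
```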

mathis-lambert commented 4 months ago

@virt9 Okay,

virt9 commented 4 months ago

@mathis-lambert Hello, I'm now training the model with the code I posted above and it keeps training with no CUDA error, but the loss is nan and the rewards are nan. I'm going to work on these new problems now. Thanks a lot for your help!

mathis-lambert commented 4 months ago

> ... it keeps training with no CUDA error, but the loss is nan and the rewards are nan ...

OK, nice that you no longer get the CUDA out-of-memory error. But you can stop the training: it has failed. I've already run into this issue. Did the loss and rewards still have values up to epoch 50?

virt9 commented 4 months ago

> ... the loss is nan and the rewards are nan ...
>
> ... Did the loss and rewards still have values up to epoch 50?

Oh yes. To prevent the CUDA out of memory I set all the batch sizes to 1, and that works for me. So I have 1770+ steps, and for the first few hundred steps the values are right, but now they are 0. Why is that?

mathis-lambert commented 4 months ago

@virt9 Damn, I've had this same behavior but I can't remember how I solved it... As for the batch size, yes, it's normal that you can't go above 1. You only have 40 GB of VRAM, and a LoRA fine-tune loads the base weights in FP32, so they take 4 bytes per parameter: a 7B model in FP32 weighs around 28 GB, plus the dataset. That's why.
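Spelled out, the back-of-the-envelope numbers are (a sketch; it ignores activations, gradients, optimizer state and the reference model):

```python
params = 7e9  # roughly the parameter count of a 7B model

for dtype, bytes_per_param in {"fp32": 4, "fp16/bf16": 2, "int8": 1}.items():
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:.0f} GB just for the weights")
# fp32: ~28 GB, fp16/bf16: ~14 GB, int8: ~7 GB
```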

AIR-hl commented 4 months ago

If the loss is 0, maybe fp16 is overflowing; please try bf16.

mathis-lambert commented 4 months ago

With this:

`torch_dtype=torch.bfloat16`
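i.e. when loading the model, roughly (a sketch; the path is the one from the script above, and `bf16=True` should be set in the `TrainingArguments` as well):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/public/home/zhouxiabing/data/xinj/RL_Model/model",
    torch_dtype=torch.bfloat16,   # same memory as fp16 but a wider exponent range, so it overflows far less easily
    low_cpu_mem_usage=True,
)
```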

virt9 commented 4 months ago

@mathis-lambert Thanks a lot, I'll try it later.

virt9 commented 4 months ago

@mathis-lambert Sorry to bother you again. I tried bf16 and after 1200 steps the loss is 0 again. I also found a way to raise my batch size from 1 to 2, but then the CUDA out-of-memory error comes back. Did you run into this? Thanks!

AIR-hl commented 4 months ago

Sorry, I have no idea. Maybe you should open a new issue.

virt9 commented 4 months ago

OK, thanks a lot