mano3-1 opened this issue 5 months ago
Are you training on `embed_tokens` and `lm_head`?
Hi @danielhanchen,
Thank you for your response. I'm unsure about the inner workings of `get_peft_model` in Unsloth, but assuming it functions similarly to other PEFT methods, it should freeze the base model, including the embedding matrix, correct? Consequently, I believe my scripts are only training the LoRA parameters. I attempted to use Unsloth's `fix_untrained_tokens`, but it didn't work for me. Additionally, I noticed that Unsloth's blog mentions the llama-3-8b base model, whereas I'm using the llama-3-8b-instruct model. The instruct model's reserved tokens shouldn't cause any issues since they were fine-tuned (unlike in the base model), right?
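One quick way to confirm that assumption is to list which parameters are actually trainable after `get_peft_model`; a minimal generic-PEFT sketch (nothing Unsloth-specific, with `model` being the wrapped model):

```python
# If the base model (including embed_tokens and lm_head) is frozen,
# only "lora_" parameters should show up as trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
non_lora = [n for n in trainable if "lora" not in n.lower()]
print(f"{len(trainable)} trainable tensors; non-LoRA trainable: {non_lora}")
```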
@mano3-1 what does the traceback say if you run

```python
with torch.autograd.detect_anomaly():
    trainer.train()
```
Hi @lapp0, Here is the traceback:

```
Traceback (most recent call last):
  File "/home/ubuntu/LLMOps/train/train.py", line 501, in <module>
    main()
  File "/home/ubuntu/LLMOps/train/train.py", line 497, in main
    training_function(args)
  File "/home/ubuntu/LLMOps/train/train.py", line 445, in training_function
    trainer.train()
  File "/opt/conda/envs/LLMOps/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 361, in train
    output = super().train(*args, **kwargs)
  File "/opt/conda/envs/LLMOps/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "<string>", line 361, in _fast_inner_training_loop
  File "/opt/conda/envs/LLMOps/lib/python3.10/site-packages/transformers/trainer.py", line 3147, in training_step
    self.accelerator.backward(loss)
  File "/opt/conda/envs/LLMOps/lib/python3.10/site-packages/accelerate/accelerator.py", line 2013, in backward
    loss.backward(**kwargs)
  File "/opt/conda/envs/LLMOps/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/LLMOps/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'Fast_CrossEntropyLossBackward' returned nan values in its 0th output.
```
I'm running into issues with back-propagation in Unsloth as well, albeit with a custom loss function and Mistral instead of llama-3. It works fine with `AutoModelForCausalLM` and `get_peft_model`, but with Unsloth I get `RuntimeError: Function 'LoRA_MLPBackward' returned nan values in its 0th output.`:
File "<string>", line 361, in _fast_inner_training_loop
File "/opt/conda/lib/python3.10/site-packages/trl/trainer/policy_trainer_base.py", line 549, in training_step
return super().training_step(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3147, in training_step
self.accelerator.backward(loss)
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2013, in backward
loss.backward(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
return user_fn(self, *args)
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 142, in decorate_bwd
return bwd(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/unsloth/models/_utils.py", line 348, in backward
torch.autograd.backward(output, dY)
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LoRA_MLPBackward' returned nan values in its 0th output.
I'd be interested in the cause of your issue; perhaps it's the same as mine. If I figure anything out with mine, I'll let you know.
Hi @lapp0, seems like we're both facing a similar issue. I tried removing Unsloth from my code and training with the Hugging Face utilities, and it went well. But I'd really like to keep Unsloth in the loop, because the memory savings are significant. Do you think this is on Unsloth's side, or something arising from our scripts?
I'm not sure. The backward step where yours fails is in a different layer of the model than mine, but the only thing our scripts have in common is Unsloth.
How about some debug details?
1) Could you please share a full reproduction script that would allow me and Daniel to run it locally? This includes the whole source file along with your run command.
2) What is the output of `pip3 freeze`?
Here is the pip freeze: requirements.txt
Here is the full training script: link
This is how I trigger the training script:

```bash
python train.py --max_seq_length 4000 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --sm_train_dir "/opt/ml/processing/train" --sm_validation_dir "/opt/ml/processing/test" --hf_token <yourtoken> --run_experiment False --lora_r 32 --lora_alpha 8 --unsloth True --logging_steps 8 --save_steps 8
```

You may set `hf_token` to the string "None" if you are loading Unsloth models, I guess.
requirements.txt isn't the same as pip freeze. `pip3 freeze` will detail the versions of all installed packages.
Oh no, sorry guys - I will take a look
Thanks @danielhanchen
Here is my reproduction script as well, run on a 4090 with CUDA 12.1. @mano3-1 has a standard SFT script, so his is probably worth looking at first.
"""
INSTALL DEPS:
pip install torch==2.3.0
pip install transformers tensorboardX bitsandbytes peft accelerate flash_attn --upgrade
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "git+https://github.com/lapp0/trl.git@ppov2"
pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
pip install torch==2.3.0 # ensure correct torch still used
"""
import multiprocessing
from datasets import load_dataset
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
PreTrainedModel,
DataCollatorWithPadding,
BitsAndBytesConfig
)
import torch
from trl.trainer.ppov2_trainer import PPOConfig, PPOTrainer, PolicyAndValueWrapper
from peft import get_peft_model, LoraConfig
base_model_uri = "HuggingFaceH4/mistral-7b-sft-beta"
reward_model_uri = "weqweasdas/RM-Mistral-7B"
################
# Model & Tokenizer
################
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
base_model_uri,
padding_side="left",
trust_remote_code=True,
)
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
reward_model: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(
reward_model_uri,
num_labels=1,
quantization_config=quantization_config,
attn_implementation="flash_attention_2",
)
value_model: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(
reward_model_uri,
num_labels=1,
quantization_config=quantization_config,
attn_implementation="flash_attention_2",
)
value_model = get_peft_model(
value_model,
LoraConfig(
r=16,
lora_alpha=64,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
task_type="CAUSAL_LM",
)
)
from unsloth import FastLanguageModel
base_policy, _ = FastLanguageModel.from_pretrained(
model_name=base_model_uri,
max_seq_length=2048,
dtype=None,
load_in_4bit=True,
)
base_policy = FastLanguageModel.get_peft_model(
base_policy,
r=16,
lora_alpha=64,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
use_gradient_checkpointing="unsloth",
max_seq_length=2048
)
"""
# Creating base_policy like this works, unsloth doesn't
from transformers import AutoModelForCausalLM
base_policy = AutoModelForCausalLM.from_pretrained(
base_model_uri,
num_labels=1,
quantization_config=BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
),
attn_implementation="flash_attention_2",
)
lora_config = LoraConfig(
r=16,
lora_alpha=64,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
task_type="CAUSAL_LM",
)
base_policy = get_peft_model(base_policy, lora_config)
"""
# trl.trainer.peft_module_casting_to_bf16(base_model)
base_model = PolicyAndValueWrapper(base_policy, value_model)
################
# Dataset
################
raw_datasets = load_dataset("HuggingFaceH4/ultrachat_200k")
train_dataset = raw_datasets["train_sft"]
eval_dataset = raw_datasets["test_sft"]
def prepare_dataset(dataset, tokenizer):
"""pre-tokenize the dataset before training; only collate during training"""
def tokenize(element):
input_ids = tokenizer.apply_chat_template(
element["messages"][:1],
padding=False,
add_generation_prompt=True,
)
return {"input_ids": input_ids, "lengths": len(input_ids)}
return dataset.map(
tokenize,
remove_columns=dataset.column_names,
num_proc=multiprocessing.cpu_count(),
load_from_cache_file=False,
)
train_dataset = prepare_dataset(train_dataset, tokenizer).filter(lambda x: x["lengths"] <= 1024)
eval_dataset = prepare_dataset(eval_dataset, tokenizer).filter(lambda x: x["lengths"] <= 1024)
collator = DataCollatorWithPadding(tokenizer)
###############
# Training
################
config = PPOConfig(
output_dir="./ppov2_experiment_v2",
report_to="tensorboard",
update_generation_steps=16,
gradient_accumulation_steps=8,
per_device_train_batch_size=2,
push_to_hub=True,
hub_model_id="lapp0/ppov2_experiment_v2",
logging_steps=1,
learning_rate=3e-6,
save_steps=4,
non_eos_penalty=True,
response_length=128,
optim="paged_adamw_8bit",
bf16=True,
fp16=False,
truncate_token="eos",
gradient_checkpointing=True,
# gradient_checkpointing_kwargs={"use_reentrant": False},
)
trainer = PPOTrainer(
model=base_model,
args=config,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
reward_model=reward_model,
data_collator=collator,
tokenizer=tokenizer,
)
with torch.autograd.detect_anomaly():
trainer.train()
trainer.generate_completions()
`pip3 freeze`:
Hi @lapp0,
Although I named it requirements.txt, I generated it by running pip freeze. Kindly check the file; you will find the versions of all the libraries.
Sorry about my confusion @mano3-1.
I reviewed and compared our installed packages. Nothing noteworthy in the shared dependencies, other than perhaps the issue being related to the use of xformers. I'll experiment with this later.
Thanks for the code repro - will test this out - sorry on the issue again!
Also facing the same issue, using Colab and the standard notebook in the Unsloth folder. Thought I'd add that.
hey, I'm curious if someone has figured out a fix to this?
Sorry guys, just started debugging this. I also updated Unsloth, so maybe it might be better (hopefully). For local installations, please update Unsloth via:

```bash
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
```

Colab / Kaggle should be fine with a restart.
@DementedWeasel1971 When you said the Colab notebook we provided broke, could you point to exactly which one? Thanks.
@mano3-1 Extremely weird actually - I reran Colab with Instruct and it seems fine - would you be able to run just the conversational notebook for Llama-3 here: https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing
@lapp0 I'm currently running your PPO example here: https://colab.research.google.com/drive/1fgJv0eKlRKexOl2RqcxoiZ-HhGrdNWQW?usp=sharing (will wait for it to complete)
Thanks so much for looking into it! Unfortunately I'm still getting `nan` on the first training step:

```
{'loss': 1.9125, 'grad_norm': nan, 'learning_rate': 2.9999207167208437e-06, 'objective/kl': 0.0, 'objective/entropy': 99.8125, 'objective/non_score_reward': 0.0, 'objective/rlhf_reward': -0.58380126953125, 'objective/scores': -0.58380126953125, 'policy/approxkl_avg': 0.0, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -6.116561479529992e-09, 'loss/value_avg': 19.12525177001953, 'val/clipfrac_avg': 0.0011788890697062016, 'val/num_eos_tokens': 0.5, 'timer/training_step': 2.293384313583374, 'epoch': 0.0}
```

Please let me know if there are any other debug details that would help.
Also, FYI: to speed up debugging you can set `update_generation_steps=1`.
Edit: I pushed a bad commit to my branch; I've reverted the broken change. Should be good to try again with the head of https://github.com/lapp0/trl.git@ppov2.
Hi, I followed @danielhanchen's notebook and compared the parameters with mine. When I change the optimizer from `paged_adamw_32bit` to `adamw_8bit`, the nan issues no longer come up.
@lapp0 I can see paged adam in your script; perhaps change it to `adamw_8bit` and try again.
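In config terms that's a one-line change; e.g. in the `PPOConfig` from the repro script above (just a sketch, all other arguments unchanged):

```python
config = PPOConfig(
    # ... all other arguments as before ...
    optim="adamw_8bit",  # non-paged 8-bit AdamW, instead of "paged_adamw_8bit"
)
```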
@mano3-1 I changed it, but weirdly it seems like I still get nans. I also tried installing the February release of Unsloth with torch==2.1.1, and it's not working.
@lapp0 Do you know when it was last working? (which Unsloth version)
@mano3-1 I just tried using a non-paged optimizer as you suggested, but unfortunately it didn't resolve the issue.
@danielhanchen this is a brand-new trainer adapted from https://github.com/huggingface/trl/pull/1540, based on https://arxiv.org/pdf/2403.17031.
It has never been run successfully with Unsloth before, but it runs with peft + BnB. Shouldn't the forward and backward passes be identical to peft + BnB, or are there steps where precision loss occurs?
@mano3-1 @danielhanchen it's interesting that mano isn't getting nan, but you are. Perhaps there is something different between your environments?
Here's mine for context:
@mano3-1 Wait, if it works for you, then it might be a weird paged-optimizer issue.
@lapp0 Hmm, very weird indeed - yes, I only edit the forward and backward passes, but I'm assuming the wrapping mechanisms are causing issues, i.e. `PolicyAndValueWrapper` maybe - the best way is to inject nan gradient checks throughout the entire codebase to pinpoint the issue.
If I had to guess, the Cross Entropy Loss part is causing issues, since I manually shift the labels and append stuff, so maybe that might be the problem.
I also turned off `unsloth` gradient checkpointing, and it still doesn't work.
@danielhanchen one other thing that isn't tried / tested by the Unsloth community is interleaving training and generation, which this script does. I have a feeling that's a possible culprit. I'll experiment with training only on pre-generated samples when I get a chance.
Also, I don't think `PolicyAndValueWrapper` is the issue; I have another variant without any value model, with mostly similar code, and it also hits nan `grad_norm`.
For nan gradient checks, I'm already running with

```python
with torch.autograd.detect_anomaly():
    trainer.train()
```

Do you know a good way to inject hooks that apply more extensive and detailed nan checks?
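Something like the following is what I have in mind - just a rough, generic PyTorch sketch using `register_full_backward_hook` (names and structure are mine, untested against Unsloth's custom autograd functions):

```python
import torch

def install_nan_checks(model):
    """Flag the first module whose backward pass produces non-finite grads."""
    def make_hook(name):
        def hook(module, grad_input, grad_output):
            for g in grad_output:
                if g is not None and not torch.isfinite(g).all():
                    raise RuntimeError(f"non-finite grad_output in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_full_backward_hook(make_hook(name))

# e.g. on whatever holds the policy + value models:
install_nan_checks(base_model)
```

No idea yet whether full backward hooks interact badly with gradient checkpointing, though.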
Edit: I was mistaken about the source of the problem. However, I did discover that if my `per_device_batch_size` is 1, I don't get the error. I'm not sure what the reason might be.
@lapp0 Apologies for the delay! Ok, weird - so it might be something related to batching. Do you know if generation also uses `per_device_batch_size` internally?
@danielhanchen I'm now pretty confident that the issue relates to padding. The error doesn't occur with batch size N > 1 if the sequences are all the same length (no padding). The code fills the log-probs at positions which aren't attended to with an illegal value:
```python
INVALID_LOGPROB = 1.0
...
def forward(self, model, query_responses):
    attention_mask = query_responses != self.tokenizer.pad_token_id
    position_ids = attention_mask.cumsum(1) - attention_mask.long()
    input_ids = torch.masked_fill(query_responses, ~attention_mask, 0)
    return model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        return_dict=True,
        output_hidden_states=True,
        use_cache=False,
    )
...
output = self.forward(self.model, query_responses)
output.logits.mean().backward(retain_graph=True)
logits = output.logits[:, context_length - 1: -1]
logits /= self.args.temperature + 1e-7
new_all_logprobs = F.log_softmax(logits, dim=-1)
new_logprobs = torch.gather(new_all_logprobs, 2, responses.unsqueeze(-1)).squeeze(-1)
new_logprobs = torch.masked_fill(
    new_logprobs, padding_mask, INVALID_LOGPROB
)
```
I'm wondering whether Unsloth includes logits which aren't covered by the attention mask in the backward pass.
I'll do some more experimentation.
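To test that, I'm thinking of something along these lines - a rough sketch, with `forward`, `model`, `query_responses`, and the tokenizer as in the snippet above:

```python
import torch

attention_mask = query_responses != tokenizer.pad_token_id
output = forward(model, query_responses)
logits = output.logits

# 1) Are the logits at padded (non-attended) positions even finite?
print("pad-position logits finite:",
      torch.isfinite(logits[~attention_mask]).all().item())

# 2) Backward through attended positions only. The snippet above calls
#    output.logits.mean().backward(), which backprops through *every*
#    position; if this masked version no longer produces nan, the bad
#    gradients originate at the padded positions.
attended_mean = (logits * attention_mask.unsqueeze(-1)).sum() / attention_mask.sum()
attended_mean.backward(retain_graph=True)
```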
I found the issue and created a reproduction script! https://github.com/unslothai/unsloth/issues/533
Thanks for the investigation - I'll take a look!
Hi,
I'm currently fine-tuning llama3-instruct-8b on a custom dataset using Unsloth's FastLanguageModel, with Hugging Face's SFTTrainer to train the model. Surprisingly, the gradient norm and evaluation loss become NaN after a few steps. I've seen a blog post from Unsloth mentioning that NaNs may appear due to a bug, but it also says the bug has since been fixed by Hugging Face and Unsloth (here, under the Llama-3 Quirks section). So I not only updated Unsloth and Hugging Face but also added the "pad_token" mentioned in the blog. Despite these attempts, the NaN problem persists. Is there something else I'm missing? Can someone help me out?
Here's how I'm loading the model, followed by the training code:
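(The original snippets weren't captured above; the following is a representative sketch of this kind of Unsloth + `SFTTrainer` setup. The model name, dataset handling, and hyperparameters are assumptions, not the poster's exact code.)

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load llama3-instruct-8b in 4-bit via Unsloth (model name assumed).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,          # auto-detect bf16/fp16
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,        # the custom dataset described above
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        logging_steps=1,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)
trainer.train()
```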