
ValueError: Unsloth: Untrained tokens found, but embed_tokens & lm_head not trainable, causing NaNs. when finetuning Llama 3 on a custom dataset #658

Closed liwd190019 closed 4 months ago

liwd190019 commented 4 months ago

I want to apply Llama 3 to a multi-turn dialogue task, so I was trying to finetune it on one of my custom datasets, which I built by simply extracting all the dialogue contents from SODA and reformatting them into Llama 3's chat template.

The resulting dataset looks quite similar to the original one, but when I tried to train the model, I got the following error:

ValueError: Unsloth: Untrained tokens found, but embed_tokens & lm_head not trainable, causing NaNs. Restart then add `embed_tokens` & `lm_head` to `FastLanguageModel.get_peft_model(target_modules = [..., "embed_tokens", "lm_head",])`

As the message indicates, the problem is that some untrained tokens were found. Although the error can be fixed by following the hint, I just can't figure out why my dataset introduces those untrained tokens in the first place. After all, the two datasets (my custom one and the default one in the notebook) look very similar.

Here is a link to the Colab; feel free to comment and give advice!
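
For reference, the fix the error message hints at looks roughly like the following. This is only a minimal sketch based on the standard Unsloth notebooks; the checkpoint name and LoRA hyperparameters are placeholders, not recommendations:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # assumed base checkpoint; replace with the one you use
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Adding embed_tokens and lm_head makes the untrained token embeddings trainable,
# which is what the error message asks for.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head"],
    use_gradient_checkpointing = "unsloth",
)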

liwd190019 commented 4 months ago

Also, can anyone suggest a good way to run inference to test the finetuned multi-turn chatbot? Currently, the chatbot inference code in the notebook only supports a single turn.

from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# `model` and `tokenizer` are the ones returned by FastLanguageModel.from_pretrained(...)

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"from": "human", "value": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)
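
For multi-turn testing, one option is to append the generated reply back into messages and re-apply the chat template before generating again. A rough sketch continuing the code above (the follow-up question is made up for illustration):

# Keep only the newly generated tokens as the assistant's reply.
reply = tokenizer.batch_decode(outputs[:, inputs.shape[1]:], skip_special_tokens = True)[0]

# Append the reply and the next user turn, using the same ShareGPT-style keys as above.
messages.append({"from": "gpt",   "value": reply})
messages.append({"from": "human", "value": "Now continue it for three more numbers."})

# Re-encode the whole conversation and generate the next reply.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")
outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
print(tokenizer.batch_decode(outputs[:, inputs.shape[1]:], skip_special_tokens = True)[0])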

sumukshashidhar commented 4 months ago

I started to get this error today too, strange...

dmitrii-palisaderesearch commented 4 months ago

Try using the Instruct model ("unsloth/llama-3-8b-Instruct-bnb-4bit").
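
Concretely, that is just a different model_name in from_pretrained; a minimal sketch (max_seq_length and dtype are placeholder values):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit", # instruct checkpoint instead of the base model
    max_seq_length = 2048,
    dtype = None,        # auto-detect
    load_in_4bit = True,
)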

danielhanchen commented 4 months ago

@sumukshashidhar @liwd190019 Apologies! As @dmitrii-palisaderesearch mentioned, please use the instruct version. The base version will error out, since Unsloth automatically checks whether some token embeddings are all zeros. If you still want to use the base model, either do not use the llama-3 chat template (just use Alpaca), or also train lm_head and embed_tokens.

enesbol commented 3 months ago

@danielhanchen Do you know why it started to throw this error when it was successfully training with the base model a couple of months ago?

danielhanchen commented 3 months ago

I added a check in Unsloth to detect whether your embeddings are untrained - I might have to change the logic, actually.
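
For anyone who wants to see which tokens trip this check, something like the snippet below inspects the input embedding matrix for all-zero rows. It is only an illustrative sketch of the idea, not the actual check Unsloth runs, and find_untrained_tokens is a made-up helper name:

import torch

def find_untrained_tokens(model, tokenizer, atol = 1e-16):
    # Rows of the input embedding matrix that are (almost) entirely zero
    # correspond to tokens the checkpoint never trained.
    embed = model.get_input_embeddings().weight          # (vocab_size, hidden_dim)
    zero_rows = embed.abs().amax(dim = 1) <= atol
    ids = torch.nonzero(zero_rows).flatten().tolist()
    return [(i, tokenizer.convert_ids_to_tokens(i)) for i in ids]

print(find_untrained_tokens(model, tokenizer))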

milsun commented 1 month ago

Any update on this? I am getting this error too for a base model.

paraschopra commented 1 month ago

@danielhanchen I am getting this error while following your notebook on instruct finetuning (I am not using the base model). My dataset contains LaTeX symbols; could that be the cause?

What do you mean by "if your embeddings are untrained"?

My data is generated from Llama only.

thusinh1969 commented 1 month ago

Same here. I was finetuning Llama-3.1-8B with QLoRA as usual, then I used Unsloth to continue the training with a longer context.

There are actually two problems:

1. Cannot resume using UnslothTrainer:

from unsloth import FastLanguageModel, UnslothTrainer, UnslothTrainingArguments

# data_args, training_args, torch_dtype, train_dataset, optimizer_adamw and
# lr_scheduler_adamw are defined elsewhere in my training script.
max_seq_length = data_args.max_seq_length
dtype = torch_dtype
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = training_args.pretrain_qlora_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

unsloth_args = UnslothTrainingArguments(
          per_device_train_batch_size = training_args.per_device_train_batch_size,
          gradient_accumulation_steps = training_args.gradient_accumulation_steps,
          max_steps     = training_args.max_steps,
          save_steps    = training_args.save_steps,
          logging_steps = training_args.logging_steps,
          warmup_steps  = training_args.warmup_steps,
          save_total_limit= training_args.save_total_limit,
          num_train_epochs = training_args.num_train_epochs,
          learning_rate = training_args.learning_rate,
          lr_scheduler_type = training_args.lr_scheduler_type,
          weight_decay = training_args.weight_decay,
          gradient_checkpointing = training_args.gradient_checkpointing,
          embedding_learning_rate = training_args.embedding_learning_rate,
          fp16 = training_args.fp16,
          bf16 = training_args.bf16,
          tf32 = True,
          seed = 3407,
          output_dir = training_args.output_dir,
          dataloader_num_workers = training_args.dataloader_num_workers,
          ddp_find_unused_parameters = training_args.ddp_find_unused_parameters,
          overwrite_output_dir = training_args.overwrite_output_dir,
          ignore_data_skip = training_args.ignore_data_skip, # Important to start Unsloth from datapoint 0
          prediction_loss_only = training_args.prediction_loss_only,
          evaluation_strategy  = training_args.evaluation_strategy
      )

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    args = unsloth_args,
    train_dataset = train_dataset,
    #dataset_text_field = "text",
    max_seq_length = data_args.max_seq_length,
    dataset_num_proc = 8,
    packing = True, # Can make training 5x faster for short sequences.
    optimizers=(optimizer_adamw, lr_scheduler_adamw),
)

resume_ckp = "./QLora_finetune_with_embed_lmhead/checkpoint-1200"
train_result = trainer.train(resume_from_checkpoint = resume_ckp)

--> ValueError: Unsloth: Untrained tokens of [[]] found, but embed_tokens & lm_head not trainable, causing NaNs. Restart then add embed_tokens & lm_head to FastLanguageModel.get_peft_model(target_modules = [..., "embed_tokens", "lm_head",]).Are you using the base model? Instead, use the instruct version to silence this warning.

Any idea why? It was working fine a few days ago, I think!

2. Weird warning when resuming with Transformers' Trainer instead of UnslothTrainer: as a side note, resuming with Transformers' Trainer works, but I also got a warning that the RNG file was not found, even though the two files "rng_state_0.pth" and "rng_state_1.pth" were present in the checkpoint folder! The resume continues to train, but who knows whether the quality is OK, so we stopped. The whole resume path also needs a look please, @danielhanchen.

Thanks, Steve

danielhanchen commented 4 weeks ago

@milsun Apologies, I saw you also commented on the other issue - I hope it got partially resolved. Sorry for the delay as well!

@paraschopra It's possible extra symbols might be breaking the tokenizer, but I'm unsure - these can cause NaN gradients, so I error out. Could you open a separate issue so I can look into it more - thanks!

@thusinh1969 Apologies for the delay - I saw your other issue as well!