unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Resizing tokenizer leads to missing end token and garbage response? #1273


Mark-DelGrande commented 1 week ago

I am using a ChatML template like this to format prompts:

def format_conversation(examples):
    # Batched map: 'conversation' holds a list of chats, each a list of turns
    conversations = examples['conversation']
    texts = []
    for convo in conversations:
        conversation_text = ''
        for turn in convo:
            role = turn['role']
            content = turn['content']
            # Format each turn using ChatML
            if role == 'user':
                conversation_text += f"<|im_start|>user\n{content}<|im_end|>\n"
            elif role == 'assistant':
                conversation_text += f"<|im_start|>assistant\n{content}<|im_end|>\n"
        texts.append(conversation_text)
    return {'text': texts}
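
For context, I apply this as a batched map over a dataset with a `conversation` column, roughly like this (the dataset source here is illustrative):

from datasets import load_dataset

# Illustrative dataset; anything with a 'conversation' column of
# [{'role': ..., 'content': ...}] turns works the same way
dataset = load_dataset("json", data_files="conversations.json", split="train")
dataset = dataset.map(format_conversation, batched=True)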

When I used to resize with this code, I got back this response:

# Register the ChatML markers as special tokens, then grow the embedding matrix to match
special_tokens_dict = {'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}
tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

Embedding(128258, 4096)

Now I am getting back:

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
Embedding(128258, 4096, padding_idx=128004)

Not sure if this is related, but it feels like it might be. I tried setting `mean_resizing=False`, and it still gave me back:

Embedding(128258, 4096, padding_idx=128004)
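
For reference, `mean_resizing` is a keyword argument of `resize_token_embeddings` in recent transformers versions; the call I tried looks roughly like this:

# Skip the mean/covariance-matched initialization mentioned in the warning
# and fall back to the old initialization for the new embedding rows
model.resize_token_embeddings(len(tokenizer), mean_resizing=False)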

After fine-tuning Llama 3.1 with the same code, my responses went from something like this:

Description: Active use case
Time left: 12:00

This is what I would like to get out of it, and it looks like the data I fine-tuned on, but instead it has become:

Description: Active use case
Time left: 12:00actionDate
.<|end_of_text|><|begin_of_text|>://
<|end_of_text|><|begin_of_text|>://
�
�
팬
<|end_of_text|><|begin_of_text|>://
�
<|end_of_text|><|begin_of_text|>://
ி
고
")));<|end_of_text|><|begin_of_text|>://
<|end_of_text|><|begin_of_text|>://
<|end_of_text|><|begin_of_text|>://
안
"]);<|end_of_text|><|begin_of_text|>://
토크
��
<|end_of_text|><|begin_of_text|>://
t
o
")));<|end_of_text|><|begin_of_text|>://
y
i
"]);<|end_of_text|><|begin_of_text|>://
현재
")));<|end_of_text|><|begin_of_text|>://
<|end_of_text|><|begin_of_text|>://

프
")));<|end_of_text|><|begin_of_text|>://
<|end_of_text|><|begin_of_text|>://
")));<|end_of_text|><|begin_of_text|>://
멘
")));"),"...
�
actionDate
 ActiveForm

Does anyone have any idea whether something changed, and how I can get my end token to be caught again?

danielhanchen commented 1 week ago

Oh wait, please use https://github.com/unslothai/unsloth/wiki#adding-new-tokens, i.e.

model, tokenizer = FastLanguageModel.from_pretrained(...)
from unsloth import add_new_tokens
# Add new tokens after loading the model but before attaching the LoRA adapters
add_new_tokens(model, tokenizer, new_tokens = ["<CHARACTER_1>", "<THINKING>", "<SCRATCH_PAD>"])
model = FastLanguageModel.get_peft_model(...)
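
For the ChatML markers from this issue, that would presumably look like the sketch below; the model name and LoRA settings are illustrative, not prescribed by the wiki:

from unsloth import FastLanguageModel, add_new_tokens

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",  # illustrative base model
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Register the ChatML markers before get_peft_model
add_new_tokens(model, tokenizer, new_tokens = ["<|im_start|>", "<|im_end|>"])

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # illustrative LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      # assumption: also train the resized embeddings and lm_head
                      "embed_tokens", "lm_head"],
)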