
Resize embeddings, tokenizers - adding new tokens don't work #1108

Open danielhanchen opened 1 month ago

danielhanchen commented 1 month ago

From Twitter - adding new tokens to Qwen doesn't work?

# Define the new special tokens (example values; the original report did not show them)
special_tokens = ["<NEW_TOKEN_1>", "<NEW_TOKEN_2>", "<NEW_TOKEN_3>"]

# Add the special tokens to the tokenizer
num_added_tokens = tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

# Resize the model's token embeddings to match the new vocabulary size
model.resize_token_embeddings(len(tokenizer))
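
After the resize above, a common extra step (not shown in the snippet, and not necessarily what Unsloth does internally) is to initialize the newly appended embedding rows to the mean of the existing ones, since fresh rows otherwise start from arbitrary values. A minimal sketch, assuming a standard Hugging Face causal LM and that the two lines above were just run:

import torch

# Initialize the newly appended rows to the mean of the pre-existing embeddings.
with torch.no_grad():
    input_embeds = model.get_input_embeddings().weight
    old_vocab = input_embeds.shape[0] - num_added_tokens
    input_embeds[old_vocab:] = input_embeds[:old_vocab].mean(dim=0)

    # The lm_head may be a separate (untied) matrix; give it the same treatment.
    output_embeds = model.get_output_embeddings()
    if output_embeds is not None:
        output_embeds.weight[old_vocab:] = output_embeds.weight[:old_vocab].mean(dim=0)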
Selich commented 4 weeks ago

First time here,

I looked into it, and AFAIK Qwen uses a different tokenization method (byte-level BPE on UTF-8 bytes) and has its own way of handling special tokens. Qwen also reserves extra special tokens from <|extra_0|> to <|extra_204|> for custom use.

https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md
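
If the reserved slots cover the use case, a workaround is to map a custom marker onto one of the existing <|extra_*|> tokens instead of growing the vocabulary at all. A rough sketch, assuming the loaded Qwen tokenizer exposes the usual convert_tokens_to_ids API (names here are illustrative):

# Reuse a reserved Qwen token instead of adding a new one and resizing embeddings.
SCRATCH_PAD = "<|extra_0|>"                            # already in Qwen's vocabulary
scratch_id = tokenizer.convert_tokens_to_ids(SCRATCH_PAD)
print(scratch_id)                                      # a valid id; no embedding resize needed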

danielhanchen commented 4 weeks ago

I just added a new way to add new tokens - see https://github.com/unslothai/unsloth/wiki#adding-new-tokens

Unsloth has a function called add_new_tokens which allows you to add new tokens to your finetune. For example, if you want to add <CHARACTER_1>, <THINKING> and <SCRATCH_PAD>, you can do the following:

from unsloth import FastLanguageModel, add_new_tokens

model, tokenizer = FastLanguageModel.from_pretrained(...)

# Add the new tokens BEFORE attaching the LoRA adapters
add_new_tokens(model, tokenizer, new_tokens = ["<CHARACTER_1>", "<THINKING>", "<SCRATCH_PAD>"])

model = FastLanguageModel.get_peft_model(...)

Note - you MUST always call add_new_tokens before FastLanguageModel.get_peft_model!
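
A quick sanity check after add_new_tokens (a hedged sketch, not an official Unsloth recipe) is to confirm that the tokenizer length, the embedding matrix, and the new token ids all line up before training:

# Tokenizer and embedding matrix should agree after add_new_tokens.
print(len(tokenizer))
print(model.get_input_embeddings().weight.shape)       # first dim should equal len(tokenizer)
print(tokenizer.convert_tokens_to_ids("<THINKING>"))   # should be a new, valid id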

risqaliyevds commented 3 weeks ago

I added new tokens and fine-tuned, but at inference I get a tensor size mismatch error. My checkpoint is torch.Size([128259, 4096]) while my model is torch.Size([128256, 4096]). How do I fix this?
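
The mismatch suggests 3 tokens were added during training (128256 + 3 = 128259) while the model loaded at inference still has the original vocabulary. A minimal sketch of one way to line the sizes up before loading the fine-tuned weights, assuming the same tokens are re-added with the add_new_tokens call from above (model name and token strings are placeholders):

# Rebuild the inference model with the same extra tokens as training, so the
# embedding matrix is [128259, 4096] before the fine-tuned weights are loaded.
from unsloth import FastLanguageModel, add_new_tokens

model, tokenizer = FastLanguageModel.from_pretrained("BASE_MODEL_NAME")            # placeholder
add_new_tokens(model, tokenizer, new_tokens = ["<TOK_1>", "<TOK_2>", "<TOK_3>"])   # same 3 tokens as training
print(model.get_input_embeddings().weight.shape)                                   # expect [128259, 4096]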