First time here. As far as I can tell, Qwen uses a different tokenization method (BPE on UTF-8 bytes) and has its own way of handling special tokens. Qwen also reserves extra special tokens, `<|extra_0|>` through `<|extra_204|>`, for custom use.
https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md
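Since those `<|extra_*|>` tokens already exist in Qwen's vocabulary, one option is to repurpose them for custom markers instead of growing the embedding matrix. A minimal sketch (not from the thread), assuming the Qwen/Qwen-7B checkpoint and its custom tiktoken-based tokenizer loaded with `trust_remote_code=True`:

```python
# Sketch only: inspect Qwen's reserved <|extra_*|> tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code = True)

# <|extra_0|> ... <|extra_204|> already have ids in the vocabulary, so they
# can be used as custom markers without resizing the model's embeddings.
for tok in ["<|extra_0|>", "<|extra_1|>", "<|extra_204|>"]:
    print(tok, tokenizer.convert_tokens_to_ids(tok))
```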
I just added a new way to add new tokens - see https://github.com/unslothai/unsloth/wiki#adding-new-tokens
Unsloth has a function called `add_new_tokens` which allows you to add new tokens to your finetune. For example, if you want to add `<CHARACTER_1>`, `<THINKING>` and `<SCRATCH_PAD>`, you can do the following:
```python
from unsloth import FastLanguageModel, add_new_tokens

model, tokenizer = FastLanguageModel.from_pretrained(...)
add_new_tokens(model, tokenizer, new_tokens = ["<CHARACTER_1>", "<THINKING>", "<SCRATCH_PAD>"])
model = FastLanguageModel.get_peft_model(...)
```
Note - you MUST always call `add_new_tokens` before `FastLanguageModel.get_peft_model`!
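For context, here is a rough sketch of what adding tokens involves with plain Hugging Face APIs, reusing the `model` and `tokenizer` from the snippet above; the actual `add_new_tokens` implementation may differ (for example in how the new embedding rows are initialized), so treat it as illustrative only:

```python
# Illustrative only - roughly what adding tokens entails; the real Unsloth
# helper may initialize the new embedding rows differently.
new_tokens = ["<CHARACTER_1>", "<THINKING>", "<SCRATCH_PAD>"]
tokenizer.add_tokens(new_tokens, special_tokens = True)  # extend the vocabulary
model.resize_token_embeddings(len(tokenizer))            # grow embed_tokens / lm_head to match
```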
I added new tokens and fine-tuned, but at inference I get an error about the torch size. My checkpoint is torch.Size([128259, 4096]) while my model is torch.Size([128256, 4096]). How can I fix this?
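One possible explanation (a guess, not confirmed here): 128259 - 128256 = 3, so the checkpoint was saved after three tokens were added, but the model loaded at inference still has the base vocabulary. A minimal sketch of a fix, assuming the same Unsloth helpers as above, is to re-apply the exact same token additions to the freshly loaded base model before loading the finetuned weights:

```python
# Sketch only: make the base model's vocabulary match the checkpoint
# (128259 rows) before loading the finetuned weights / LoRA adapter.
from unsloth import FastLanguageModel, add_new_tokens

# "base-model-name" and the token list are placeholders - use the same base
# model and the same tokens that were added before finetuning.
model, tokenizer = FastLanguageModel.from_pretrained("base-model-name")
add_new_tokens(model, tokenizer, new_tokens = ["<CHARACTER_1>", "<THINKING>", "<SCRATCH_PAD>"])
# The embedding matrix should now have 128259 rows, so loading the finetuned
# checkpoint should no longer hit a size mismatch.
```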
From Twitter - adding new tokens to Qwen doesn't work?