unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

fix/load-checkpoint-add-new-tokens #1225

Open Erland366 opened 3 weeks ago

Erland366 commented 3 weeks ago

https://github.com/unslothai/unsloth/issues/1215

Given the issue above, where we can't immediately use the changed vocab size because of the size mismatch between the adapter and the base model, we need to resize the base model before merging the LoRA into it.
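
A minimal sketch of the idea (not the exact unsloth code path, and the model/checkpoint names are placeholders): the base model's embeddings are resized to the checkpoint tokenizer's length before the adapter is attached, so the shapes line up when merging.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Base model and the tokenizer that was saved with the added tokens.
base_model = AutoModelForCausalLM.from_pretrained("base-model-name")
tokenizer = AutoTokenizer.from_pretrained("lora-checkpoint-dir")

# Resize the base embeddings to match the vocab the adapter was trained with.
base_model.resize_token_embeddings(len(tokenizer))

# Now the adapter's embed_tokens / lm_head shapes match, so merging works.
model = PeftModel.from_pretrained(base_model, "lora-checkpoint-dir")
model = model.merge_and_unload()
```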

Note that this also needs changes to unsloth-zoo, for which I have opened a separate PR:

https://github.com/unslothai/unsloth-zoo/pull/9

Erland366 commented 3 weeks ago

I'd like some discussion about the embeddings, though, since I did not implement a way to specify the method used to extend them. For example, the user may choose interpolation when training the embeddings; then, when we load the checkpoint and resize the base model again, we need to make sure the resize method is the same as the one used in training.

Maybe we can store the method as an additional parameter in model.config? Then we can pass it along when we load the checkpoint and resize.
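
Something like the sketch below, assuming a custom config field; the key name `unsloth_embedding_resize_method` is purely hypothetical, not an existing attribute.

```python
from transformers import AutoConfig

# At training time, after extending the embeddings (e.g. via interpolation),
# record which method was used so it can be reproduced later.
model.config.unsloth_embedding_resize_method = "interpolation"  # hypothetical key

# At load time, before resizing the base model and merging the adapter,
# read the recorded method back (falling back to a default if absent).
config = AutoConfig.from_pretrained("lora-checkpoint-dir")
resize_method = getattr(config, "unsloth_embedding_resize_method", "mean")
# ...then pass `resize_method` to whatever routine extends the embeddings.
```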

Erland366 commented 3 weeks ago

Also, while here: it seems the value of tokenizer.vocab_size is unchanged when we do add_new_tokens. Does tokenizer.vocab_size only count non-special tokens, and since we add all of the new tokens as special tokens, that's why the attribute value does not increase?
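
A quick check of this behaviour with a plain Hugging Face tokenizer (the model name is just an example): vocab_size reports the base vocabulary only, while len(tokenizer) also counts added tokens.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")

before = (tokenizer.vocab_size, len(tokenizer))
tokenizer.add_special_tokens({"additional_special_tokens": ["<|new_token|>"]})
after = (tokenizer.vocab_size, len(tokenizer))

# vocab_size stays the same; len(tokenizer) grows by the number of added tokens.
print(before, after)
```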

Erland366 commented 3 weeks ago

https://colab.research.google.com/drive/1xBxY_L48Lzu5SJjukPExgoWVthoyTGCA?usp=sharing

Colab notebook reproducing this fix.