unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Adding New Tokens #1223

Open StrangePineAplle opened 3 weeks ago

StrangePineAplle commented 3 weeks ago

Hello, thank you for your work first. I'm trying to add a few tokens to fine-tune the model afterward, but I'm facing a few errors. First, I loaded the model:

max_seq_length = 4096  # change to the maximum length in your dataset
dtype = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.
model_name = "unsloth/Meta-Llama-3.1-8B"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

Then I added tokens:

new_tokens = ["<|START_COEFFICIENTS|>", "<|END_COEFFICIENTS|>", "<|SPACE_COEFFICIENTS|>",
              "<|START_GENES|>", "<|END_GENES|>", "<|SPACE_GENES|>"]

add_new_tokens(model, tokenizer, new_tokens=new_tokens)
model.resize_token_embeddings(len(tokenizer))

But I got an error:

RuntimeError: Setting requires_grad=True on inference tensor outside InferenceMode is not allowed.

Then I initialized the QLoRA model and trained it. If instead I add the tokens after wrapping the model with QLoRA:

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",      # Supports any, but = "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None, # And LoftQ
)

new_tokens = ["<|START_COEFFICIENTS|>", "<|END_COEFFICIENTS|>", "<|SPACE_COEFFICIENTS|>",
              "<|START_GENES|>", "<|END_GENES|>", "<|SPACE_GENES|>"]

add_new_tokens(model, tokenizer, new_tokens=new_tokens)
model.resize_token_embeddings(len(tokenizer))

then the first error goes away, but I get this error in the training function:

RuntimeError: Inference tensors cannot be saved for backward. To work around you can make a clone to get a normal tensor and use it in autograd.

I am very confused by this. Can you explain where I should add the new tokens, or should I use the special reserved tokens instead?

danielhanchen commented 3 weeks ago

Oh you must always call add_new_tokens BEFORE .get_peft_model!!! See https://github.com/unslothai/unsloth/wiki#adding-new-tokens
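For reference, a sketch of the corrected order, reusing the model and LoRA settings from the snippets above. This assumes unsloth's `add_new_tokens` helper as described in the wiki; the wiki also suggests adding `embed_tokens` and `lm_head` to `target_modules` so the new token embeddings actually get trained (this example cannot run without a GPU and the model weights, so treat it as a sketch):

```python
from unsloth import FastLanguageModel, add_new_tokens

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=True,
)

# 1) Add tokens to the BASE model, before any PEFT wrapping.
#    add_new_tokens resizes the embeddings itself, so no separate
#    model.resize_token_embeddings(...) call is needed.
add_new_tokens(model, tokenizer, new_tokens=[
    "<|START_COEFFICIENTS|>", "<|END_COEFFICIENTS|>", "<|SPACE_COEFFICIENTS|>",
    "<|START_GENES|>", "<|END_GENES|>", "<|SPACE_GENES|>",
])

# 2) Only now wrap with LoRA. Since new rows were appended to the
#    embedding matrix, train embed_tokens and lm_head as well.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",
                    "embed_tokens", "lm_head"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
```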

StrangePineAplle commented 3 weeks ago

Thanks, but what about changing the embedding size of the model after adding a new token to the tokenizer? Do I need to do it, or does it happen inside add_new_tokens?

Erland366 commented 3 weeks ago

Yes, it happens inside add_new_tokens :D
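For intuition, resizing amounts to appending one new row per added token to the embedding matrix and initializing those rows from the existing embeddings. The toy sketch below (plain Python, no ML libraries) uses mean initialization as an illustrative assumption; the exact scheme inside unsloth's `add_new_tokens` may differ:

```python
# Toy illustration: growing an embedding matrix when tokens are added,
# initializing each new row to the column-wise mean of the existing rows.

def resize_embeddings(embedding_matrix, n_new_tokens):
    """Append n_new_tokens rows, each set to the column-wise mean."""
    n_rows = len(embedding_matrix)
    dim = len(embedding_matrix[0])
    mean_row = [sum(row[j] for row in embedding_matrix) / n_rows
                for j in range(dim)]
    return embedding_matrix + [list(mean_row) for _ in range(n_new_tokens)]

old = [[1.0, 2.0], [3.0, 4.0]]   # vocab of 2 tokens, embedding dim 2
new = resize_embeddings(old, 3)  # add 3 new tokens
print(len(new))                  # 5 rows now
print(new[2])                    # [2.0, 3.0] -> mean of the old rows
```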

StrangePineAplle commented 3 weeks ago

I know this isn't the best place to ask, but I have one more question about the tokenizer itself. I want to add custom tokens to mark the start, end, and separator between table data that I'm adding to the prompt when fine-tuning the model. I understand that custom tokens are helpful for specific terms, but can they also help the model better understand data structure? I couldn't find any direct answers to that, so I would be endlessly grateful for any information about this and any other use cases for custom tokens.

Erland366 commented 3 weeks ago

I think it's generally hard to add new tokens, because the pretraining phase consumes trillions of tokens and the model has never seen your tokens among them. If possible, just use something like JSON, or add a lot of data so the model can learn the new tokens.
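To illustrate the JSON suggestion: instead of inventing structural tokens like `<|START_COEFFICIENTS|>`, the table data can be serialized with delimiters the model already saw countless times during pretraining. A minimal sketch; the field names and values are made up for illustration:

```python
import json

# Hypothetical table row; keys and values are illustrative only.
record = {
    "coefficients": [0.12, -3.4, 7.8],
    "genes": ["BRCA1", "TP53"],
}

# JSON braces, quotes, and commas are structure the model already
# "knows", so no new tokens (or embedding resize) are needed.
prompt = "Analyze the following data:\n" + json.dumps(record, indent=2)
print(prompt)
```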

alexkstern commented 3 days ago

Hey @StrangePineAplle, I am currently also working on a fine-tune using a few new tokens for a classification problem. However, I am getting exploding gradients and the loss just goes up. Did you face these same problems? Thanks :)