huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Adding New Vocabulary Tokens to the Models #1413

Closed vyraun closed 5 years ago

vyraun commented 5 years ago

❓ Questions & Help

Hi,

How could I extend the vocabulary of the pre-trained models, e.g. by adding new tokens to the lookup table?

Any examples demonstrating this?

LysandreJik commented 5 years ago

Hi, I believe this method does exactly what you're looking for: add_tokens. There's an example right below it.

vyraun commented 5 years ago

thanks @LysandreJik! Yes, that's exactly what I was looking for. A follow-up question: how could I initialize the embeddings of these "new tokens" to something I have already pre-computed? I assume that currently the embeddings for these new tokens will be randomly initialized.

LysandreJik commented 5 years ago

You are right, these tokens will be randomly initialized. What I would do if I wanted to assign new values to these embeddings (as an initialization) is to directly change the embedding weights. Here's an example with BertModel.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

print(len(tokenizer))  # 28996
tokenizer.add_tokens(["NEW_TOKEN"])
print(len(tokenizer))  # 28997

model.resize_token_embeddings(len(tokenizer)) 
# The new vector is added at the end of the embedding matrix

print(model.embeddings.word_embeddings.weight[-1, :])
# Randomly initialized vector for the new token

model.embeddings.word_embeddings.weight[-1, :] = torch.zeros([model.config.hidden_size])

print(model.embeddings.word_embeddings.weight[-1, :])
# outputs a vector of zeros of shape [768]
vyraun commented 5 years ago

thanks @LysandreJik ! That should solve it quite neatly. I will reopen the issue in case I run into any issues.

celsofranssa commented 4 years ago

Hello @LysandreJik ,

What is the difference between the following approaches?

  1. to train a tokenizer from scratch, as described in the Hugging Face blog; or
  2. to use the add_tokens method?

Thank you in advance.

LysandreJik commented 4 years ago

Training a tokenizer from scratch would imply training a model from scratch as well - depending on the corpus used for the tokenizer, the tokens may be entirely different from another model's tokens trained on a similar corpus (except if you train the tokenizer using the exact same method and the exact same data).

Adding tokens adds tokens at the end of the tokenizer's vocabulary, essentially extending the vocabulary. The model's embedding matrix would need to be resized as well to take into account the new tokens, but all the other tokens would keep their representation as-is. Seeing as the new rows in the embedding matrix are randomly initialized, you would still need to fine-tune the model to a dataset containing such tokens.

PieterDujardin commented 4 years ago

@LysandreJik I have a Dutch medical dataset (for Named Entity Recognition) which contains a lot of domain-specific words, so the Dutch BERT tokenizer outputs a lot of [UNK] tokens when it tokenizes. Given that I have a corpus of 60k labelled tokens, and right now also a relatively small annotated corpus of 185k tokens, would it be best to:

Thanks!

vinayannam commented 4 years ago

> (quoting @LysandreJik's explanation above of training a tokenizer from scratch vs. add_tokens)

Hey, I would like to fine-tune the model on a dataset containing such tokens, as you suggested at the end. Can you help me out with how to do that?

crispin-nosidam commented 4 years ago

If I add unknown tokens to the tokenizer and train the model on, say, sentence-pair similarity, I suppose the new tokens' embeddings will not start out with the correct relationship to the other tokens. Will the model output still be able to find similarity correctly, given sufficient training?

JensMadsen commented 4 years ago

@LysandreJik Thank you for your suggestion. However, I run into trouble because altering the embedding turns the embedding tensor into a non-leaf tensor, which hence cannot be optimized, i.e.

model.embeddings.word_embeddings.weight.is_leaf # False

I cannot figure out how to fix this (I am torch beginner; sorry). Do you have any suggestions?

vjagannath786 commented 4 years ago

Facing the same issue; getting False for is_leaf.

HenryPaik1 commented 3 years ago

BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True).get_vocab() does not return the added tokens. How can I check whether a new token was properly added to the vocab dictionary?
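(For reference, added tokens are kept in a separate map from the base vocabulary; one minimal way to inspect them, assuming a reasonably recent transformers release, is get_added_vocab():)

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["NEW_TOKEN"])

print(tokenizer.get_added_vocab())                    # expected: {'NEW_TOKEN': 30522}, i.e. only the added tokens
print(tokenizer.convert_tokens_to_ids("NEW_TOKEN"))   # resolves to the new id rather than the [UNK] id
print(len(tokenizer))                                 # base vocab size plus the number of added tokens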

ReySadeghi commented 3 years ago

> (quoting @LysandreJik's BertModel example above)

Hi, I tried this, but my code still gets stuck in the sentence-tokenization step and doesn't get past it. It may be lagging or have a problem... what should I do?

zellford commented 3 years ago

> (quoting @LysandreJik's BertModel example and @ReySadeghi's question above)

Have you solved the problem? If so, can you share it with us?

ReySadeghi commented 3 years ago

> (quoting the exchange above)

Yes, it was because it takes a very long time to add all the tokens. I installed transformers from source (pip install -U git+https://github.com/huggingface/transformers), since a PR that should speed this up dramatically was recently merged, and my problem was solved.

zellford commented 3 years ago

thank you!


ptheru commented 3 years ago

> (quoting @LysandreJik's explanation above of training a tokenizer from scratch vs. add_tokens)

Why can't we repurpose the existing 999 unused ([unusedN]) tokens instead of extending the vocab size? https://github.com/google-research/bert/issues/9#issuecomment-434796704
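One workaround sometimes suggested for this, sketched here as an assumption rather than an official API, is to overwrite one of the [unusedN] placeholder entries so that the vocabulary size, and therefore the embedding matrix, stays unchanged. With the slow BertTokenizer, whose vocab is a plain dict, something along these lines should work:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Re-point a placeholder entry to a new domain word ("paracetamol" is a hypothetical example).
unused_id = tokenizer.vocab["[unused1]"]
del tokenizer.vocab["[unused1]"]
tokenizer.vocab["paracetamol"] = unused_id
tokenizer.ids_to_tokens[unused_id] = "paracetamol"

print(tokenizer.tokenize("paracetamol"))               # expected: ['paracetamol']
print(tokenizer.convert_tokens_to_ids("paracetamol"))  # expected: the old [unused1] id

tokenizer.save_pretrained("./bert-with-repurposed-tokens")  # writes the updated vocab.txt

Since the vocabulary size is unchanged, no resize_token_embeddings call is needed; the embedding row for the repurposed slot still has to be learned during fine-tuning.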

KairaNithin commented 3 years ago

> (quoting @LysandreJik's BertModel example above)

@LysandreJik when I ran your code, the following error popped up. Please help:

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

cm107 commented 3 years ago

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

You can fix that error by temporarily disabling gradient calculation. (Because initializing the weights is not an operation that needs to be accounted for in backpropagation.)

with torch.no_grad():
    model.embeddings.word_embeddings.weight[-1, :] = torch.zeros([model.config.hidden_size])
arkhan19 commented 2 years ago

Why hidden_size? Is that specific to just the BERT model? For ALBERT it should be different, right?

pratikchhapolika commented 2 years ago

How do we initialise the embeddings for new tokens from the pre-existing embeddings of the old sub-tokens they were split into?

Kirili4ik commented 2 years ago

Why hidden_size? Is that specific to just the BERT model? For ALBERT it should be different, right?

Hi, yes, I believe the name can vary from model to model. For the T5 model it seems to be d_model.
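A model-agnostic alternative, sketched here on the assumption that the model is any transformers PreTrainedModel, is to read the width off the embedding matrix itself rather than guessing the config attribute name:

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")   # any model works the same way
embedding_layer = model.get_input_embeddings()
embedding_dim = embedding_layer.weight.shape[1]        # no need to know whether the config calls it hidden_size or d_model

with torch.no_grad():
    embedding_layer.weight[-1, :] = torch.zeros(embedding_dim)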

Kirili4ik commented 2 years ago

How do we initialise the embeddings for new tokens from the pre-existing embeddings of the old sub-tokens they were split into?

If I understand you correctly, we can initialise the new tokens from already pre-trained ones by taking the mean of them:

with torch.no_grad():
    for i, token in enumerate(reversed(added_tokens), start=1):
        tokenized = tokenizer.tokenize(token)
        tokenized_ids = tokenizer.convert_tokens_to_ids(tokenized)
        model.embeddings.word_embeddings.weight[-i, :] = model.embeddings.word_embeddings.weight[tokenized_ids].mean(axis=0)
pratikchhapolika commented 2 years ago

> (quoting the question and @Kirili4ik's mean-initialization reply above)

Ok. Thank you. Is this also correct?

model.resize_token_embeddings(len(tokenizer))
weights = model.roberta.embeddings.word_embeddings.weight

# initialize new embedding weights as mean of original tokens
with torch.no_grad():
    emb = []
    for i in range(len(joined_keywords)):
        word = joined_keywords[i]
        # first & last tokens are just string start/end; don't keep
        tok_ids = tokenizer_org(word)["input_ids"][1:-1]
        tok_weights = weights[tok_ids]

        # average over tokens in original tokenization
        weight_mean = torch.mean(tok_weights, axis=0)
        emb.append(weight_mean)
    weights[-len(joined_keywords):,:] = torch.vstack(emb).requires_grad_()
pratikchhapolika commented 2 years ago

How should I save the new tokenizer so that I can use it in a downstream model?

tokenizer_org = tr.BertTokenizer.from_pretrained("/home/pc/bert_base_multilingual_uncased")
tokenizer.add_tokens(joined_keywords)
model = tr.BertForMaskedLM.from_pretrained("/home/pc/bert_base_multilingual_uncased", return_dict=True)

# prepare input
text = ["Replace me by any text you'd like"]
encoded_input = tokenizer(text, truncation=True, padding=True, max_length=512, return_tensors="pt")
print(encoded_input)

# add embedding params for new vocab words
model.resize_token_embeddings(len(tokenizer))
weights = model.bert.embeddings.word_embeddings.weight

# initialize new embedding weights as mean of original tokens
with torch.no_grad():
    emb = []
    for i in range(len(joined_keywords)):
        word = joined_keywords[i]
        # first & last tokens are just string start/end; don't keep
        tok_ids = tokenizer_org(word)["input_ids"][1:-1]
        tok_weights = weights[tok_ids]

        # average over tokens in original tokenization
        weight_mean = torch.mean(tok_weights, axis=0)
        emb.append(weight_mean)
    weights[-len(joined_keywords):,:] = torch.vstack(emb).requires_grad_()

model.to(device)

trainer.save_model("/home/pc/Bert_multilingual_exp_TCM/model_mlm_exp1")

It saves the model, config, and training_args. How do I save the new tokenizer as well?
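A minimal sketch, assuming the tokenizer variable from the snippet above: the tokenizer has its own save_pretrained, separate from trainer.save_model, and the resulting directory can be reloaded with from_pretrained:

tokenizer.save_pretrained("/home/pc/Bert_multilingual_exp_TCM/model_mlm_exp1")
# later, in the downstream code:
# tokenizer = tr.BertTokenizer.from_pretrained("/home/pc/Bert_multilingual_exp_TCM/model_mlm_exp1")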

ShaoMinLiu-Holmusk commented 2 years ago

I am not sure if anyone can help answer this here, but I cannot seem to find an answer anywhere: what exactly is the difference between a "token" and a "special token"?

I understand the following:

  • what is a typical token
  • what is a typical special token: MASK, UNK, SEP, etc
  • when do you add a token (when you want to expand your vocab)

What I don't understand is: in what circumstances would you want to create a new special token? Are there any examples of what we need it for, or of when we would want to create a special token other than the default ones? If an example uses a special token, why can't a normal token achieve the same objective?

tokenizer.add_tokens(['[EOT]'], special_tokens=True)

I also don't quite understand the following description in the documentation. What difference does it make to our model if we set add_special_tokens to False?

add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
Giovani-Merlin commented 2 years ago

> (quoting @ShaoMinLiu-Holmusk's question above about tokens vs. special tokens)

When you add a "special token", it will not be replaced by [MASK] or by a random word in the pre-training procedure.
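To make the add_special_tokens flag concrete, a small illustration (assuming bert-base-uncased): the flag only controls whether the model's own markers, such as [CLS] and [SEP], get wrapped around the encoded text.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.convert_ids_to_tokens(tokenizer.encode("hello world")))
# ['[CLS]', 'hello', 'world', '[SEP]'] -- special tokens added by default

print(tokenizer.convert_ids_to_tokens(tokenizer.encode("hello world", add_special_tokens=False)))
# ['hello', 'world'] -- same text, no model-specific special tokens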

TeKett commented 10 months ago

> (quoting @LysandreJik's BertModel example above)

Has anything changed in the past 4 years, and how would one do this with a custom / self-trained / specialised model? I want to add some more tokens to help with training and prompting, so that it doesn't split words it doesn't know into multiple tokens and, in turn, damage concepts it already knows or generate garbage.

ArthurZucker commented 10 months ago

Hey! Nothing much is different in terms of code; we leave it to the user to define the new embeddings, but a bunch of tutorials give good ideas of how to do this well: https://nlp.stanford.edu/~johnhew/vocab-expansion.html

SuperBruceJia commented 10 months ago

> (quoting @LysandreJik's BertModel example above)

Doing this, I received a warning: Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Should we write specific code to fine-tune the word embeddings?

Thank you very much!

Best regards,

Shuyue Nov. 27th, 2023

TeKett commented 10 months ago

Hey! Nothing much is different in terms of code, we leave it to the user to define the new embeddings, but bunch of tutorials give good ideas of how to do this well: https://nlp.stanford.edu/~johnhew/vocab-expansion.html

Cool, but I barely know programming, especially not Python, nor the fancy stuff written and used by tech wizards. To me this is like magic. I use A1111 to generate images and I train checkpoints with Kohya. Where does this fit into that so I can add more tokens? How and where do I get the tokenizer and model; is it the same as CLIP?

I tried to ask a similar question specifically about SD on the SD subreddit, but the spam filter keeps deleting my post with no explanation as to why. No wonder I don't find anyone asking this question if Reddit deletes all posts about it.

With the huge influx of new people into AI in the past year, I feel this stuff needs to be more readily available, and not "write these thousand lines of Python code from this 2-line hint", possibly by having either a program or a plugin where you can just give it a text file. It feels like that would be more convenient even for the veterans, rather than gatekeeping it behind learning Python for 5 years. I am not asking you to write it for me, and Google might just not be giving me what I search for, but someone has to have made a program for adding new tokens to a checkpoint by now. I can't be the only person who would want to do this.

ArthurZucker commented 10 months ago

Hey! Totally feel your pain, and sorry for the hole in the documentation for this! I'll open a pull request to make this more visible. Here is a snippet of how to do this:

from transformers import GPT2Tokenizer, GPT2Model
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
num_new_tokens = tokenizer.add_tokens(["new_token_1", "new_token_2"], special_tokens=True)

# simple resize
model.resize_token_embeddings(len(tokenizer))

# overwrite the content to have better results
input_embeddings = model.wte.weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg

hope it helps! 🤗
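A possible follow-up to the snippet above, sketched under the assumption that the checkpoint also has an output projection (e.g. an LM-head model with untied weights): the output embeddings may need the same averaging treatment.

output_embeddings = model.get_output_embeddings()   # None for the bare GPT2Model, an nn.Linear for LM-head models
if output_embeddings is not None:
    out = output_embeddings.weight.data
    out[-num_new_tokens:] = out[:-num_new_tokens].mean(dim=0, keepdim=True)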

SuperBruceJia commented 10 months ago

> (quoting @ArthurZucker's GPT-2 snippet above)

Thank you very much for your suggestion!

Best regards,

Shuyue Nov. 29th, 2023

TeKett commented 10 months ago

So I tried having a crack at this but am struggling a lot. It is more complicated since what I'm looking for is stored in a .safetensors file. I spent a few hours trying to do this with different AI chats, with no results. The following code is all I got. I don't know how to actually get the state_dict, due to the safetensors format complicating things, or how to save what I have done back to a fully functional safetensors file. Does someone want to help me with this?

from transformers import CLIPTokenizer, CLIPModel

# Load the tokenizer and model
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Path to the .safetensor file
safetensor_file = 'checkpoint.safetensors'

# Load the model parameters from the .safetensor file
state_dict = ?????????????????????????????????????????

# Update the model's state_dict with the loaded parameters
model.load_state_dict(state_dict)

# Path to the text file containing new tokens (one token per line)
new_tokens_file = 'tokens.txt'

# Read the tokens from the text file
with open(new_tokens_file, encoding="utf8") as file:
    tokens = [line.strip() for line in file if line.strip()]

# Add the tokens
for token in tokens:
    tokenizer.add_tokens([token])

model.resize_token_embeddings(len(tokenizer))

# Save the updated .safetensor file with the modified vocabulary
??????????????????????????????????????
ArthurZucker commented 10 months ago

@TeKett I don't understand what you are trying to do. Could you explain it to me so that I can give you the exact code snippet?

TeKett commented 10 months ago

@TeKett I don't understand what you are trying to do. Could you explain it to me so that I can give you the exact code snippet?

I want to add more tokens/vocabulary to a model, in this case CLIP, that's packaged inside an SD 1.5 checkpoint of the safetensors variety. The safetensors file contains the VAE, UNet, and CLIP.

ArthurZucker commented 10 months ago
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
import torch 

pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

num_new_tokens = pipeline.tokenizer.add_tokens(["new_token_1", "new_token_2"], special_tokens=True)

# simple resize
pipeline.text_encoder.resize_token_embeddings(len(pipeline.tokenizer))

# overwrite the content to have better results
input_embeddings = pipeline.text_encoder.get_input_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
pipeline.text_encoder.set_input_embeddings(input_embeddings)

should work 😉

TeKett commented 9 months ago
> (quoting @ArthurZucker's StableDiffusionPipeline snippet above)

Can I give add_tokens an array, or do I loop it? Should I set special_tokens=False since I'm not going to add special tokens? Is the pipeline object returned by from_pretrained a pointer to the physical file or only to the object in memory, and if the latter, how do I save it to disk?

ArthurZucker commented 9 months ago
  1. add_tokens can take an array (see the sketch below for reading one from a text file)
  2. You can set special_tokens to False if they are not special
  3. The pipeline object already has the model loaded; you can do pipeline.model.save_pretrained("path_you_want") 🤗
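Putting points 1 and 2 together, a minimal sketch, assuming a plain-text file with one token per line and the pipeline object from the snippet further up:

with open("tokens.txt", encoding="utf8") as f:                   # hypothetical token file
    new_tokens = [line.strip() for line in f if line.strip()]

num_new_tokens = pipeline.tokenizer.add_tokens(new_tokens, special_tokens=False)
pipeline.text_encoder.resize_token_embeddings(len(pipeline.tokenizer))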
SuperBruceJia commented 9 months ago

I finally chose the following solution:


DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_UNK_TOKEN = "<unk>"

def tokenizer_embedding_resize(special_tokens_dict, tokenizer, model):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg

def add_special_token(tokenizer):
    """
    Add special tokens to the tokenizer
    """
    tokenizer.add_special_tokens(
        {
            "pad_token": DEFAULT_PAD_TOKEN,
            "eos_token": DEFAULT_EOS_TOKEN,
            "bos_token": DEFAULT_BOS_TOKEN,
            "unk_token": DEFAULT_UNK_TOKEN,
        }
    )

    return tokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=save_dir,
    model_max_length=train_max_len,
    add_eos_token=True,
    add_bos_token=True,
    padding='longest',
    padding_side="right",
    truncation=True,
    return_tensors="pt",
    use_fast=False,
    trust_remote_code=True,
    use_auth_token=hf_auth_token,
    device_map=device_map,
)
if tokenizer.pad_token is None:
    tokenizer_embedding_resize(
        special_tokens_dict=dict(pad_token="[PAD]"),
        tokenizer=tokenizer,
        model=model,
    )
tokenizer = add_special_token(tokenizer)

# Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.
model.resize_token_embeddings(len(tokenizer))

The reference codes for the tokenizer_embedding_resize(): https://github.com/meta-math/MetaMath/blob/main/train_math.py#L90-L110

The reference codes for the add_special_token(): https://github.com/meta-math/MetaMath/blob/main/train_math.py#L259-L279

It works well on my side.

Best regards,

Shuyue Dec. 20th, 2023

TeKett commented 9 months ago
> (quoting @ArthurZucker's StableDiffusionPipeline snippet above)

I'm unable to select the checkpoint, since StableDiffusionPipeline.from_pretrained wants a path to a directory containing a pipeline object. I don't have that; what even is that? I can't stress enough that all I have is an SD 1.5 checkpoint, you know, the one you load up into A1111 to generate images, that can be trained using Kohya, and that is shared on civitai. I don't have a pipeline object, and the CLIP model I want to add tokens to is packaged inside a .safetensors file.

ValueError: The provided pretrained_model_name_or_path "C:/Train/checkpoint.safetensors" is neither a valid local path nor a valid repo id.

If I give it just the directory, I get OSError: Error no file named model_index.json found in directory.

ArthurZucker commented 9 months ago

Alright, you can't load a pipeline without the configuration and the required sub-checkpoints, like here: https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main (AFAIK). I would recommend asking on the diffusers repo, as this is outside the scope of transformers 🤗

kumarme072 commented 9 months ago

@ArthurZucker I have come across many similar issues asking how to add new tokens to a vocabulary. For reference, here are a couple of links to useful comments made for doing roughly that:

https://github.com/huggingface/transformers/issues/1413#issuecomment-538083512
https://github.com/huggingface/transformers/issues/2691#issuecomment-587473545
https://github.com/huggingface/tokenizers/issues/627#issuecomment-784286485

However, I am concerned with how to first identify tokens that make sense to add to an existing tokenizer's vocabulary, and also with whether it makes sense to consider removing tokens from a vocabulary.

Some context into my situation:

My situation, I believe, is quite typical: I have a reasonably large domain-specific text dataset available, all in English, and my end goal is to produce a domain-specific language model that can be used for various downstream NLP tasks (e.g., text classification, sentence similarity, etc.) after additional fine-tuning on those tasks.

But from my current understanding, to first obtain that domain-specific language model, I basically have two options:

  1. train a tokenizer from scratch and then use that tokenizer to train a LM from scratch; or
  2. modify the vocabulary of a pretrained tokenizer, adjust the (also pretrained) LM's embedding matrix to work with the new vocab size, and fine-tune the pretrained LM on my domain-specific text dataset on something like MLM.

I am struggling with the first option because (as far as I know) training a language model from scratch is quite expensive, and although I do have some budget for this, I do not have on the order of thousands of dollars.

I am starting to explore the second option, but I am confused on how to properly modify the vocabulary in a way that makes sense for the model, and concerned about what other side effects this could cause.

To summarize:

  • I'd really like to know if there is a low-cost option for training a LM from scratch, to do option 1 above.
  • Or, if option 2 makes more sense, how to properly modify a vocabulary (find good new tokens, remove unused ones, etc.) and adapt the model to overcome potential negative side effects of messing with the embeddings.

Thanks for the help. Sorry for the long question, but I thought some context might be needed, since I might be asking the wrong question in the first place. Cheers. (Question relayed from someone else.)

ArthurZucker commented 9 months ago

Alright. If you need to add new tokens to the vocab but are not sure how, there are a few ways you can do this.

  1. Train a new tokenizer, using https://huggingface.co/learn/nlp-course/chapter6/2#training-a-new-tokenizer (see the sketch after this list). This will make use of https://github.com/huggingface/transformers/blob/3eddda1111f70f3a59485e08540e8262b927e867/src/transformers/tokenization_utils_fast.py#L687. If you have language-specific data that uses none of the "old" tokens, that might be okay, but otherwise, as you mentioned, you would need to retrain the model.
  2. Train a new small tokenizer on a small corpus, then merge the new vocab with the old vocab (merge the vocabs, and the merges if it is a BPE tokenizer, by just adding the new tokens at the end). More on that here: https://github.com/huggingface/tokenizers/issues/1109. It might not be optimal, but if certain languages have fewer tokens it should be alright.
  3. Manually add all the new tokens using add_tokens(), which will just be adding characters / words for simplicity, potentially growing the vocab a lot if the vocabulary of the language is huge.

I think that's pretty much it 😓
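For option 1, a minimal sketch using train_new_from_iterator, assuming a fast tokenizer and an in-memory corpus; the corpus, vocab size, and output path here are placeholders:

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")   # loads the fast tokenizer by default
corpus = ["first domain-specific document", "second domain-specific document"]  # placeholder data

new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=30000)
new_tokenizer.save_pretrained("./domain-tokenizer")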

TeKett commented 7 months ago

So I asked over on diffusers and got no answer, then I asked again and got a response. All they did was argue over why I should not use their project, for the exact reasons the project exists... basically "why are you trying to do this instead of being a sheep?", and they won't answer why the code is erroring out.

It errors out at line 18 (the set_input_embeddings call): cannot assign 'torch.FloatTensor' as child module 'token_embedding'

from diffusers import StableDiffusionPipeline

array = []
with open("D:/tagstest.txt", encoding="utf8") as file:
    array = [row.rstrip("\n") for row in file.readlines()]

pipeline = StableDiffusionPipeline.from_single_file("C:/Train/checkpoint.safetensors")

num_new_tokens = pipeline.tokenizer.add_tokens(array, special_tokens=False)

# simple resize (is this correct?)
pipeline.text_encoder.resize_token_embeddings(len(pipeline.tokenizer))

# overwrite the content to have better results
input_embeddings = pipeline.text_encoder.get_input_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
pipeline.text_encoder.set_input_embeddings(input_embeddings) # Error
pipeline.model.save_pretrained("c:/test")
ArthurZucker commented 7 months ago

pipeline.text_encoder.set_input_embeddings(input_embeddings) should be given an nn.Embedding, if I am not mistaken. Thus you first call get_input_embeddings(), change its data, and then call set_input_embeddings.

TeKett commented 7 months ago

The problem was the types: .weight.data is a torch.FloatTensor, while get_input_embeddings() returns an nn.Embedding. I could just omit this completely, no? Since all it does is unlearn the model?

input_embeddings = pipeline.text_encoder.get_input_embeddings()
input_embeddings_avg = input_embeddings.weight.data[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings.weight.data[-num_new_tokens:] = input_embeddings_avg
pipeline.text_encoder.set_input_embeddings(input_embeddings)

This likely falls under transformers, since I think it's just a text-model issue. When I try to load the model again, I'm getting:

RuntimeError: Error(s) in loading state_dict for CLIPTextModel:
    size mismatch for text_model.embeddings.token_embedding.weight: copying a param with shape torch.Size([90323, 768]) from checkpoint, the shape in current model is torch.Size([49408, 768]).

What is "current model"?

ArthurZucker commented 7 months ago

That just means the config.vocab_size is wrong and should be updated to 90323; the "current model" is the one initialized with the config.

TeKett commented 7 months ago

That just means the config.vocab_size is wrong and should be updated to 90323; the "current model" is the one initialized with the config.

You mean the config.json file in the text_encoder folder? It already says the new number.

{
  "architectures": [
    "CLIPTextModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "dropout": 0.0,
  "eos_token_id": 2,
  "hidden_act": "quick_gelu",
  "hidden_size": 768,
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 77,
  "model_type": "clip_text_model",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "projection_dim": 768,
  "torch_dtype": "float32",
  "transformers_version": "4.35.2",
  "vocab_size": 90320
}
feliperviegas commented 7 months ago

> (quoting @ArthurZucker's three options above)

Hi @ArthurZucker, I tried to extend the tokenizer vocabulary using the add_tokens method, but I got odd behavior and am not sure if I used it correctly. I will try to demonstrate with the following example:

from transformers import BertTokenizer
original_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

text = "California"
original_tokens = original_tokenizer.tokenize(text)
original_tokens  # Here the tokenizer knows the token and returns it with no issues.

Then I tried to add a token to the tokenizer

from transformers import BertTokenizer
original_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

text = "California"
original_tokenizer.add_tokens(["rn"], special_tokens=False)
original_tokens = original_tokenizer.tokenize(text) # And here it returns ['Cal', '##if', '##o', 'rn', 'i', '##a']

I didn't understand why the tokenizer behaved like this. If I add a token B that is a substring of another token A, does this imply that the tokenizer will no longer recognize A, as in the example?
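(This is what the add_tokens defaults do: added tokens are matched anywhere in the text before the model's own tokenization runs, so any occurrence of the added string, even inside a longer word, becomes a split point. If whole-word matching is wanted, one option is to register the token as an AddedToken with single_word=True; a sketch, assuming bert-base-cased with the fast tokenizer class, which handles these options most consistently:)

from transformers import AddedToken, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
# single_word=True asks the tokenizer to match "rn" only as a standalone word,
# instead of splitting it out of the middle of words like "California".
tokenizer.add_tokens([AddedToken("rn", single_word=True)])

print(tokenizer.tokenize("California"))    # expected to stay ['California'], as before adding the token
print(tokenizer.tokenize("the rn shift"))  # a standalone 'rn' is still matched as the added token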