huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

GPT-2 based models generation breaks when adding new special tokens #17690

Closed NtaylorOX closed 2 years ago

NtaylorOX commented 2 years ago

System Info

- `transformers` version: 4.19.4
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Huggingface_hub version: 0.7.0
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- Tensorflow version (GPU?): 2.8.2 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: (True)
- Using distributed or parallel set-up in script?: (False)

Who can help?

@patrickvonplaten

Reproduction

The problem occurs when using GPT2-based models with the transformers library, specifically when calling model.generate() after adding new special tokens (such as a <pad> token).

I have put together a colab for this issue here: https://colab.research.google.com/gist/NtaylorOX/56c3578c1bfe6d6f5ec35ed0641c5e98/hf_gpt2_generate_bug.ipynb.

Steps to reproduce:

1.) Load in libraries and instantiate a GPT2 based model

from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoModelForMaskedLM, AutoTokenizer, set_seed
import os 
import torch
import csv

import torch
from torch.utils.data import Dataset

cuda_device = torch.device('cuda:0')
# now set the default gpu to this one
torch.cuda.set_device(cuda_device)

# set model name and load in using transformers automodel/autotokenizer classes
# use smallest gpt2 type model but can use others
MODEL_NAME = 'distilgpt2'  # 'distilgpt2', 'gpt2-medium', 'gpt2'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

2.) Sanity check

# test its ability with few easy examples
prompt = "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=200)

tokenizer.batch_decode(generated_ids, skip_special_tokens=False)

Outputs:

Setting pad_token_id to eos_token_id:50256 for open-end generation.

['Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is: Spain. Not just the capital of a country; the capital of Europe....]

3.) Add additional special tokens, such as a <pad> token

# Declare special tokens for padding and separating the context from the slogan:
SPECIAL_TOKENS_DICT = {
    'pad_token': '<pad>',    
}

# Add these special tokens to the vocabulary and resize model's embeddings:
tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)
model.resize_token_embeddings(len(tokenizer))

# Show the full list of special tokens:
print(tokenizer.special_tokens_map)

Outputs: {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<pad>'}

4.) Now run through the generate process again

# run same prompt
prompt = "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=200)

tokenizer.batch_decode(generated_ids, skip_special_tokens=False)

output: 'Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is <pad> <pad> <pad>'

This <pad> token issue can be avoided by instead reusing the eos token as the pad token:

tokenizer.pad_token = tokenizer.eos_token
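
In context, that workaround looks like this (a minimal sketch, assuming distilgpt2; pad_token_id is also passed explicitly to generate to silence the open-end generation warning):

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = 'distilgpt2'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Reuse the existing eos token as the pad token: no new embedding row is added,
# so no call to model.resize_token_embeddings is needed.
tokenizer.pad_token = tokenizer.eos_token

prompt = "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generated_ids = model.generate(
    input_ids,
    do_sample=True,
    num_return_sequences=5,
    max_length=200,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))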

But with other special tokens the problem persists. Please see the colab notebook for more detailed examples.

Expected behavior

Adding new special tokens and subsequently resizing the model embeddings should leave the model performing as in its original pre-trained state when given known tokens.
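
A quick way to sanity-check this expectation (a diagnostic sketch, assuming distilgpt2; it compares the next-token logits over the original vocabulary before and after the resize):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = 'distilgpt2'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

prompt = "Capital of England is: London. Capital of France is: Paris."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits_before = model(input_ids).logits[0, -1]   # shape: (original vocab size,)

orig_vocab_size = len(tokenizer)
tokenizer.add_special_tokens({'pad_token': '<pad>'})
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    logits_after = model(input_ids).logits[0, -1]    # shape: (original vocab size + 1,)

# The logits over the original vocabulary should be unchanged, so any change in
# generation has to come from the probability mass given to the new <pad> id.
print(torch.allclose(logits_before, logits_after[:orig_vocab_size]))
print("logit of new <pad> token:", logits_after[orig_vocab_size].item())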

For example, this problem does not occur with a similar autoregressive model, "facebook/opt".

MODEL_NAME = "facebook/opt-350m"
# reload model and tokenizer from its original pre-trained state

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Declare special tokens for padding and separating the context from the slogan:
SPECIAL_TOKENS_DICT = {    
    'additional_special_tokens': ['<context>', '<slogan>']    
}

# OPT already has a <pad> token so add other special tokens to the vocabulary and resize model's embeddings:
tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)
model.resize_token_embeddings(len(tokenizer))

# run same single prompt as before
prompt = "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=200)

tokenizer.batch_decode(generated_ids, skip_special_tokens=False)

output: "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is: Switzerland. Capital of Italy is: Naples. Capital of France is: Rome. Capital of Spain is: Madrid. "

This output is as it should be - but when using GPT2 based models, something goes wrong.

If this is not a bug, and expected behaviour based on something I've missed, please let me know!

patrickvonplaten commented 2 years ago

Hey @NtaylorOX,

Sorry, I'm not 100% following what the problem is here. I can run all of the above samples without a problem and I don't see exactly what the bug is. Could you maybe copy-paste a single code snippet that shows the error and then explain what the output should be? :-)

From what I understand, there is a problem when adding the <pad> token to GPT2's tokenizer? Why is OPT used in the example here?

NtaylorOX commented 2 years ago

Hi! Thanks for the reply @patrickvonplaten

So there was actually a bug in my issue! The output was meant to be full of <pad> tokens (or whatever additional special tokens had been added), but it seems Markdown was swallowing them. I've updated the comment now.

So what happens is that when you update the GPT2 tokenizer via add_special_tokens - the generate function ends up just predicting those new additional tokens repeatedly. You can see the output in full in the colab notebook.

I believe my issue has the appropriate code snippets with output - although I may have made it a bit messy.

The point here is that, using the prompt: "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is: "

The untouched gpt model generates: "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is: Spain. Not just the capital of a country; the capital of Europe. "

But when you add any special token, such as a <pad> token, using add_special_tokens and resize the embeddings of the model, you get: "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is: <pad> <pad> <pad> <pad> <pad>" (or whatever special token you added).

I am 99% sure that adding special tokens should not be interfering with the ability of the model to generate in this way.

The reason for using OPT is that it essentially uses the same tokenizer class and the problem doesn't occur for it. But it has occurred with all GPT2 variants I've tried.

Has this cleared it up at all?

Again, I think it's clearer in the colab notebook.

patrickvonplaten commented 2 years ago

Hey @NtaylorOX,

So I guess you're referring to this code snippet here:

# Declare special tokens for padding and separating the context from the slogan:
SPECIAL_TOKENS_DICT = {
    'pad_token': '<pad>',    
}

# Add these special tokens to the vocabulary and resize model's embeddings:
tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)
model.resize_token_embeddings(len(tokenizer))

# Show the full list of special tokens:
print(tokenizer.special_tokens_map)
# run same prompt
prompt = "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=200)

tokenizer.batch_decode(generated_ids, skip_special_tokens=False)

which then generates the <pad> token as an output (but isn't this expected since you set skip_special_tokens=False?)

Sorry, I'm still not 100% certain I understand what you mean. Could you please post a single code snippet that I can just copy-paste and run, that shows me the output, together with a note of what the output should have been instead?

This would be super nice - sorry I'm a bit lost here

NtaylorOX commented 2 years ago

Hi @patrickvonplaten,

Thanks for persisting with my confusing post :D.

Yes, the following snippet is the main concern:

# Declare special tokens for padding and separating the context from the slogan:
SPECIAL_TOKENS_DICT = {
    'pad_token': '<pad>',    
}

# Add these special tokens to the vocabulary and resize model's embeddings:
tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)
model.resize_token_embeddings(len(tokenizer))

# Show the full list of special tokens:
print(tokenizer.special_tokens_map)
# run same prompt
prompt = "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=200)

tokenizer.batch_decode(generated_ids, skip_special_tokens=False)

The expected output for gpt2-medium would be the same as the output before adding the special tokens, which would be:

"Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is: Basel. Capital of Liechtenstein is: Liechtenstein. Capital of Mexico is: Mexico City. Capital of South Africa is: Cape Town...."

So, nice and sensible output. My understanding, and the way it works with non-GPT2 models, is that adding special tokens should not lead to a different output, but here it does.

Again, after adding special tokens as described above, the output becomes:

"'Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is pad pad pad pad ...".

To me this seems wrong? The output should be the same as it was originally, but the model is now unable to produce anything other than <pad> tokens when generating. And if you inspect the input ids, there is no pad token encoded by the tokenizer, nor is there any padding, as it's a single sample (a quick check is sketched below).
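
For reference, a quick way to do that inspection (a small sketch, assuming the tokenizer with the <pad> token already added and the same prompt as above):

# Confirm the encoded prompt contains no <pad> ids before generation.
print(input_ids)
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))
print("pad_token_id:", tokenizer.pad_token_id,
      "- present in prompt:", tokenizer.pad_token_id in input_ids[0].tolist())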

Has this made anything clearer?

patrickvonplaten commented 2 years ago

Haha we'll get there @NtaylorOX :-)

Right now when running your last code snippet, I get:

NameError                                 Traceback (most recent call last)
<ipython-input-1-d3e787aeade6> in <module>
      5
      6 # # Add these special tokens to the vocabulary and resize model's embeddings:
----> 7 tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)
      8 model.resize_token_embeddings(len(tokenizer))
      9

NameError: name 'tokenizer' is not defined

Could you fix the code snippet so that I can run it in a Python shell to see the output expected by you?

NtaylorOX commented 2 years ago

Now I'm confused. In your comment, did you mean to put a code snippet after "I get:"?

I read this as you would be posting the output from running the code?

To get what I believe to produce the "incorrect output", run this:

from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoModelForMaskedLM, AutoTokenizer, set_seed
import os 
import torch
import csv

import torch
from torch.utils.data import Dataset

cuda_device = torch.device('cuda:0')
# now set the default gpu to this one
torch.cuda.set_device(cuda_device)

# set model name and load in using transformers automodel/autotokenizer classes
# use smallest gpt2 type model but can use others
MODEL_NAME = 'distilgpt2'  # 'distilgpt2', 'gpt2-medium', 'gpt2'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Declare special tokens for padding and separating the context from the slogan:
SPECIAL_TOKENS_DICT = {
    'pad_token': '<pad>',    
}

# Add these special tokens to the vocabulary and resize model's embeddings:
tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)
model.resize_token_embeddings(len(tokenizer))

# Show the full list of special tokens:
print(tokenizer.special_tokens_map)
# run same prompt
prompt = "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=200)

tokenizer.batch_decode(generated_ids, skip_special_tokens=False)

To get what the output should be and normally is without special tokens:

from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoModelForMaskedLM, AutoTokenizer, set_seed
import os 
import torch
import csv

import torch
from torch.utils.data import Dataset

cuda_device = torch.device('cuda:0')
# now set the default gpu to this one
torch.cuda.set_device(cuda_device)

# set model name and load in using transformers automodel/autotokenizer classes
# use smallest gpt2 type model but can use others
MODEL_NAME = 'distilgpt2'  # 'distilgpt2', 'gpt2-medium', 'gpt2'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# run same prompt
prompt = "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=200)

tokenizer.batch_decode(generated_ids, skip_special_tokens=False)

Does this help?

patrickvonplaten commented 2 years ago

Hey @NtaylorOX,

Sorry, just corrected my comment above. OK, I think I see what the problem is. You've added a token and now this token is predominantly generated. IMO this is not because it's called a <pad> token; it's simply due to the pretrained weights of distilgpt2.

Also see this issue: https://github.com/huggingface/transformers/issues/8472
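
A mitigation that is often suggested for this situation (a sketch, assuming distilgpt2: overwrite the newly added, randomly initialized embedding row with the mean of the pretrained embeddings so it does not dominate the output distribution; GPT2 ties lm_head to the input embeddings, so this also affects the output projection):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = 'distilgpt2'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

num_added = tokenizer.add_special_tokens({'pad_token': '<pad>'})
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    embeddings = model.get_input_embeddings().weight        # shape: (new vocab size, hidden size)
    mean_embedding = embeddings[:-num_added].mean(dim=0)    # mean of the pretrained rows
    embeddings[-num_added:] = mean_embedding                # overwrite the new row(s)

Alternatively, generate's bad_words_ids argument (e.g. bad_words_ids=[[tokenizer.pad_token_id]]) can be used to forbid the new ids during generation.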

NtaylorOX commented 2 years ago

Hi @patrickvonplaten ,

Yes - I did not mean it was only affecting <pad> tokens. But it seems I had not found that previous issue, which does seem to address the problem.

Also, as I mentioned, it does not only affect distilgpt2 - it affects all GPT2 models I tried. But it does not happen with the OPT model, which is why I found it odd.

NtaylorOX commented 2 years ago

Also - on that other issue: https://github.com/huggingface/transformers/issues/8472

When using your nicely supplied possible fix:

import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
tokenizer.add_special_tokens(
    {'additional_special_tokens': ['<USER>', '<SYSTEM>']}
)

model = GPT2LMHeadModel.from_pretrained('distilgpt2')
model.resize_token_embeddings(len(tokenizer))
inp_tok_ids = tokenizer.encode('I want a pepperoni pizza with mushroom')
inp_tensor = torch.LongTensor(inp_tok_ids).unsqueeze(0)
model.eval()

model.lm_head.weight[-2, :] = (torch.zeros((768,)) - 10000.0) 
model.lm_head.weight[-1, :] = (torch.zeros((768,)) - 10000.0) 

with torch.no_grad():
    for i in range(10):
        outputs = model(inp_tensor)
        logits = outputs[0][:, -1, :]
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1).squeeze(1)
        inp_tensor = torch.cat([inp_tensor, next_token.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(inp_tensor[0]))

I am getting an error:

RuntimeError                              Traceback (most recent call last)

model.eval()
----> 3 model.lm_head.weight[-2, :] = (torch.zeros((768,)) - 10000.0) 
      4 model.lm_head.weight[-1, :] = (torch.zeros((768,)) - 10000.0) 
      6 with torch.no_grad():

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

Sorry if I shouldn't be crossing wires so much! Just wanted to highlight that this example doesn't seem to work, at least with my transformers version etc.
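
For reference, that error should go away if the in-place weight edits are wrapped in torch.no_grad() (a small sketch of the adjusted lines, using the model from the snippet above):

# In-place writes to a parameter that requires grad must happen with autograd disabled.
with torch.no_grad():
    model.lm_head.weight[-2, :] = torch.zeros((768,)) - 10000.0
    model.lm_head.weight[-1, :] = torch.zeros((768,)) - 10000.0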