Closed NtaylorOX closed 2 years ago
Hey @NtaylorOX,
Sorry, I'm not 100% following what the problem is here. I can run all of the above samples without a problem and I don't see exactly what the bug is. Could you maybe copy-paste a single code snippet here that shows the error and then explain what the output should be? :-)
From what I understand, there is a problem when adding the <pad_token> to GPT2's tokenizer? Why is OPT used in the example here?
Hi! Thanks for the reply @patrickvonplaten
So there was actually a bug in my issue! The output was meant to be fully of
So what happens is that when you update the GPT2 tokenizer via add_special_tokens - the generate function ends up just predicting those new additional tokens repeatedly. You can see the output in full in the colab notebook.
I believe my issue has the appropriate code snippets with output - although I may have made it a bit messy.
The point here is that, using the prompt: "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is: "
The untouched gpt model generates: "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is: Spain. Not just the capital of a country; the capital of Europe. "
But when you add any special token, such as
I am 99% sure that adding special tokens should not be interfering with the ability of the model to generate in this way.
The reason for using OPT is that it essentially uses the same tokenizer class and the problem doesn't occur for it. But it has occurred for all GPT2 variants I've tried.
Has this cleared it up at all?
Again, I think it's clearer in the colab notebook
Hey @NtaylorOX,
So I guess you're referring to this code snippet here:
# Declare special tokens for padding and separating the context from the slogan:
SPECIAL_TOKENS_DICT = {
'pad_token': '<pad>',
}
# # Add these special tokens to the vocabulary and resize model's embeddings:
tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)
model.resize_token_embeddings(len(tokenizer))
# Show the full list of special tokens:
print(tokenizer.special_tokens_map)
# run same prompt
prompt = "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=200)
tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
which then generates the <pad> token as an output (but isn't this expected since you set skip_special_tokens=False?)
Sorry I'm still not 100% certain I understand what you mean. Could you please post a single code snippet that I can just copy-paste and run and that shows me an output and a message what the output should have been instead?
This would be super nice - sorry I'm a bit lost here
Hi @patrickvonplaten,
Thanks for persisting with my confusing post :D.
Yes, the following snippet is the main concern:
# Declare special tokens for padding and separating the context from the slogan:
SPECIAL_TOKENS_DICT = {
'pad_token': '<pad>',
}
# # Add these special tokens to the vocabulary and resize model's embeddings:
tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)
model.resize_token_embeddings(len(tokenizer))
# Show the full list of special tokens:
print(tokenizer.special_tokens_map)
# run same prompt
prompt = "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=200)
tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
The expected output for gpt2-medium would be the same as the output before adding the special tokens, which would be:
"Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is: Basel. Capital of Liechtenstein is: Liechtenstein. Capital of Mexico is: Mexico City. Capital of South Africa is: Cape Town...."
So nice and sensible output. To my understanding, and judging by how it works with non-GPT2 models, adding special tokens should not lead to a different output, but it does.
Again, after adding special tokens as described above, the output becomes:
"'Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is pad pad pad pad ...".
To me this seems wrong? The output should be the same as it was originally, but it's unable to produce anything other than pad tokens when generating now. And if you inspect the input ids etc., there is no pad token encoded by the tokenizer, nor is there any padding, as it's a single sample.
Has this made anything clearer?
Haha we'll get there @NtaylorOX :-)
Right now when running your last code snippet, I get:
NameError Traceback (most recent call last)
<ipython-input-1-d3e787aeade6> in <module>
5
6 # # Add these special tokens to the vocabulary and resize model's embeddings:
----> 7 tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)
8 model.resize_token_embeddings(len(tokenizer))
9
NameError: name 'tokenizer' is not defined
Could you fix the code snippet so that I can run it in a Python shell to see the output expected by you?
Now I'm confused. In your comment, did you mean to put the code snippet after "I get:"?
I read this as you would be posting the output from running the code?
To get what I believe to produce the "incorrect output", run this:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoModelForMaskedLM, AutoTokenizer, set_seed
import os
import torch
import csv
import torch
from torch.utils.data import Dataset
cuda_device = torch.device('cuda:0')
# now set the default gpu to this one
torch.cuda.set_device(cuda_device)
# set model name and load in using transformers automodel/autotokenizer classes
# use smallest gpt2 type model but can use others
MODEL_NAME = 'distilgpt2'  # also tried: 'gpt2-medium', 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# Declare special tokens for padding and separating the context from the slogan:
SPECIAL_TOKENS_DICT = {
'pad_token': '<pad>',
}
# # Add these special tokens to the vocabulary and resize model's embeddings:
tokenizer.add_special_tokens(SPECIAL_TOKENS_DICT)
model.resize_token_embeddings(len(tokenizer))
# Show the full list of special tokens:
print(tokenizer.special_tokens_map)
# run same prompt
prompt = "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=200)
tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
To get what the output should be and normally is without special tokens:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoModelForMaskedLM, AutoTokenizer, set_seed
import os
import torch
import csv
import torch
from torch.utils.data import Dataset
cuda_device = torch.device('cuda:0')
# now set the default gpu to this one
torch.cuda.set_device(cuda_device)
# set model name and load in using transformers automodel/autotokenizer classes
# use smallest gpt2 type model but can use others
MODEL_NAME = 'distilgpt2'  # also tried: 'gpt2-medium', 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# run same prompt
prompt = "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=200)
tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
Does this help?
Hey @NtaylorOX,
Sorry, just corrected my comment above. OK, I think I see what the problem is. You've added a token and now this token is predominantly generated. IMO this is not because it's called a <pad> token; it's simply due to the pretrained weights of distilgpt2.
Also see this issue: https://github.com/huggingface/transformers/issues/8472
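A toy illustration of how a single untrained token could end up dominating sampling. The logit values below are made up for illustration and are an assumption, not measurements from distilgpt2: if the pretrained tokens all receive strongly negative logits while a freshly added row yields a logit near zero, softmax assigns nearly all of the probability to the new token.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four pretrained tokens (illustrative values only).
pretrained_logits = [-95.0, -97.5, -99.0, -101.0]
# Hypothetical logit for a newly added, untrained token.
new_token_logit = 0.0

probs = softmax(pretrained_logits + [new_token_logit])
new_token_prob = probs[-1]
print(f"probability of the new token: {new_token_prob:.6f}")
```

Under these assumed numbers, sampling would pick the new token almost every time, which would look like the repeated pad output described in this thread.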
Hi @patrickvonplaten ,
Yes - I did not mean it was only affecting
Also, as I mentioned, it does not only affect distilgpt - it affects all GPT2 models I tried. But it does not happen with the OPT model, which I found odd.
Also - on that other issue: https://github.com/huggingface/transformers/issues/8472
When using your nicely supplied possible fix:
import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
tokenizer.add_special_tokens(
{'additional_special_tokens': ['<USER>', '<SYSTEM>']}
)
model = GPT2LMHeadModel.from_pretrained('distilgpt2')
model.resize_token_embeddings(len(tokenizer))
inp_tok_ids = tokenizer.encode('I want a pepperoni pizza with mushroom')
inp_tensor = torch.LongTensor(inp_tok_ids).unsqueeze(0)
model.eval()
model.lm_head.weight[-2, :] = (torch.zeros((768,)) - 10000.0)
model.lm_head.weight[-1, :] = (torch.zeros((768,)) - 10000.0)
with torch.no_grad():
    for i in range(10):
        outputs = model(inp_tensor)
        logits = outputs[0][:, -1, :]
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1).squeeze(1)
        inp_tensor = torch.cat([inp_tensor, next_token.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(inp_tensor[0]))
I am getting an error:
RuntimeError Traceback (most recent call last)
model.eval()
----> 3 model.lm_head.weight[-2, :] = (torch.zeros((768,)) - 10000.0)
4 model.lm_head.weight[-1, :] = (torch.zeros((768,)) - 10000.0)
6 with torch.no_grad():
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
Sorry if I shouldn't be crossing wires so much! Just wanted to highlight that this example doesn't seem to work, at least with my transformers version etc.
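A minimal sketch of the RuntimeError above and one way around it, assuming the cause is the in-place assignment to a parameter that requires grad while autograd is recording. The small `torch.nn.Linear` here is a stand-in for `model.lm_head` so that the example runs without downloading a model; the sizes are arbitrary.

```python
import torch

hidden_size, vocab_size = 8, 5
# Stand-in for model.lm_head (assumption: same kind of trainable weight matrix).
lm_head = torch.nn.Linear(hidden_size, vocab_size, bias=False)
new_row = torch.full((hidden_size,), -10000.0)

# Outside no_grad, writing into a slice of a leaf tensor that requires grad fails.
try:
    lm_head.weight[-1, :] = new_row
    raised = False
except RuntimeError:
    raised = True

# Inside no_grad, the same in-place assignment is allowed.
with torch.no_grad():
    lm_head.weight[-1, :] = new_row

print(raised, lm_head.weight[-1])
```

In the snippet quoted above, the corresponding change would be to move the two `model.lm_head.weight[...] = ...` lines inside a `with torch.no_grad():` block. This only addresses the RuntimeError; whether overwriting those rows actually suppresses the new tokens is not verified here.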
System Info
Who can help?
@patrickvonplaten
Reproduction
The problem occurs when using GPT2-based models with the transformers library, specifically when calling model.generate() after adding new tokens (special or otherwise).
I have put together a colab for this issue here: https://colab.research.google.com/gist/NtaylorOX/56c3578c1bfe6d6f5ec35ed0641c5e98/hf_gpt2_generate_bug.ipynb.
Steps to reproduce:
1.) Load in libraries and instantiate a GPT2 based model
2.) Sanity check
Outputs:
Setting pad_token_id to eos_token_id:50256 for open-end generation.
['Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is: Spain. Not just the capital of a country; the capital of Europe....]
3.) Add additional special tokens such as \
Outputs: {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': ''}
4.) Now run through the generate process again
output: 'Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is pad pad pad'
This token issue can be fixed by instead setting pad_token_id to eos_token_id via:
But with other special tokens the problem persists. Please see the colab notebook for more detailed examples.
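One commonly suggested workaround for newly added tokens, sketched here on a toy matrix rather than a real model, is to initialise each new embedding row to the mean of the pretrained rows. This is an assumption about what might help and is not something confirmed in this issue. The sketch assumes the output layer shares its weights with the embedding matrix, and all sizes and values are made up.

```python
import torch

torch.manual_seed(0)
old_vocab, hidden = 10, 4

# Stand-in for a pretrained embedding matrix (illustrative values only).
pretrained = torch.randn(old_vocab, hidden) * 0.1 + 1.0

# Toy equivalent of resizing the embeddings by one extra token.
resized = torch.nn.Embedding(old_vocab + 1, hidden)
with torch.no_grad():
    resized.weight[:old_vocab] = pretrained
    # Initialise the new row to the mean of the pretrained rows.
    resized.weight[old_vocab:] = pretrained.mean(dim=0, keepdim=True)

    # With a tied output layer, logits are embedding rows dotted with the hidden state.
    hidden_state = torch.randn(hidden)
    logits = resized.weight @ hidden_state

# The new token's logit equals the average logit of the pretrained tokens.
print(logits[-1].item(), logits[:old_vocab].mean().item())
```

Because the new row is the average of the old rows, its logit is the average of the old logits, so it cannot exceed the highest pretrained logit. On a real model the rows would be reached through `model.get_input_embeddings().weight` after `resize_token_embeddings`, again inside `torch.no_grad()`.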
Expected behavior
output: "Capital of England is: London. Capital of France is: Paris. Capital of Spain is: Madrid. Capital of Switzerland is: Switzerland. Capital of Italy is: Naples. Capital of France is: Rome. Capital of Spain is: Madrid. "
This output is as it should be - but when using GPT2 based models, something goes wrong.
If this is not a bug, and expected behaviour based on something I've missed, please let me know!