huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Group beam search decoded result depends on pad_token_id even though it's not printable #27894

Closed: Wovchena closed this issue 11 months ago

Wovchena commented 11 months ago

System Info

transformers version: 4.35.2; OS: Windows and Ubuntu 20; Python: 3.11.3

Who can help?

@gante

Information

Tasks

Reproduction

import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v0.6')
input_ids = tokenizer('Hi', return_tensors='pt')['input_ids']
model = transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v0.6')
print(model.generation_config.eos_token_id)

# Group beam search with pad_token_id=0 (not the EOS token).
pad_token_id = 0
assert pad_token_id != model.generation_config.eos_token_id
zero = [tokenizer.decode(beam, skip_special_tokens=True) for beam in model.generate(
    input_ids, max_new_tokens=25, num_beam_groups=9, num_beams=99, num_return_sequences=99,
    diversity_penalty=1.0, no_repeat_ngram_size=3, do_sample=False, pad_token_id=pad_token_id)]
print(zero)

# The same call with pad_token_id=1; the decoded text should be identical.
pad_token_id = 1
assert pad_token_id != model.generation_config.eos_token_id
one = [tokenizer.decode(beam, skip_special_tokens=True) for beam in model.generate(
    input_ids, max_new_tokens=25, num_beam_groups=9, num_beams=99, num_return_sequences=99,
    diversity_penalty=1.0, no_repeat_ngram_size=3, do_sample=False, pad_token_id=pad_token_id)]
print(one)

assert zero == one

Expected behavior

Using a non-printable pad_token_id should result in the same generated text.

When a group is completed, group beam search keeps padding the ongoing beams of that group. That in turn affects how the diversity penalty is applied to the tokens of the other groups, and so different tokens are chosen.

I believe a completed group should not affect the log probabilities of the other groups.
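
For reference, a minimal sketch of the kind of per-group diversity penalty group beam search applies (illustrative only, not the transformers implementation): tokens already selected by earlier groups at the same decoding step are penalised, so if a finished group keeps emitting pad_token_id, that pad token changes the scores seen by the later groups.

import torch

def penalize_for_diversity(scores, previous_group_tokens, diversity_penalty, vocab_size):
    # scores: (beams_in_group, vocab_size) log-probabilities for the current group.
    # previous_group_tokens: token ids already chosen by earlier groups at this step;
    # this is where the pad tokens of a finished group leak in.
    frequency = torch.bincount(previous_group_tokens, minlength=vocab_size).to(scores.dtype)
    return scores - diversity_penalty * frequency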

ArthurZucker commented 11 months ago

Hey, both padding tokens are in fact printable; you just decided to skip them by using skip_special_tokens=True. You should try setting the padding side to the left, as is recommended for generation; padding tokens on the right will always impact the model. More details here.

Wovchena commented 11 months ago

Hi. skip_special_tokens=True is intentional. How do I modify the reproducer to set the padding to the left?

ArthurZucker commented 11 months ago

tokenizer.padding_side = "left"
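
For illustration, a minimal sketch of what this setting changes for a padded batch (a pad token is assigned here in case the checkpoint defines none). Note that a single prompt has nothing to pad, so this setting alone does not change the tokenized input in the reproducer above.

import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v0.6')
tokenizer.pad_token = tokenizer.eos_token  # assign a pad token in case none is defined
tokenizer.padding_side = "left"
batch = tokenizer(['Hi', 'Hi, how are you today?'], return_tensors='pt', padding=True)
print(batch['input_ids'])       # pad tokens sit on the left of the shorter sequence
print(batch['attention_mask'])  # zeros mark the padded positions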

Wovchena commented 11 months ago

The results are still different:

import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v0.6')
tokenizer.padding_side = "left"
input_ids = tokenizer('Hi', return_tensors='pt')['input_ids']
model = transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v0.6')
print(model.generation_config.eos_token_id)

# Group beam search with pad_token_id=0 (not the EOS token).
pad_token_id = 0
assert pad_token_id != model.generation_config.eos_token_id
zero = [tokenizer.decode(beam, skip_special_tokens=True) for beam in model.generate(
    input_ids, max_new_tokens=25, num_beam_groups=9, num_beams=99, num_return_sequences=99,
    diversity_penalty=1.0, no_repeat_ngram_size=3, do_sample=False, pad_token_id=pad_token_id)]
print(zero)

# The same call with pad_token_id=1; the decoded text should be identical.
pad_token_id = 1
assert pad_token_id != model.generation_config.eos_token_id
one = [tokenizer.decode(beam, skip_special_tokens=True) for beam in model.generate(
    input_ids, max_new_tokens=25, num_beam_groups=9, num_beams=99, num_return_sequences=99,
    diversity_penalty=1.0, no_repeat_ngram_size=3, do_sample=False, pad_token_id=pad_token_id)]
print(one)

assert zero == one

My explanation for why your suggestion didn't help is that the problem I'm referring to lies inside model.generate(), which doesn't involve the tokenizer.

gante commented 11 months ago

Hi @Wovchena 👋

The issue with your script was that you were not passing the attention mask to generate :)

Throughout transformers, we make a best effort to infer the attention mask when it is not passed: if a token is equal to the pad token, its attention mask is 0 (and 1 otherwise). In your particular example, you were setting the pad token to the BOS token (both id 1), the token the tokenizer uses to signal the beginning of the sequence. As such, the inferred attention mask was different in the two calls, leading to different outputs.
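
A minimal sketch of that inference rule (illustrative token ids, not the exact transformers code):

import torch

input_ids = torch.tensor([[1, 6324]])  # [BOS, "Hi"]; BOS id is 1, the second id is illustrative
for pad_token_id in (0, 1):
    inferred_mask = (input_ids != pad_token_id).long()
    print(pad_token_id, inferred_mask.tolist())
# pad_token_id=0 -> [[1, 1]]
# pad_token_id=1 -> [[0, 1]]   the BOS position is (wrongly) masked out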

We always recommend passing the attention mask 🤗

Working example (passing the attention mask):

import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v0.6')
model = transformers.LlamaForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v0.6')

tokenizer.padding_side = "left"
# The tokenizer output contains both `input_ids` and `attention_mask`;
# unpacking it with ** passes the attention mask to generate explicitly.
inputs = tokenizer('Hi', return_tensors='pt')
print(model.generation_config.eos_token_id)

pad_token_id = 0
assert pad_token_id != model.generation_config.eos_token_id
gen_zero = model.generate(
    **inputs, max_new_tokens=25, num_beam_groups=9, num_beams=99, num_return_sequences=99,
    diversity_penalty=1.0, no_repeat_ngram_size=3, do_sample=False, pad_token_id=pad_token_id)
zero = [tokenizer.decode(beam, skip_special_tokens=True) for beam in gen_zero]

pad_token_id = 1
assert pad_token_id != model.generation_config.eos_token_id
gen_one = model.generate(
    **inputs, max_new_tokens=25, num_beam_groups=9, num_beams=99, num_return_sequences=99,
    diversity_penalty=1.0, no_repeat_ngram_size=3, do_sample=False, pad_token_id=pad_token_id)
one = [tokenizer.decode(beam, skip_special_tokens=True) for beam in gen_one]

assert zero == one

Wovchena commented 11 months ago

Thank you!