huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Mismatch between beam search score transition probabilities and beam sequence scores #15869

Closed PastelBelem8 closed 2 years ago

PastelBelem8 commented 2 years ago

Who can help

I'm tagging @patrickvonplaten because he recently worked on this in #14654.

Information

The model is T5. I'm using its conditional generation capabilities and testing the beam search decoding outputs.

Purpose: Create two simple dummy examples and test whether the transition probability scores match the sequence scores. The goal is to understand what the sequence scores represent (are they unnormalized by length? normalized by length?). From the discussion in PR #14654 I had the impression that it would be enough to sum the transition probabilities, but these do not seem to match. Would you please help me understand?

To reproduce

Steps to reproduce the behavior:

  1. Create two simple dummy input examples.
  2. Encode them using a T5 model (model) and generate the output scores using beam search.
  3. Obtain the transition scores using model.compute_transition_beam_scores.
  4. Sum the results of step 3 and compare them with BeamSearchEncoderDecoderOutput.sequences_scores.
import torch
import transformers

model_name = "t5-small"
encoding_hyperparameters = {
    "padding": "max_length",
    "max_length": 512,
    "truncation": True,
    "add_special_tokens": True,
    "return_attention_mask": True,
    "return_tensors": "pt",
}

tokenizer = transformers.T5TokenizerFast.from_pretrained(model_name)
model = transformers.T5ForConditionalGeneration.from_pretrained(model_name)

EXAMPLE = ["question: How are you? \n context: I had a long day, she said. I am so exhausted.", 
           "question: What did the fox do? \n context: The fox jumped over the fence into a very green lawn."]

BEAM_SEARCH_KWARGS = {
    "num_beams": 4,
    "do_sample": False,
    "num_return_sequences": 1,
}

# Encode inputs
inputs_ids = tokenizer(EXAMPLE, **encoding_hyperparameters)

# Generate using beam search
beamsearch_results = model.generate(
    input_ids=inputs_ids["input_ids"],
    attention_mask=inputs_ids["attention_mask"],    
    max_length=10,
    return_dict_in_generate=True,
    output_scores=True,
    # the id of the token to force as the last generated token when max_length is reached
    forced_eos_token_id=tokenizer.eos_token_id,
    **BEAM_SEARCH_KWARGS
)

trs_bs = model.compute_transition_beam_scores(
    sequences=beamsearch_results.sequences,
    scores=beamsearch_results.scores, 
    beam_indices=beamsearch_results.beam_indices
)

print("Summ:", torch.sum(trs_bs, dim=1), "Expected:", beamsearch_results.sequences_scores)
print("Sum/length:", torch.sum(trs_bs, dim=1)/beamsearch_results.beam_indices.shape[-1], "Expected:", beamsearch_results.sequences_scores)
# output
# Sum: tensor([-1.5411, -0.3851]) Expected: tensor([-0.1712, -0.0428])
# Sum/length: tensor([-0.1712, -0.0428]) Expected: tensor([-0.1712, -0.0428])

From the example above I deduced that, to obtain the same scores as those in sequences_scores, it would suffice to divide by the length of the sequences. In this case it seems to work nicely because both sequences have the same length:

# output of beamsearch_results.sequences
tensor([[    0,    27,   141,     3,     9,   307,   239,     6,   255,     1],
        [    0,     3, 16287,   147,     8,  8227,   139,     3,     9,     1]])

So I tried a different example that would cause the beamsearch_results.sequences to have different lengths:

# Example 2
# The only change to the script above is the example, where we modify the first sequence in the batch
EXAMPLE = ["question: Is this yes or no question? \n context: It is yes", 
           "question: What did the fox do? \n context: The fox jumped over the fence into a very green lawn."]
# ...

print("Sum:", torch.sum(trs_bs, dim=1), "Expected:", beamsearch_results.sequences_scores)
print("Sum/length:", torch.sum(trs_bs, dim=1)/beamsearch_results.beam_indices.shape[-1], "Expected:", beamsearch_results.sequences_scores)
print("Sum/rel_length:", torch.sum(trs_bs, dim=1) / torch.sum(trs_bs != 0, dim=1), "Expected:", beamsearch_results.sequences_scores)
# outputs
# Sum: tensor([-0.0770, -0.3851]) Expected: tensor([-0.0385, -0.0428])
# Sum/length: tensor([-0.0086, -0.0428]) Expected: tensor([-0.0385, -0.0428])
# Sum/rel_length: tensor([-0.0385, -0.0481]) Expected: tensor([-0.0385, -0.0428])

The output of beamsearch_results.sequences for the above example is:

tensor([[    0,  4273,     1,     0,     0,     0,     0,     0,     0,     0],
        [    0,     3, 16287,   147,     8,  8227,   139,     3,     9,     1]])

The difference between Sum/length and Sum/rel_length is that in the former I divide by the maximum length of the generated sequences, whereas in the latter I divide by the number of non-zero transition scores. We can see that in the latter case (i.e., when dividing by the relative length) only the first example's score matches the original beamsearch_results.sequences_scores.
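
For what it's worth, the numbers above are consistent with the final score being the sum of the transition scores divided by the number of generated tokens (everything after the decoder-start pad token, up to and including the EOS token) raised to the length_penalty, which defaults to 1.0. Below is a minimal sketch of that check; it is my own reconstruction from the numbers in this issue, not the library's code:

import torch

length_penalty = 1.0  # generate()'s default
eos_id = tokenizer.eos_token_id  # 1 for T5

# Count the generated tokens per hypothesis: everything after the decoder-start
# token, up to and including the first EOS (or the full generated length if no
# EOS appears before max_length).
gen_lengths = []
for seq in beamsearch_results.sequences:
    eos_positions = (seq[1:] == eos_id).nonzero(as_tuple=True)[0]
    if len(eos_positions) > 0:
        gen_lengths.append(eos_positions[0].item() + 1)
    else:
        gen_lengths.append(seq.shape[0] - 1)
gen_lengths = torch.tensor(gen_lengths, dtype=torch.float)

reconstructed = torch.sum(trs_bs, dim=1) / (gen_lengths ** length_penalty)
print("Reconstructed:", reconstructed, "Expected:", beamsearch_results.sequences_scores)
# For both examples above this reproduces the reported sequences_scores.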

Would you please help me better understand the computation of these probabilities and their connection to the sequences_scores? In particular, are the individual scores returned by compute_transition_beam_scores length-normalized? Do these individual scores represent the joint probability, or the individual per-token probabilities? Are we supposed to count the initial padding token when computing the scores?

Thanks in advance for your time!

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

patrickvonplaten commented 2 years ago

Hey @PastelBelem8,

Sorry to answer so late. We've added some tests to make sure the transition probabilities work correctly. Could you take a look at this answer: https://github.com/huggingface/transformers/issues/16413#issuecomment-1088907369 and see whether it applies to your use case?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

hacobe commented 2 years ago

I find the code above from @PastelBelem8 works if I set the length_penalty to 0. However, if I change the prompt so that the model produces a completion that has fewer tokens than the max_length, then the sequences_scores and the output of compute_transition_beam_scores are very different again. @patrickvonplaten, any thoughts on what might be going on there? Thanks in advance for your help!

See the code in this colab: https://colab.research.google.com/drive/1KAc_Mk2k8qiiqKgXfAcggJWfaBEvvzox#scrollTo=6in7zwm7Dqxf
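
For reference, a minimal sketch of the length_penalty=0 variant described above (same setup as the original script; with length_penalty=0.0 the length-normalization term becomes 1, so sequences_scores should equal the raw sum of the transition scores):

# Same setup as PastelBelem8's script above; only the generate() call changes.
beamsearch_results = model.generate(
    input_ids=inputs_ids["input_ids"],
    attention_mask=inputs_ids["attention_mask"],
    max_length=10,
    length_penalty=0.0,  # disable length normalization of the final beam scores
    return_dict_in_generate=True,
    output_scores=True,
    forced_eos_token_id=tokenizer.eos_token_id,
    **BEAM_SEARCH_KWARGS,
)

trs_bs = model.compute_transition_beam_scores(
    sequences=beamsearch_results.sequences,
    scores=beamsearch_results.scores,
    beam_indices=beamsearch_results.beam_indices,
)

# With length_penalty=0.0 the sums should match sequences_scores when the
# completion fills max_length; per the report above, they still diverge for
# completions shorter than max_length.
print("Sum:", torch.sum(trs_bs, dim=1), "Expected:", beamsearch_results.sequences_scores)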

patrickvonplaten commented 2 years ago

Thanks for the script @hacobe,

I can reproduce the problem. It looks like an error in Transformers. Investigating now.

patrickvonplaten commented 2 years ago

Hey @hacobe,

The problem is actually much more difficult than I thought. It'll need a bigger refactor of the beam scorer. I'll keep you updated!

hacobe commented 2 years ago

Got it. Thanks for looking into it!

superhero-7 commented 2 years ago

@PastelBelem8 @hacobe @patrickvonplaten I am confused about a few things (I also work with a T5 model on a vision-language task):

  1. Do you find that the first token id in every beam of beamsearch_results.sequences is always equal to zero? I also find that the length of every beam of beamsearch_results.sequences always seems to equal the length of the scores tuple plus one.
  2. I noticed these questions because I am currently working on the REINFORCE algorithm with a CIDEr reward, and I think most people who want the transition probabilities may also want to use REINFORCE. However, I don't need the probability of every token at every time step; what I need is the probability of the whole sequence, which can be formalized as P(W) = P(Wt | Wt-1, ..., W1) * P(Wt-1 | Wt-2, ..., W1) * ... * P(W1). I think the last entry of the returned scores tuple might represent the probability of the sequence; I don't know whether I am missing something (see the sketch at the end of this comment). Also, can I compute the gradient of P(W) (obtained from the last entry of the returned scores) using standard backpropagation?
  3. Right now I am working with an old version of transformers that doesn't have the new compute_transition_beam_scores method. Is there any way to use compute_transition_beam_scores without upgrading the whole transformers library?

I would appreciate it very much if anyone could give me any advice!
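
Regarding point 2, here is a minimal sketch of my understanding of how the per-step scores relate to the sequence probability, assuming a transformers version that already has compute_transition_beam_scores (per point 3, older versions do not) and reusing the beamsearch_results from the script at the top of the thread:

# compute_transition_beam_scores returns, for each returned sequence, the
# log-probability of each generated token given its prefix:
#     log P(Wt | W1, ..., Wt-1)
# so summing over the time dimension gives the sequence log-probability:
#     log P(W) = sum_t log P(Wt | W<t)
trs_bs = model.compute_transition_beam_scores(
    sequences=beamsearch_results.sequences,
    scores=beamsearch_results.scores,
    beam_indices=beamsearch_results.beam_indices,
)
seq_log_probs = torch.sum(trs_bs, dim=1)
seq_probs = torch.exp(seq_log_probs)  # P(W) for each returned sequence

# Note: generate() runs under torch.no_grad(), so these scores carry no gradient.
# For REINFORCE one typically re-scores the generated sequences with a regular
# forward pass (e.g. passing the generated ids as labels) to obtain
# differentiable log-probabilities.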

PastelBelem8 commented 1 year ago

@superhero-7 the first token id in every beam search output is always 0 because the model prepends a pad token (T5's decoder start token) to every sequence produced by the generate method.
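
A quick way to check this with the script from the top of the thread:

# T5 uses its pad token as the decoder start token, so every generated
# sequence begins with token id 0.
print(model.config.decoder_start_token_id)  # 0
print(tokenizer.pad_token_id)               # 0
print(beamsearch_results.sequences[:, 0])   # tensor([0, 0]), one per returned sequence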