PastelBelem8 closed this issue 2 years ago.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey @PastelBelem8,
Sorry to answer so late. We've added some tests to make sure the transition probabilities work correctly. Could you take a look at this answer: https://github.com/huggingface/transformers/issues/16413#issuecomment-1088907369 and see whether it applies to your use case?
I find the code above from @PastelBelem8 works if I set the length_penalty to 0. However, if I change the prompt so that the model produces a completion that has fewer tokens than the max_length, then the sequences_scores and the output of compute_transition_beam_scores are very different again. @patrickvonplaten, any thoughts on what might be going on there? Thanks in advance for your help!
See the code in this colab: https://colab.research.google.com/drive/1KAc_Mk2k8qiiqKgXfAcggJWfaBEvvzox#scrollTo=6in7zwm7Dqxf
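For reference, here is a minimal sketch of the kind of check the colab performs; the model checkpoint, prompt, and generation arguments below are my assumptions, not the exact notebook code:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# A prompt whose completion emits EOS well before max_length is reached.
inputs = tokenizer("translate English to German: Hi.", return_tensors="pt")

out = model.generate(
    **inputs,
    max_length=20,
    num_beams=4,
    length_penalty=0.0,          # with this setting the scores matched in the earlier example
    return_dict_in_generate=True,
    output_scores=True,
)

# Per-step log-probabilities of the chosen tokens, re-aligned to the returned beams
# (API of the transformers version discussed here; newer releases expose
# compute_transition_scores instead).
transition_scores = model.compute_transition_beam_scores(
    out.sequences, out.scores, out.beam_indices,
    eos_token_id=model.config.eos_token_id,
)

# These two disagree when the generated sequence is shorter than max_length.
print(out.sequences_scores)
print(transition_scores.sum(dim=-1))
```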
Thanks for the script @hacobe,
I can reproduce the problem. It looks like an error in Transformers. Investigating now
Hey @hacobe,
The problem is actually much more difficult than I thought. It'll need a bigger refactor of the beam scorer. I'll keep you updated!
Got it. Thanks for looking into it!
@PastelBelem8 @hacobe @patrickvonplaten I am confused about a few things (I also work with a T5 model on a vision-language task):
I would appreciate it very much if anyone could give me any advice!
@superhero-7 the first token id in every beam search output is always 0 because the model prepends a pad token (T5's decoder start token) to every possible continuation of the string you give as input to the generate method.
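A quick way to check this, assuming a standard T5 checkpoint:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 starts decoding from the pad token, so both ids are 0.
print(model.config.decoder_start_token_id, model.config.pad_token_id)  # 0 0

inputs = tokenizer("summarize: a short dummy input", return_tensors="pt")
out = model.generate(**inputs, num_beams=2, max_length=10)
print(out[:, 0])  # the first token of every returned sequence is 0
```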
Environment info
transformers version: 4.16.2
Who can help
I'm tagging @patrickvonplaten because he has recently worked on this: #14654
Information
The model is a T5. I'm using its conditional generation capabilities and testing the beam search decoding outputs.
The problem arises when using:
The tasks I am working on is:
Purpose: Create two simple dummy examples and test whether the transition probability scores are the same as the sequence probability scores. The goal is to understand what the sequence scores represent (are they unnormalized, or normalized by length?). From the PR #14654 discussion, I had the impression that it would be enough to sum the transition probabilities, but the two do not seem to match. Would you please help me understand?
To reproduce
Steps to reproduce the behavior:
1. Load the model (model) and generate the output scores using beam search.
2. Compute the per-step transition scores with model.compute_transition_beam_scores.
3. Compare them against BeamSearchEncoderDecoderOutput.sequences_scores.
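The code blocks from the original post did not survive in this thread; the following is a rough sketch of the setup described above, where the model checkpoint, prompts, and generation arguments are my own assumptions:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Two dummy inputs (chosen for illustration only).
inputs = tokenizer(
    ["translate English to German: I like apples.",
     "translate English to German: I like oranges."],
    return_tensors="pt",
    padding=True,
)

beamsearch_results = model.generate(
    **inputs,
    max_length=15,
    num_beams=3,
    return_dict_in_generate=True,
    output_scores=True,
)

# Per-step transition scores of the selected tokens (API of the transformers
# version discussed here; newer releases expose compute_transition_scores instead).
transition_scores = model.compute_transition_beam_scores(
    beamsearch_results.sequences,
    beamsearch_results.scores,
    beamsearch_results.beam_indices,
    eos_token_id=model.config.eos_token_id,
)

print(transition_scores.sum(dim=-1))        # summed log-probabilities per sequence
print(beamsearch_results.sequences_scores)  # scores reported by beam search
```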
From the example above I deduced that, in order to obtain the same scores as those computed in sequences_scores, it would suffice to divide the summed transition scores by the length of the sentences. In this case it seems to work nicely because both sequences have the same length. So I tried a different example, one that would cause beamsearch_results.sequences to be different:
The output of beamsearch_results.sequences for the above example is:
The difference between Sum/length and Sum/rel_length is that in the former I divide by the maximum length of the generated sentences, whereas the latter is divided by the number of non-zero transition probabilities. We can see that in the latter case (i.e., when dividing by the relative length) only the first example's score matches the original beamsearch_results.sequences_scores.
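Using the names from the sketch above, the two normalizations would look roughly like this (again an assumption about the original code, not a copy of it):

```python
summed = transition_scores.sum(dim=-1)

# "Sum/length": divide by the maximum generated length, identical for every sequence.
max_len = transition_scores.shape[-1]
print(summed / max_len)

# "Sum/rel_length": divide by the number of non-zero transition scores,
# i.e. the effective length of each individual sequence.
rel_len = (transition_scores != 0.0).sum(dim=-1)
print(summed / rel_len)

# Scores reported by beam search itself (length penalty is applied internally).
print(beamsearch_results.sequences_scores)
```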
Could you please help me better understand the computation of these scores and their connection with the sequences_scores? In particular, are the individual scores returned by compute_transition_beam_scores length-normalized? Do they represent the joint probability, or the individual per-token probabilities? Are we supposed to take the initial padding token into account when computing the scores? Thanks in advance for your time!