Information

The question or comment is about chapter: Chapter 5

Question or comment
In the Beam Search Decoding section of chapter 5 (page 132), the authors include the following function for calculating a sequence log-probability:
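The function is roughly the following (reproducing it from memory, so the exact code in the book may differ in small details):

import torch
import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
    # Log-probability of each label token under the model's predicted distribution
    logp = F.log_softmax(logits, dim=-1)
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        # Align logits and labels: the logits at position i predict the token at position i + 1
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()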
Where labels correspond to output_greedy, calculated as:

max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length,
                               do_sample=False)
print(tokenizer.decode(output_greedy.squeeze()))

While the alignment between logits and labels is clear (i.e., labels are shifted by one), it is not clear to me why, when calculating the sequence log probability, we slice the log_probs tensor using input_len instead of input_len - 1:

seq_log_prob = torch.sum(log_probs[:, input_len:])
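For context, input_len is the length of the tokenized prompt (47 tokens in this example). If I remember the book correctly, the function is called along these lines:

# input_len = number of prompt tokens, so the prompt itself is excluded from the score
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))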
Let me walk you through the example in the book in detail:

input_ids is a 47-token sequence.

output_greedy is a 128-token sequence (the original 47 input tokens plus 81 model-generated tokens; max_length was set to 128).

When doing a forward pass through the model, i.e., outputs = model(output_greedy) (output_greedy being what is passed as labels to the sequence_logprob function), the output includes logits whose dimension 1 has size 128 (our max_length). We know that the logits at index 0 actually refer to what would be the second token in our output sequence. Python's 0-indexing is confusing here, but we can say that:

the logits at index 0 in outputs.logits correspond to the logits of the second word in our sequence

In other words, the logit at index i corresponds to the (i + 2)th word in the sequence (where the sequence is 1-indexed).

Following the same reasoning, we know that the first truly model-generated token (i.e., a token not present in the initial prompt) is the 48th word in the sequence, i.e., the (48 - 2)th logit in outputs.logits. We know that the model-generated text starts with "The researchers, from the University of California", and we can verify it by running:

tokenizer.decode(torch.argmax(outputs.logits[0, 46:52], dim=-1))
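To make the off-by-one explicit, here is a small check (reusing the log_probs_from_logits helper from the function above, and the variables from the snippets in this post):

# Align logits with the tokens they predict: position i predicts token i + 1
log_probs = log_probs_from_logits(outputs.logits[:, :-1, :], output_greedy[:, 1:])
print(log_probs.shape)  # torch.Size([1, 127])
# log_probs[:, 46] -- i.e., index input_len - 1 -- holds the log probability of the
# 48th token, the first model-generated one; log_probs[:, input_len:] skips it.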
In other words, the delta between logit indices and word positions in the output sequence is 2 because of:

Python's 0-indexing (word positions above are 1-indexed, while logit indices are 0-indexed);
the one-position shift between logits and labels (the logits at index i predict the token at index i + 1).

As a consequence, the sequence_logprob function should be changed as follows:
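That is, slicing from input_len - 1 rather than input_len, so that the first generated token is included in the score:

# include the log probability of the 48th token (the first model-generated one)
seq_log_prob = torch.sum(log_probs[:, input_len - 1:])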
Am I missing something?