Hi,
Encoder-decoder models like T5 and BART create the decoder_input_ids automatically based on the labels you provide. So you should only provide the encoder inputs (input_ids, attention_mask, possibly token_type_ids) and the decoder targets (labels). As you can see here, BartForConditionalGeneration will automatically create the decoder_input_ids by shifting the labels one position to the right.
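To make that shifting concrete, here is a simplified sketch of what such a right-shift looks like (an illustration, not the exact library code): the decoder_start_token_id is prepended, the last label token is dropped, and any -100 entries are mapped back to the pad token so the decoder never sees them as inputs.
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    # start from a tensor of the same shape as the labels
    decoder_input_ids = labels.new_zeros(labels.shape)
    # shift everything one position to the right
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    # the first decoder input is always the decoder_start_token_id (</s> for BART)
    decoder_input_ids[:, 0] = decoder_start_token_id
    # any -100 used to mask the loss is replaced by the pad token for the decoder inputs
    decoder_input_ids.masked_fill_(decoder_input_ids == -100, pad_token_id)
    return decoder_input_ids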
Let's consider what happens with a small example. Suppose we want to train BART for translation, and we have:
input sequence: "HuggingFace is a company based in New York and Paris."
target sequence: "HuggingFace est une société basée à New York et à Paris."
=> to prepare this example for BartForConditionalGeneration, we can use BartTokenizer. We can prepare the input for BART by encoding the input sequence, like so:
from transformers import BartTokenizer
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
input_sequence = "HuggingFace is a company based in New York and Paris."
encoding = tokenizer(input_sequence, return_tensors="pt")
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
To create the labels, we can also use BartTokenizer. The labels are just the input_ids from the encoding of the target sequence:
target_sequence = "HuggingFace est une société basée à New York et à Paris."
target_encoding = tokenizer(target_sequence, return_tensors="pt")
labels = target_encoding.input_ids
Now we have everything we need to do a forward pass and obtain a loss, like so:
from transformers import BartForConditionalGeneration
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss
print(loss.item())
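For completeness, here is a minimal sketch of how this loss could be used in a single training step; the optimizer choice and learning rate are illustrative assumptions, not something prescribed in this thread:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)  # illustrative hyperparameters
model.train()
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss
loss.backward()        # backpropagate the loss
optimizer.step()       # update the model weights
optimizer.zero_grad()  # reset gradients for the next step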
We can also check what these labels look like in text, by decoding them:
for id in labels.squeeze().tolist():
    print(id, tokenizer.decode([id]))
# this prints:
0 <s>
40710 Hug
3923 ging
34892 Face
3304 est
12515 une
17380 soc
118 i
10221 ét
1140 é
11909 bas
9703 ée
6534 à
188 New
469 York
4400 et
6534 à
2201 Paris
4 .
2 </s>
What happens internally is that, first, the encoded input sequence (i.e. the input_ids and attention_mask) is forwarded through the encoder of BART. The encoder will output a tensor of shape (batch_size, sequence_length, hidden_size). In this case, we only have a single example, which means that the batch size is 1; the sequence length (which is the number of tokens) is equal to len(input_ids) = len(attention_mask), which in this case is 15 tokens; and the hidden size of BART-large is 1024 (BART-base would be 768). So the encoder will output a tensor of shape (1, 15, 1024). This tensor is often referred to as the "last hidden states", as these are the hidden representations for all tokens from the last layer of the encoder.
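If you want to check this yourself, you can run the encoder on its own (BartForConditionalGeneration exposes it via get_encoder()); the shape printed below assumes the single example from above:
encoder = model.get_encoder()
encoder_outputs = encoder(input_ids=input_ids, attention_mask=attention_mask)
print(encoder_outputs.last_hidden_state.shape)  # torch.Size([1, 15, 1024]) for this example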
Next, we have the decoder. The decoder needs to spit out the desired input_ids of the target sequence (in other words, the labels). The decoder of BART (and T5) is autoregressive, which is a fancy term to say "from left to right". So what happens is, we provide the first decoder_input_id to the decoder (which is the decoder_start_token_id, which for BART is equal to the </s> token). Then, the decoder outputs a probability distribution over all possible input_ids, and this is compared to the first label (which will be the first input_id of the labels we created, i.e. the <s> token). Next, we provide the first two decoder input ids, i.e. </s> <s>, to the decoder, and then it needs to spit out the first two labels, i.e. <s> Hug. Next, we provide the first three decoder input ids, i.e. </s> <s> Hug, to the decoder, and then it needs to spit out the first three labels, i.e. <s> Hug ging, and so on.
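To make this alignment visible, one option is to build the shifted decoder inputs explicitly, e.g. with the simplified shift_tokens_right sketch from earlier in this post, and print them next to the labels:
decoder_input_ids = shift_tokens_right(labels, model.config.pad_token_id, model.config.decoder_start_token_id)
for decoder_input_id, label in zip(decoder_input_ids.squeeze().tolist(), labels.squeeze().tolist()):
    # each decoder input token is trained to predict the label token at the same position
    print(tokenizer.decode([decoder_input_id]), "->", tokenizer.decode([label]))
# prints pairs like: </s> -> <s>, <s> -> Hug, Hug -> ging, ...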
NOTE: this was just a single example. In practice, deep learning models are always trained in batches. As the input_ids and labels have different lengths for each example in the batch, we use padding and truncation to make sure they are all of the same length. One typically defines a max_source_length and max_target_length as hyperparameters, and then prepares all the data like so:
# encode the inputs
encoding = tokenizer(text, padding="max_length", max_length=max_source_length, truncation=True, return_tensors="pt")
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
# encode the targets (note: the target text, not the source text)
target_encoding = tokenizer(target_text, padding="max_length", max_length=max_target_length, truncation=True, return_tensors="pt")
labels = target_encoding.input_ids
An additional thing to keep in mind is to replace the padding tokens of the labels by -100, such that they are not taken into account by the loss function. For that, I use the following code (assuming the labels of a batch are still lists rather than PyTorch tensors):
labels_with_ignore_index = []
for labels_example in labels:
    labels_example = [label if label != tokenizer.pad_token_id else -100 for label in labels_example]
    labels_with_ignore_index.append(labels_example)
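If the labels are already a PyTorch tensor (as in the return_tensors="pt" snippets above), the same replacement can be done in a single vectorized step; this is just an equivalent alternative:
labels = target_encoding.input_ids
# set every pad token in the labels to -100 so the loss function ignores those positions
labels[labels == tokenizer.pad_token_id] = -100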
Regarding your third question, yes, during inference one should use model.generate instead of model.forward. Check out this blog post to learn all the details about generating text after training your model.
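For example, a minimal sketch reusing the input_ids and attention_mask from the translation example above (the max_length value is just an illustrative choice):
generated_ids = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=64)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])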
I really appreciate your help. About the last question, I think I can get the desired last decoder hidden states based on the output scores.
Thank you so much and have a nice day.
@NielsRogge
Hi Niels,
I'm new to NLP and was reading this to try and further understand the BART model for seq2seq summarization. As you said above, the encoder outputs a tensor of shape (batch_size, sequence_length, hidden_size), and the decoder then generates probabilities over all the input_ids. The decoder now outputs the softmax result, in the shape of (batch_size, sequence_length, hidden_size). However, as I'm trying to do summarization, I want to convert this result into text. I understand greedy and beam search, but am unsure of how to get to the generated text from the decoder's last_hidden_state.
Any help would be much appreciated. Thanks in advance.
The decoder of BartModel outputs a tensor of shape (batch_size, sequence_length, hidden_size), indeed (no softmax is involved). Next, the language modeling head that BartForConditionalGeneration places on top of the decoder will transform this into a tensor (usually called logits) of shape (batch_size, sequence_length, vocab_size).
To know which tokens BART predicts, you can apply an argmax on the last dimension, i.e. logits.argmax(dim=-1). This will give you a new tensor of shape (batch_size, sequence_length), containing the token IDs as predicted by BART.
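As a small sketch (this reuses the model, tokenizer, input_ids, attention_mask and labels from the translation example earlier in the thread):
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
logits = outputs.logits                # shape (batch_size, sequence_length, vocab_size)
predicted_ids = logits.argmax(dim=-1)  # shape (batch_size, sequence_length)
print(tokenizer.batch_decode(predicted_ids, skip_special_tokens=True)[0])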
However, at inference time, it's recommended to use the generate() method, which will autoregressively (i.e. from left to right) predict token IDs. There are several decoding strategies available, such as greedy decoding, beam search, top-k sampling, etc. Let's take an example:
from transformers import BartTokenizer, BartForConditionalGeneration
tokenizer = BartTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
model = BartForConditionalGeneration.from_pretrained("sshleifer/distilbart-cnn-12-6")
text = """The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."""
# prepare text for model
encoding = tokenizer(text, return_tensors="pt")
# generate IDs autoregressively
predicted_ids = model.generate(**encoding)
# decode IDs back to text
generated_text = tokenizer.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(generated_text)
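generate() also accepts decoding parameters such as num_beams and max_length; the values below are purely illustrative, not recommendations from this thread:
# beam search with 4 beams and a length cap (illustrative values)
predicted_ids = model.generate(**encoding, num_beams=4, max_length=142, early_stopping=True)
print(tokenizer.batch_decode(predicted_ids, skip_special_tokens=True)[0])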
@NielsRogge Yes, that's what I used at the start. The problem lies in the fact that I want to convert my model to ONNX, where the generate function is not available. I guess I will have to write my own greedy decoding method.
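For reference, a rough sketch of such a greedy loop written against the regular PyTorch model could look like this (wiring it up to an exported ONNX session is left out here, and the max_length default is an arbitrary choice):
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, attention_mask, max_length=64):
    # start decoding from the decoder_start_token_id (</s> for BART)
    decoder_input_ids = torch.full(
        (input_ids.shape[0], 1),
        model.config.decoder_start_token_id,
        dtype=torch.long,
        device=input_ids.device,
    )
    for _ in range(max_length):
        # note: this re-runs the encoder at every step; caching is omitted for simplicity
        outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                        decoder_input_ids=decoder_input_ids)
        # greedily pick the most likely next token at the last decoder position
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        if (next_token == model.config.eos_token_id).all():
            break
    return decoder_input_ids

print(tokenizer.batch_decode(greedy_decode(model, **encoding), skip_special_tokens=True)[0])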
We've actually just added an example of converting BART to ONNX, including beam search generation. However, the example doesn't include a README right now; it will be added soon.
@NielsRogge I don't want to revive an old thread, but this seems like a relevant place to follow up with this question. I'm having trouble extending the decoder output beyond 1024 tokens (the max length for BART), even though the model is supposed to generate tokens autoregressively and so should not have a limit.
Does this have to do with the (batch_size, sequence_length, hidden_size) output shape including the sequence_length? Does it have to do with the positional embeddings being limited?
Hi, I want to conduct a Grammatical Error Correction task with BART, which takes corrupted sentences as inputs and produces corrected sentences as outputs. The model I'm using is BartForConditionalGeneration. I want to ask several things.
1. What is the difference between decoder_input_ids and labels? The doc says that, when handling seq2seq problems such as translation or summarization, decoder_input_ids should be given, otherwise the model just puts the shifted encoder input into the decoder, which is not the desired process. However, there is another argument labels, and I think I should give the answer sequence as labels to get the loss. And according to here, I assume that BART takes the answer outputs as labels. Then what is decoder_input_ids? Is it not necessary when using the model.forward function to train the model?
2. Should I pad the decoder inputs with -100? According to the doc, to make the loss function ignore the unwanted locations, they should be set to -100. But I want to make it ignore the pad token. Should I just set the pad token as -100, or is there any way to make the loss function ignore the value I set?
3. Unlike training, inference does not require the answers. However, like I mentioned above, if the model is not given decoder_input_ids or labels, then the model puts the shifted inputs into the decoder. But this is not what we want. The decoder should start only with the start token at first. Then is it right to use the model.generate function, not model.forward, without any decoder inputs given? I think I should use model.generate when inferencing, but I want to make sure that model.generate(input_ids=input_ids) works as I described, i.e. that it starts from only the start token. In fact, like the image below, it seems the input ids might just be copied, judging by the values. So I'm worried that the decoder just took the input ids. According to this, BART was pretrained to use the EOS token as the start token of the decoder. I don't know why it should be, but anyway, like the above image, we can see that all outputs start with both the EOS and BOS tokens. Then may I assume that the model puts both the EOS and BOS tokens as the starting sign?
4. The last question is about beam search. I want to get the last hidden state from the decoder to conduct multi-task learning combining LM and sentence classification. But when using beam search, the shape of one tensor from decoder_hidden_states becomes (batch_size*num_beams*num_return_sequences, generated_length, hidden_size). Then how can we know which one comes from the best result?
Thank you for reading these long questions.