huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Questions on generating using encoder-decoder models #13213

Closed devjwsong closed 3 years ago

devjwsong commented 3 years ago

Hi, I want to conduct a Grammatical Error Correction task with BART, which takes corrupted sentences as inputs and produces corrected sentences as outputs. The model I'm using is BartForConditionalGeneration.

I want to ask several things.

  1. What is the difference between decoder_input_ids and labels? The doc says that when handling seq2seq problems such as translation or summarization, decoder_input_ids should be given; otherwise the model just puts the shifted encoder input into the decoder, which is not the desired behavior. However, there is another argument, labels, and I think I should give the answer sequence as labels to get the loss. And according to here, I assume that BART takes the answer outputs as labels. Then what is decoder_input_ids for? Is it not necessary when using the model.forward function to train the model?

  2. Should I pad the decoder inputs with -100? According to the doc, to make the loss function ignore the unwanted positions, they should be set to -100. But I want it to ignore the pad token. Should I just replace the pad token with -100, or is there a way to make the loss function ignore a value I choose?

  3. Unlike training, inference does not require the answers. However, as I mentioned above, if the model is given neither decoder_input_ids nor labels, it puts the shifted encoder inputs into the decoder, which is not what we want. The decoder should start with only the start token at first. So is it right to use model.generate, not model.forward, without giving any decoder inputs? I think I should use model.generate for inference, but I want to make sure that model.generate(input_ids=input_ids) works as I described, i.e. the decoder starts from only the start token. In fact, as in the image below, it seems the input ids might just have been copied, judging by the values, so I'm worried that the decoder simply took the input ids. [image]

  4. According to this, BART was pretrained to use the EOS token as the start token of the decoder. I don't know why that should be, but in any case, as the image above shows, all outputs start with both the EOS and the BOS token. So may I assume that the model uses both the EOS and BOS tokens as the starting sign?

  5. The last question is about beam search. I want to get the last hidden state from the decoder to conduct multi-task learning that combines LM and sentence classification. But when using beam search, the shape of one tensor from decoder_hidden_states becomes (batch_size*num_beams*num_return_sequences, generated_length, hidden_size). How can we know which one comes from the best result?

Thank you for reading these long questions.

NielsRogge commented 3 years ago

Hi,

encoder-decoder models like T5 and BART create the decoder_input_ids automatically based on the labels you provide. So you should only provide the encoder inputs (input_ids, attention_mask, possibly token_type_ids) and the decoder targets (labels). As you can see here, BartForConditionalGeneration will automatically create the decoder_input_ids by shifting the labels one position to the right.
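To make that shift concrete, here is a rough illustration (not the library's exact code; the library version additionally replaces any -100 in the labels by the pad token id):

import torch

# toy labels, just for illustration: <s> Hug ging Face </s>
labels = torch.tensor([[0, 40710, 3923, 34892, 2]])
decoder_start_token_id = 2  # for BART, this is the </s> token

decoder_input_ids = labels.new_zeros(labels.shape)
decoder_input_ids[:, 1:] = labels[:, :-1].clone()
decoder_input_ids[:, 0] = decoder_start_token_id
print(decoder_input_ids)  # tensor([[    2,     0, 40710,  3923, 34892]])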

Let's consider what happens with a small example. Suppose we want to train BART for translation, and we have:

- an input sequence: "HuggingFace is a company based in New York and Paris."
- a target sequence: "HuggingFace est une société basée à New York et à Paris."

=> to prepare this example for BartForConditionalGeneration, we can use BartTokenizer. We can prepare the input for BART by encoding the input sequence, like so:

from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

input_sequence = "HuggingFace is a company based in New York and Paris."
encoding = tokenizer(input_sequence, return_tensors="pt")
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask

To create the labels, we can also use BartTokenizer. The labels are just the input_ids from the encoding of the target sequence:

target_sequence = "HuggingFace est une société basée à New York et à Paris."
target_encoding = tokenizer(target_sequence, return_tensors="pt")
labels = target_encoding.input_ids

Now we have everything we need to do a forward pass and obtain a loss, like so:

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss
print(loss.item())

We can also check what these labels look like in text by decoding them:

for id in labels.squeeze().tolist():
  print(id, tokenizer.decode([id]))

# this prints:
0 <s>
40710 Hug
3923 ging
34892 Face
3304  est
12515  une
17380  soc
118 i
10221 ét
1140 é
11909  bas
9703 ée
6534  à
188  New
469  York
4400  et
6534  à
2201  Paris
4 .
2 </s>

What happens internally is that first the encoded input sequence (i.e. the input_ids and attention_mask) is forwarded through the encoder of BART. The encoder will output a tensor of shape (batch_size, sequence_length, hidden_size). In this case, we only have a single example, which means that the batch size is 1; the sequence length (which is the number of tokens) is equal to len(input_ids) = len(attention_mask), which in this case is 15 tokens; and the hidden size of BART-large is 1024 (BART-base would be 768). So the encoder will output a tensor of shape (1, 15, 1024). This tensor is often referred to as the "last hidden states", as these are the hidden representations for all tokens from the last layer of the encoder.
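A quick way to check these shapes yourself (a small sketch, reusing the model, input_ids and attention_mask from the snippets above):

encoder_outputs = model.get_encoder()(input_ids=input_ids, attention_mask=attention_mask)
print(encoder_outputs.last_hidden_state.shape)  # torch.Size([1, 15, 1024]) for this example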

Next, we have the decoder. The decoder needs to spit out the desired input_ids of the target sequence (in other words, the labels). The decoder of BART (and T5) is autoregressive, which is a fancy term to say "from left to right". So what happens is, we provide the first decoder_input_id to the decoder (which is the decoder_start_token_id, which for BART is equal to the </s> token). Then, the decoder outputs a probability distribution over all possible input_ids, and this is compared to the first label (which will be the first input_id of the labels we created, i.e. the <s> token). Next, we provide the first two decoder input ids, i.e. </s> <s>, to the decoder, and it needs to spit out the first two labels, i.e. <s> Hug. Next, we provide the first three decoder input ids, i.e. </s> <s> Hug, to the decoder, and it needs to spit out the first three labels, i.e. <s> Hug ging, and so on.
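In practice this happens in a single teacher-forced forward pass rather than token by token. As a rough sketch of that internal computation (reusing input_ids, attention_mask, labels and model from the example above; not the library's exact code):

import torch.nn.functional as F

# build the decoder inputs by shifting the labels one position to the right
decoder_input_ids = labels.new_zeros(labels.shape)
decoder_input_ids[:, 1:] = labels[:, :-1].clone()
decoder_input_ids[:, 0] = model.config.decoder_start_token_id

outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                decoder_input_ids=decoder_input_ids)
# outputs.logits has shape (batch_size, target_length, vocab_size)
loss = F.cross_entropy(outputs.logits.reshape(-1, model.config.vocab_size),
                       labels.reshape(-1))
# this should match the loss returned when passing labels directly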

NOTE: this was just a single example. In practice, deep learning models are always trained in batches. As the input_ids and labels have different lengths for each example in the batch, we use padding and truncation to make sure they are all of the same length. One typically defines a max_source_length and max_target_length as hyperparameters, and then prepares all data like so:

# encode the inputs
encoding = tokenizer(text, padding="max_length", max_length=max_source_length, truncation=True, return_tensors="pt")
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask

# encode the labels (target_text is assumed to hold the target sequences)
target_encoding = tokenizer(target_text, padding="max_length", max_length=max_target_length, truncation=True, return_tensors="pt")
labels = target_encoding.input_ids

An additional thing to keep in mind is to replace padding tokens of the labels by -100, such that they are not taken into account by the loss function. For that, I use the following code (assuming the labels of a batch are still lists rather than PyTorch tensors):

labels_with_ignore_index = []
for labels_example in labels:
    labels_example = [label if label != tokenizer.pad_token_id else -100 for label in labels_example]
    labels_with_ignore_index.append(labels_example)
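After this, you can turn them back into a PyTorch tensor, e.g. labels = torch.tensor(labels_with_ignore_index).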

Regarding your third question, yes, during inference one should use model.generate instead of model.forward. Check out this blog post to know all the details about generating after training your model.
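For example, something along these lines (just a sketch, reusing the fine-tuned model and the tokenizer from above; the max_length and num_beams values are arbitrary):

generated_ids = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=64, num_beams=4)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))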

devjwsong commented 3 years ago

I really appreciate your help. About the last question, I think I can get the desired last decoder hidden states based on the output scores.
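For reference, one approach I'm considering (a rough sketch, reusing the model and encoder inputs from the examples above) is to generate first and then run a single forward pass with the chosen sequence to get its decoder hidden states:

generated_ids = model.generate(input_ids=input_ids, attention_mask=attention_mask, num_beams=4)
outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                decoder_input_ids=generated_ids, output_hidden_states=True)
last_decoder_hidden_states = outputs.decoder_hidden_states[-1]  # (batch_size, generated_length, hidden_size)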

Thank you so much and have a nice day.

ZiyueWangUoB commented 3 years ago

@NielsRogge

Hi Niels,

I'm new to NLP and was reading this to try to further understand the BART model for seq2seq summarization. As you said above, the encoder outputs a tensor of shape (batch_size, sequence_length, hidden_size), and the decoder then generates probabilities over all the input_ids. The decoder then outputs the softmax result, in the shape of (batch_size, sequence_length, hidden_size). However, as I'm trying to do summarization, I want to convert this result into text. I understand greedy and beam search, but am unsure how to get the generated text from the decoder's last_hidden_state.

Any help would be much appreciated. Thanks in advance.

NielsRogge commented 3 years ago

The decoder of BartModel outputs a tensor of shape (batch_size, sequence_length, hidden_size), indeed (no softmax is involved). Next, the language modeling head that BartForConditionalGeneration places on top of the decoder will transform this into a tensor (usually called logits) of shape (batch_size, sequence_length, vocab_size).

To know which tokens BART predicts, you can apply an argmax on the last dimension, i.e. logits.argmax(dim=-1). This will give you a new tensor of shape (batch_size, sequence_length), containing the token IDs as predicted by BART.
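For instance, a minimal sketch of that step (assuming outputs comes from a forward pass of BartForConditionalGeneration, as in the training example earlier in this thread):

predicted_ids = outputs.logits.argmax(dim=-1)  # (batch_size, sequence_length)
print(tokenizer.batch_decode(predicted_ids, skip_special_tokens=True))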

However, at inference time, it's recommended to use the generate() method, which will autoregressively (i.e. from left to right) predict token ids. There are several decoding strategies available, such as greedy decoding, beam search, top-k sampling, etc. Let's take an example:

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
model = BartForConditionalGeneration.from_pretrained("sshleifer/distilbart-cnn-12-6")

text = """The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."""

# prepare text for model
encoding = tokenizer(text, return_tensors="pt")

# generate IDs autoregressively
predicted_ids = model.generate(**encoding)

# decode IDs back to text
generated_text = tokenizer.batch_decode(predicted_ids)[0]
print(generated_text)

ZiyueWangUoB commented 3 years ago

@NielsRogge Yes that's what I used at the start. The problem lies in the fact that I want to convert my model to onnx, where the generate function is not available. I guess I will have to write my own greedy decoding method.
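Something along these lines, perhaps (a rough, untested sketch of greedy decoding with the PyTorch model for clarity; with ONNX the forward pass would be replaced by a call to the exported session):

import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
model = BartForConditionalGeneration.from_pretrained("sshleifer/distilbart-cnn-12-6")

encoding = tokenizer("Some long article to summarize ...", return_tensors="pt")
# start the decoder with only the decoder start token
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

for _ in range(60):  # arbitrary cap on the number of generated tokens
    logits = model(**encoding, decoder_input_ids=decoder_input_ids).logits
    # greedily pick the most likely next token
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
    if next_token.item() == model.config.eos_token_id:
        break

print(tokenizer.batch_decode(decoder_input_ids, skip_special_tokens=True)[0])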

NielsRogge commented 3 years ago

We've actually just added an example of converting BART to ONNX, including beam search generation. However, the example doesn't include a README right now; it will be added soon.

vsocrates commented 6 months ago

@NielsRogge I don't want to revive an old thread, but this seems like a relevant place to follow up with this question. I'm having trouble extending the decoder output beyond 1024 tokens (the max length for BART), even though the model is supposed to generate tokens autoregressively and so should not have such a limit.

Does this have to do with the (batch_size, sequence_length, hidden_size) output shape including the sequence_length? Does it have to do with positional embeddings being limited?