dpernes opened 8 months ago
cc @zucchini-nlp
@dpernes Hi, if you want to specify a different `decoder_start_token_id` for each element, you can do it by passing a tensor of shape `(batch_size, seq_len)`. In your case, adding this line before `generate` is called will solve the issue:

```python
decoder_start_token_id = decoder_start_token_id.unsqueeze(1)  # shape (num_target_languages, 1)
```
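For illustration, here is a minimal sketch of the full batched call under that fix. The checkpoint and token ids are placeholders, not values from this thread:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/mt5-small"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Same source sentence, generated into several target languages at once.
inputs = tokenizer(["How are you?"] * 3, return_tensors="pt", padding=True)

# One start token per batch element (made-up ids for illustration).
decoder_start_token_id = torch.tensor([250004, 250005, 250006])
decoder_start_token_id = decoder_start_token_id.unsqueeze(1)  # shape (num_target_languages, 1)

outputs = model.generate(**inputs, decoder_start_token_id=decoder_start_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```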
Great, thank you @zucchini-nlp! This behavior is not documented, though:
```
decoder_start_token_id (`int`, *optional*):
    If an encoder-decoder model starts decoding with a different token than *bos*, the id of that token.
```
You may want to change it to something like:
```
decoder_start_token_id (`Union[int, torch.LongTensor]`, *optional*):
    If an encoder-decoder model starts decoding with a different token than *bos*, the id of that token. Optionally, use a `torch.LongTensor` of shape `(batch_size, sequence_length)` to specify a prompt for the decoder.
```
But why isn't this the same as passing `decoder_input_ids` to `generate`? I tried passing the same tensor as `decoder_input_ids` instead of `decoder_start_token_id` and the results do not match.
Thanks, I added a PR extending the docs.
Regarding your question, there is a subtle difference between them. The `decoder_start_token_id` is used as the very first token in generation, the BOS token in most cases. But `decoder_input_ids` are used to start/continue the sentence from them. In most cases you do not provide `decoder_input_ids` yourself when calling `generate`, so they will be filled with the `decoder_start_token_id` to start generation from BOS.

The general format is `[decoder_start_token_id, decoder_input_ids]`, and `generate` automatically fills in the `decoder_start_token_id` from the config if you do not provide it yourself.
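To make the difference concrete, a short sketch, assuming `model`, `inputs`, and per-example `lang_ids` of shape `(batch_size, 1)` as in the snippet above:

```python
# Sequences start directly with the given tokens: [lang_id, <generated>...]
out_a = model.generate(**inputs, decoder_start_token_id=lang_ids)

# generate() prepends the config's start token, so sequences become
# [config.decoder_start_token_id, lang_id, <generated>...], which conditions
# the decoder differently and therefore yields different results.
out_b = model.generate(**inputs, decoder_input_ids=lang_ids)
```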
Hi,
Is there any way to specify `decoder_start_token_id` during training as well? Like:

```python
outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    labels=batch["labels"],
    decoder_start_token_id=decoder_start_token_id,
)
loss = outputs.loss
```
Each batch may require a different `decoder_start_token_id` during training, because each batch has a specific input language and output language, and these change from batch to batch. Modifying `model.config.decoder_start_token_id` for each batch doesn't seem to be a good approach. Specifically, it seems to cause a lot of inconsistency when using Accelerator with DeepSpeed.
Hey @tehranixyz, you do not need to specify `decoder_start_token_id` while training. All you need is to prepare the `decoder_input_ids` and pass them to the forward. We use the start token from the model config only when we do not find `decoder_input_ids` from the user (see code snippet for preparing decoder input ids from labels).
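For reference, that label-shifting logic looks roughly like the `shift_tokens_right` helper defined for many seq2seq models in transformers (a sketch; exact details vary by model):

```python
import torch

def shift_tokens_right(labels: torch.Tensor, pad_token_id: int, decoder_start_token_id: int) -> torch.Tensor:
    # Shift labels one position to the right and prepend the start token.
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    # Replace the -100 ignore-index used in labels with the pad token.
    decoder_input_ids.masked_fill_(decoder_input_ids == -100, pad_token_id)
    return decoder_input_ids
```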
Gotcha!
I was a bit confused by the warning saying

```
The decoder_input_ids are now created based on the "labels", no need to pass them yourself anymore.
```

when using `EncoderDecoderModel`. So in my case, I guess, as you said, I have to prepare `decoder_input_ids` myself by shifting labels and adding the appropriate start token at the beginning.
Many thanks!
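Putting the pieces together, a hedged sketch of that per-batch preparation, using the `shift_tokens_right` helper above and a hypothetical `lang_code_to_token_id` lookup (not a real transformers API):

```python
# Hypothetical: map this batch's target language to its start token id.
start_id = lang_code_to_token_id[batch["target_lang"]]

decoder_input_ids = shift_tokens_right(
    batch["labels"], tokenizer.pad_token_id, decoder_start_token_id=start_id
)
outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    decoder_input_ids=decoder_input_ids,
    labels=batch["labels"],  # loss is still computed against the unshifted labels
)
loss = outputs.loss
```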
Feature request

@gante The `generate` function has a `decoder_start_token_id` argument that allows the specification of the decoder start token when generating from an encoder-decoder model (e.g. mT5). Currently, `decoder_start_token_id` must be an integer, which means that the same start token is used for all elements in the batch. I request that you allow the specification of different start tokens for each element of the batch. For this purpose, `decoder_start_token_id` must be a tensor with shape `(batch_size,)`.

Motivation

Some multilingual encoder-decoder models use the `decoder_start_token_id` to indicate the target language. Thus, this change would allow generation into multiple target languages in parallel, as illustrated in the code below.

Your contribution