This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey @LarsHill,
Thanks a lot for the feedback and I'm very sorry to be so late to answer here.
1. Very good point! It's a tricky thing to solve, because this use-case is quite edge-casy for generate(): it depends on the model-specific config.encoder variable, so we quickly arrive at 2, 3 if statements.
I think for now the best we can do here is to write some nice docs that explain how to use the Encoder-Decoder architecture. E.g. if we put a warning on the EncoderDecoderModel card and write a nice how-to guide, I'm quite sure that people won't run into this issue too much. Also, it's not really a bug in the code, but something the user should be made aware of IMO.
Ideally, the user passes an attention_mask, in which case there is no need to have the padding token correctly defined.
2. Also a good point, but this behavior really should not occur. The only time there could be padding token ids in the decoder_input_ids tensor is when the user passes those her/himself, and that's quite an edge case for encoder-decoder models. I.e. passing both input_ids (like an article to summarize) and a prompt that should be used to start the summarization with is still quite an edge case for me. In this case, the user should (again) also pass the decoder_attention_mask so that this problem would be solved correctly IMO. Note that no model should generate a pad_token_id -> this would be a modeling bug and then we really can't expect the model to generate anything useful at all.
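To make this edge case concrete, here is a rough sketch of passing both encoder inputs and an explicit decoder prompt to generate(). The checkpoint names are arbitrary placeholders and the warm-started model is not fine-tuned, so this only illustrates which inputs are handed over, not meaningful output:

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

article = tokenizer("A long article that should be summarized ...", return_tensors="pt")
prompt = tokenizer("summary :", add_special_tokens=False, return_tensors="pt")

# Encoder inputs plus an explicit decoder prompt; as suggested above, the
# decoder_attention_mask is passed along with decoder_input_ids.
output_ids = model.generate(
    input_ids=article.input_ids,
    attention_mask=article.attention_mask,
    decoder_input_ids=prompt.input_ids,
    decoder_attention_mask=prompt.attention_mask,
    max_length=32,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```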
3. Good point. I agree that we should probably align the methods, but it's quite difficult now because of possible backward-breaking changes. We've more or less decided not to automatically create the attention_mask whenever a padding token is provided. However, I do agree that in 95% of the cases the padding token should be masked, so it does make sense to create a warning in every model if no attention_mask is provided but the input_ids contain the padding token.
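Such a warning could boil down to a check along these lines (a standalone sketch, not the actual transformers implementation):

```python
import warnings

import torch


def warn_if_pad_without_mask(input_ids: torch.LongTensor, pad_token_id, attention_mask=None):
    # Sketch of the proposed warning: the inputs contain the pad token, but no
    # attention mask was passed, so the padding positions would be attended to.
    if attention_mask is None and pad_token_id is not None and (input_ids == pad_token_id).any():
        warnings.warn(
            "input_ids contain pad_token_id but no attention_mask was provided; "
            "padding positions will be attended to. Pass an attention_mask explicitly."
        )
```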
@LarsHill - would you maybe be interested in tackling this issue by improving the docs a bit: https://github.com/huggingface/transformers/issues/16135 ? :-)
Added some action items following the discussion here: #16135
Hi,
First of all, thanks for the extensive reply! I agree that most of my concerns could be tackled by improving the documentation. After all, it is not code-breaking since there are workarounds. It is just that I ran into some of the mentioned problems myself and had to dig deep into the code base to understand what was going on.
Regarding contributing to the documentation, I cannot promise to get to it any time soon, since I'm quite occupied with project and research work at the moment. But I'll keep this issue in mind and get back to it. If no one else has contributed in the meantime, I'll take a shot.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi guys,
When creating my own EncoderDecoder abstraction to have more flexibility initializing custom huggingface-based models as encoder and decoder, I noticed a couple of issues that imho should be fixed/changed. I think this is most relevant to @patrickvonplaten since he seems to be in charge of EncoderDecoderModel in modeling_encoder_decoder.py and generation_utils.py. I tag @LysandreJik as well since I detected some other potential improvement related to BERT (maybe also other models if the behaviour is the same).

1. Problem: When using EncoderDecoderModel, the underlying encoder and decoder could have different tokenizers and thus different pad_token_id values. That means self.config.encoder.pad_token_id and self.config.decoder.pad_token_id might be different. When generating with an encoder_decoder_instance and no attention_mask is provided to the generate function, the attention mask is created internally. This happens in _prepare_attention_mask_for_generation() in generation_utils.py. However, this function does not distinguish the encoder-decoder vs. decoder-only case. Hence, it uses the pad_token_id that is registered in self.config.pad_token_id. This can cause a problem if self.config.pad_token_id is not equal to self.config.encoder.pad_token_id.
Proposed Solution: Imho _prepare_attention_mask_for_generation() should check for self.config.is_encoder_decoder and, if true, the padding token should be taken from self.config.encoder.pad_token_id instead of self.config.pad_token_id.
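A minimal sketch of the proposed check (simplified and standalone, not the actual transformers code; config is assumed to behave like an EncoderDecoderConfig with encoder and pad_token_id attributes):

```python
import torch


def _prepare_attention_mask_for_generation(input_ids: torch.LongTensor, config) -> torch.LongTensor:
    # Proposed behaviour: for encoder-decoder models, take the padding token
    # from the encoder sub-config instead of the top-level config.
    if getattr(config, "is_encoder_decoder", False) and config.encoder.pad_token_id is not None:
        pad_token_id = config.encoder.pad_token_id
    else:
        pad_token_id = config.pad_token_id

    # Build a pad-based mask only if the pad token actually occurs in the
    # input; otherwise attend to every position.
    if pad_token_id is not None and (input_ids == pad_token_id).any():
        return (input_ids != pad_token_id).long()
    return torch.ones_like(input_ids)
```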
2. Problem: The decoder attention mask is created/updated on each generation step in generate() by calling prepare_inputs_for_generation(), which is implemented in the corresponding model instance, e.g. encoder_decoder, bert, etc. However, BERT implements this function to simply create an all-ones mask that mimics the shape of the current input_ids, irrespective of previously predicted ids. Assuming that at some point in the generation process a pad_token_id is predicted, the attention mask update should take this into account and place a 0 at that position in the mask.
Proposed Solution: All models that implement prepare_inputs_for_generation() should imho take their corresponding pad_token_id in input_ids into account when updating the attention_mask between generation steps.
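A simplified, standalone illustration of the proposed update (the real prepare_inputs_for_generation methods are model-specific and do not take pad_token_id as an argument):

```python
import torch


def prepare_inputs_for_generation(input_ids: torch.LongTensor, pad_token_id=None, **model_kwargs):
    # Instead of an all-ones mask, zero out positions where the (possibly
    # generated) pad token appears in the current input_ids.
    if pad_token_id is not None:
        attention_mask = (input_ids != pad_token_id).long()
    else:
        attention_mask = torch.ones_like(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask, **model_kwargs}
```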
3. Problem: The attention_mask creation in e.g. BERT and in generate() is not aligned if the user does not provide a mask himself. BERT (and maybe other models) simply creates a mask of all ones (same shape as input_ids). As described in 1., generate() takes the pad_token_id into account and creates a proper mask based on input_ids. At the moment I don't need to provide the mask to generate(), but I have to provide the mask to a simple forward() during training, because if it is not provided an all-ones mask is created. I feel this should be aligned -> the model creates the correct mask internally if it is not provided.
Proposed Solution: Imho each model should create the correct attention_mask if the user does not provide any. The pad_token_id is known to the model, so implementing this should be no problem.
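A minimal sketch of the alignment being asked for, written as a thin hypothetical wrapper around BERT rather than a change inside transformers itself:

```python
import torch
from transformers import BertModel, BertTokenizer


class PadAwareBert(torch.nn.Module):
    """Hypothetical wrapper: derive the attention mask from pad_token_id when
    the caller does not pass one, mirroring what generate() already does."""

    def __init__(self, name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.pad_token_id = self.bert.config.pad_token_id

    def forward(self, input_ids, attention_mask=None, **kwargs):
        if attention_mask is None and self.pad_token_id is not None:
            # Mask out padding instead of silently attending to everything.
            attention_mask = (input_ids != self.pad_token_id).long()
        return self.bert(input_ids=input_ids, attention_mask=attention_mask, **kwargs)


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["a short text", "a slightly longer example text"], padding=True, return_tensors="pt")
outputs = PadAwareBert()(batch.input_ids)  # no explicit attention_mask needed
```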
Some feedback about these thoughts would be great. Maybe I missed something in the code base and the issues are not relevant after all.
Thanks for your work and I'm looking forward to hearing from you.
Lars