parambharat closed this issue 2 years ago.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@patrickvonplaten: is this something I can contribute towards?
Hey @parambharat,
It would be great if you could try to fix the problem by opening a PR! More than happy to take a look :-)
@ydshieh think you've worked with LED quite a bit recently. Could you take a look here? :-)
Sure, I will try it 😎
Hi @parambharat, @patrickvonplaten

I looked at this issue, and I think @parambharat's suggestion makes sense, but it needs to be refined. In the `_pad` method, just like for `attention_mask`, we need to deal with the case where `"global_attention_mask"` is not in `encoded_inputs`: either (option 1) always add a `global_attention_mask` to the outputs, or (option 2) only pad it when the user provides one.

If we go for option 1, I think it is more logical to also include `global_attention_mask` in the output of `encode` and the other tokenizer methods. But I prefer not to override many methods, so I will go for option 2 (i.e. not return `global_attention_mask` if it is not provided by the user).
I understand the issue now much better - thanks for clarifying! IMO it's a good idea to overwrite the `_pad` method in the tokenizer, and I agree with @ydshieh that option 2) is simpler and makes more sense here! @parambharat would you be interested in opening a PR here, or @ydshieh maybe? :-)
Let's see if @parambharat would like (or has time) to contribute first. Otherwise, I can work on it.
@parambharat
This issue is finally fixed in #15940.
Environment info
transformers version: 4.13.0.dev0

Who can help
Models: @patrickvonplaten
Information
Model I am using (LEDTokenizer, LEDSeq2SeqLM):
The problem arises when using:
[x] my own modified scripts:
I'm trying to fine-tune a LEDSeq2SeqLM model for abstractive summarization of long speech transcripts.
I followed the code here.
I modified it to use dynamic padding: the tokenizer no longer pads, and `DataCollatorForSeq2Seq` pads each batch dynamically (a sketch of this setup follows below).
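For reference, a minimal sketch of that setup (the checkpoint name, the dataset field names `transcript` and `summary`, and the max lengths are illustrative placeholders, not values from the original script):

```python
from transformers import DataCollatorForSeq2Seq, LEDForConditionalGeneration, LEDTokenizer

# Placeholder checkpoint; any LED checkpoint would do.
checkpoint = "allenai/led-base-16384"
tokenizer = LEDTokenizer.from_pretrained(checkpoint)
model = LEDForConditionalGeneration.from_pretrained(checkpoint)


def preprocess(example):
    # Truncate but do NOT pad here; padding happens per batch in the collator.
    model_inputs = tokenizer(example["transcript"], truncation=True, max_length=4096)
    # Global attention on the first token, as usually recommended for LED.
    model_inputs["global_attention_mask"] = [1] + [0] * (len(model_inputs["input_ids"]) - 1)
    model_inputs["labels"] = tokenizer(example["summary"], truncation=True, max_length=512)["input_ids"]
    return model_inputs


# Dynamic padding: each batch is padded to its longest example.
# Without the fix discussed in this issue, `global_attention_mask` is left unpadded here.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding="longest")
```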
To reproduce
Steps to reproduce the behavior:
The error was not very helpful, since the tokenizer is instantiated with `padding`, `truncation`, and `max_length` params. The problem was the `global_attention_mask` not being padded in the `DataLoader` here. I was able to pinpoint the problem to the padding in the tokenizer, where the model inputs to be padded don't include `global_attention_mask`:
https://github.com/huggingface/transformers/blob/75ae287aecf20a37c232a41e25443a3421a8b5e2/src/transformers/tokenization_utils_base.py#L3121-L3150
I changed lines L3128-L3131 (https://github.com/huggingface/transformers/blob/75ae287aecf20a37c232a41e25443a3421a8b5e2/src/transformers/tokenization_utils_base.py#L3128-L3131) to the following and everything worked.
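(The modified snippet itself is not preserved in this thread. In spirit, the change amounts to something like the fragment below inside the right-padding branch of `_pad`; the surrounding lines are paraphrased, not copied from the linked commit.)

```python
# Paraphrased fragment from inside `_pad` (right-padding branch), not the exact library code.
difference = max_length - len(required_input)

if self.padding_side == "right":
    if return_attention_mask:
        encoded_inputs["attention_mask"] = encoded_inputs["attention_mask"] + [0] * difference
    # Hack: also pad a user-supplied global_attention_mask so the batch tensors line up.
    # Padding with 0 only works because padded positions are masked out by attention_mask anyway.
    if "global_attention_mask" in encoded_inputs:
        encoded_inputs["global_attention_mask"] = (
            encoded_inputs["global_attention_mask"] + [0] * difference
        )
```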
Expected behavior
My fix was a hack for sure. We could perhaps override the `_pad` method in the LEDTokenizer so that it also pads LED's `global_attention_mask`, something like the following.
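The snippet that originally closed this comment is not preserved; a minimal sketch of such an override could look like this (the `-1` pad value and the exact `_pad` signature are assumptions based on the LED convention that `0` means local attention; check them against the installed transformers version):

```python
from typing import Optional

from transformers import LEDTokenizer
from transformers.tokenization_utils_base import PaddingStrategy


class LEDTokenizerWithGlobalAttentionPadding(LEDTokenizer):
    def _pad(
        self,
        encoded_inputs,
        max_length: Optional[int] = None,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
    ) -> dict:
        # Let the base implementation pad input_ids, attention_mask, token_type_ids, etc.
        encoded_inputs = super()._pad(
            encoded_inputs,
            max_length=max_length,
            padding_strategy=padding_strategy,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask,
        )

        # Then bring a user-provided global_attention_mask up to the padded length.
        if "global_attention_mask" in encoded_inputs:
            required_input = encoded_inputs[self.model_input_names[0]]
            difference = len(required_input) - len(encoded_inputs["global_attention_mask"])
            if difference > 0:
                # Pad with -1 rather than 0, since 0 already means "local attention" for LED.
                if self.padding_side == "right":
                    encoded_inputs["global_attention_mask"] = (
                        encoded_inputs["global_attention_mask"] + [-1] * difference
                    )
                else:
                    encoded_inputs["global_attention_mask"] = (
                        [-1] * difference + encoded_inputs["global_attention_mask"]
                    )
        return encoded_inputs
```

With an override like this, `tokenizer.pad` (and therefore `DataCollatorForSeq2Seq`) would also pad `global_attention_mask`, so the `DataLoader` no longer fails on mismatched lengths.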