T5 truncation : `generate()` produce a tensor of maximum 115 length

pbrochar commented 2 years ago

Environment info

transformers version: 4.11.3
Platform: Linux-5.14.0-2-amd64-x86_64-with-glibc2.33
Python version: 3.9.7
PyTorch version (GPU?): 1.11.0.dev20211021+cu111 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: no
Using distributed or parallel set-up in script?: no/i don't know

Who can help

encoder-decoder models : @patrickvonplaten, @patil-suraj

Information

Model I am using : T5-Base model for Translation task (en-fr)

The problem arises when using my own modified scripts:

from transformers import T5ForConditionalGeneration, T5Tokenizer

def translate_text_t5() -> None:
    """
    Use T5 model for translate a text.
    Model Pre-trained but not allready fine-tuned.
    """
    sentence = "I hope that a study of very long sentences will arm you with strategies that are almost as diverse as the sentences themselves, such as: starting each clause with the same word, tilting with dependent clauses toward a revelation at the end, padding with parentheticals, showing great latitude toward standard punctuation, rabbit-trailing away from the initial subject, encapsulating an entire life, and lastly, as this sentence is, celebrating the list."
    print(f"sentence len: {len(sentence)}")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")
    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    tokenizer.padding_side = "left"
    tokenizer.pad_token = tokenizer.eos_token
    #Set the task prefix.
    task_prefix = "translate English to French: "
    inputs= tokenizer(
        task_prefix + sentence,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    print(f"inputs tensor size : {len(inputs['input_ids'][0])}")
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=1024,
    )
    print(f"ouputs tensor size : {len(outputs[0])}")
    decode = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(decode)

translate_text_t5()

That produce the following ouput

sentence len: 453
inputs tensor size : 106
ouputs tensor size : 115
["J'espère qu'une étude de très longues phrases vous donnera des stratégies presque aussi diverses que les phrases elles-mêmes, comme : commencer chaque clause par le même mot, incliner les clauses dépendantes vers une révélation à la fin, rembourrer par des parenthèses, montrer une grande latitude envers la ponctuation standard, éloigner le lapin du sujet initial, encapsuler toute une vie, et"]

To reproduce

This is a minimal example of the script, copying it is enough for an example. Other sentences :

"Automatic extractive summarization generates a summary in which sentences are selected from the input article(s) and generated as they are, whereas automatic abstractive summarization engenders an abstract composed of rephrased sentences representing the same ideas/concepts of the source article(s) and more about complexity of the output of the previous managed systems and all the data of the world and of the galaxy."

"Given that much of the information has been extrapolated from what we know about other coronaviruses including severe acute respiratory syndrome coronavirus and Middle East respiratory syndrome coronavirus, we identify and provide insight into controversies and research gaps for the current pandemic to assist with future research ideas."

Expected behavior

The translated sentence is truncated. In the example, the end of the sentence is missing in the translation, the following words are not translated "[...]lastly, as this sentence is, celebrating the list." This happens with other large sentences. We found that the size of the output tensor is maximum 115.

Why is the output size of the tensor limited to 115? I know we could use LED or Longformer, but we would like to understand why this happens with large sentences, and to understand the proper workflow for long sentences with this model.

patrickvonplaten commented 2 years ago

The output lengths is not limited to 115 - it's simply that T5 just generates an EOS token after 115 tokens. So to make the output longer you could play around with things like some of generate arguments (check them here: https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.generate), such as:

min_length
num_beams
length_penalty

patrickvonplaten commented 2 years ago

In a first step I would try to set min_length to 120 to force the model to output longer sequences

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

kmrniket commented 1 year ago

Yes, it needs to be addressed for flan T5 based models.

huggingface / transformers