huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Error with T5 model: Output is always getting truncated at 20 tokens #13424

Closed nbravulapalli closed 2 years ago

nbravulapalli commented 2 years ago

Environment info

Who can help

@patil-suraj @patrickvonplaten @sgugger

Information

Model I am using (Bert, XLNet ...): T5

The problem arises when using:

The tasks I am working on are:

To reproduce

I have fine-tuned a T5-small model for key-phrase extraction with the Trainer API by passing it an input paragraph and training it to output the same paragraph with the key-phrases surrounded by '|||' (e.g. |||George Washington||| was a president and ....). However, when I try to make a prediction with the model, the output always contains exactly 20 tokens, and as a result it is cut off mid-sentence. When tokenizing the training, validation, and testing sets, I set the max_length parameter to 512, so I do not know why every single output is only 20 tokens (my input data is much longer). Aside from the output being chopped off, the model seems to be fine (the '|||' is showing up in some of the predicted outputs despite the output length being only 20).

Steps to reproduce the behavior:

The function I use to tokenize my custom dataset (this is not using the HuggingFace Dataset class):

import copy
from tqdm import tqdm

def tokenize_dataset(dataset):
  tokenized_dataset = []
  for input, output in tqdm(dataset.items()):
    processed_input = t5_tokenizer(f"input: {input} </s>", padding='max_length', truncation=True, return_tensors="pt", max_length=512)
    processed_output = t5_tokenizer(f"output: {output} </s>", padding='max_length', truncation=True, return_tensors="pt", max_length=512)
    labels = copy.deepcopy(processed_output['input_ids'].squeeze())
    # Replace padding token ids (0 for T5) with -100 so they are ignored by the loss
    labels[labels == 0] = -100

    tokenized_dataset.append({'input_ids': processed_input['input_ids'].squeeze(),
                              'attention_mask': processed_input['attention_mask'].squeeze(),
                              'labels': labels})
  return tokenized_dataset
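
For reference, a minimal sketch of how I exercise this function (assuming t5_tokenizer is the T5Tokenizer loaded earlier; the toy pair below is illustrative only):

toy_dataset = {"George Washington was a president and general.": "|||George Washington||| was a president and general."}
tokenized = tokenize_dataset(toy_dataset)
print(tokenized[0]['input_ids'].shape)           # torch.Size([512]) because of padding='max_length'
print((tokenized[0]['labels'] == -100).sum())    # number of padded label positions ignored by the loss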

This is my code to train the model:

training_args = TrainingArguments(
    output_dir="t5smallcav12",
    logging_dir="t5smallcav12/runs",
    evaluation_strategy="steps",
    logging_strategy="steps",
    save_strategy="steps",
)
trainer = Trainer(
    t5_model,
    training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=t5_tokenizer
)
trainer.train()
trainer.save_model("t5smallcav12")

mymod = T5ForConditionalGeneration.from_pretrained(pretrained_model_name_or_path="t5smallcav12/")
toker = T5Tokenizer.from_pretrained(pretrained_model_name_or_path="t5smallcav12/")

def tokenize_one(input):
  processed_input = toker(f"input: {input} </s>", padding='max_length', truncation=True, return_tensors="pt", max_length=512)
  # Move the encoding to the same device as the model
  return processed_input.to(device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device", device)
mymod = mymod.to(device)

with torch.no_grad():
  input, output = list(val_mapping.items())[302]
  print(input)
  print(output)
  encoded = tokenize_one(input)
  generated_ids = mymod.generate(input_ids=encoded['input_ids'], attention_mask=encoded['attention_mask'])[0]
  print(toker.decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))

print(len(generated_ids))

The output from the last line is always 20, no matter which sample I use from val_mapping.

EDIT

After looking at this issue on the HuggingFace forum, I found out that my mymod.config.max_length was 20. After manually reassigning this value to 512, my problem was solved. However, I still have no idea why it was 20 in the first place (I didn't set this). Feel free to close this issue; I am only leaving it open so that the general issue can be addressed.
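
For anyone who runs into the same thing, the workaround described above is just the following (a minimal sketch; 512 is simply the cap that fits my outputs):

mymod.config.max_length = 512  # override the default generation length of 20 before calling generate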

patil-suraj commented 2 years ago

Glad you found the solution.

The default max_length for generate is 20. The appropriate max_length really depends on the task/problem, so it should be set either in the config or passed to generate.
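
For example, passing it per call works as well (a minimal sketch reusing the variables from the reproduction above; 512 is just an illustrative value):

generated_ids = mymod.generate(
    input_ids=encoded['input_ids'],
    attention_mask=encoded['attention_mask'],
    max_length=512,  # overrides the default of 20 for this call only
)[0]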