huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Error with T5 model: Output is always getting truncated at 20 tokens #13424

Closed nbravulapalli closed 2 years ago

nbravulapalli commented 2 years ago

Environment info

Who can help

@patil-suraj @patrickvonplaten @sgugger

Information

Model I am using (Bert, XLNet ...): T5

The problem arises when using:

The tasks I am working on are:

To reproduce

I have fine-tuned a T5-small model for key-phrase extraction with the Trainer API by passing it an input paragraph and training it to output the same paragraph with the key-phrases surrounded by '|||' (e.g. |||George Washington||| was a president and ....). However, when I try to make a prediction with the model, the output always contains exactly 20 tokens, and as a result it is cut off mid-sentence. When tokenizing the training, validation, and testing sets, I set the max_length parameter to 512, so I do not know why every single output is only 20 tokens (my input data is much longer). Aside from the output being chopped off, the model seems to be fine (the '|||' is showing up in some of the predicted outputs despite the output length being only 20).

Steps to reproduce the behavior:

The function I use to tokenize my custom dataset (this is not using the HuggingFace Dataset class):

import copy
from tqdm import tqdm

def tokenize_dataset(dataset):
  tokenized_dataset = []
  for input, output in tqdm(dataset.items()):
    processed_input = t5_tokenizer(f"input: {input} </s>", padding='max_length', truncation=True, return_tensors="pt", max_length=512)
    processed_output = t5_tokenizer(f"output: {output} </s>", padding='max_length', truncation=True, return_tensors="pt", max_length=512)
    labels = copy.deepcopy(processed_output['input_ids'].squeeze())
    # Replace padding token ids (0 for T5) with -100 so they are ignored by the loss
    labels[labels == 0] = -100

    tokenized_dataset.append({'input_ids': processed_input['input_ids'].squeeze(),
                              'attention_mask': processed_input['attention_mask'].squeeze(),
                              'labels': labels})
  return tokenized_dataset
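
For reference, a minimal sketch of how I exercise this function (assuming t5_tokenizer is the T5Tokenizer loaded earlier; the toy pair below is illustrative only):

toy_dataset = {"George Washington was a president and general.": "|||George Washington||| was a president and general."}
tokenized = tokenize_dataset(toy_dataset)
print(tokenized[0]['input_ids'].shape)           # torch.Size([512]) because of padding='max_length'
print((tokenized[0]['labels'] == -100).sum())    # number of padded label positions ignored by the loss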

This is my code to train the model:

training_args = TrainingArguments(
    output_dir="t5smallcav12",
    logging_dir="t5smallcav12/runs",
    evaluation_strategy="steps",
    logging_strategy="steps",
    save_strategy="steps",
)
trainer = Trainer(
    t5_model,
    training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=t5_tokenizer
)
trainer.train()
trainer.save_model("t5smallcav12")

mymod = T5ForConditionalGeneration.from_pretrained(pretrained_model_name_or_path="t5smallcav12/")
toker = T5Tokenizer.from_pretrained(pretrained_model_name_or_path="t5smallcav12/")

def tokenize_one(input):
  processed_input = toker(f"input: {input} </s>", padding='max_length', truncation=True, return_tensors="pt", max_length=512)
  # Move the encoding to the same device as the model
  return processed_input.to(device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device", device)
mymod = mymod.to(device)

with torch.no_grad():
  input, output = list(val_mapping.items())[302]
  print(input)
  print(output)
  encoded = tokenize_one(input)
  generated_ids = mymod.generate(input_ids=encoded['input_ids'], attention_mask=encoded['attention_mask'])[0]
  print(toker.decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))

print(len(generated_ids))

The output from the last line is always 20, no matter which sample I use from val_mapping.

EDIT

After looking at this issue on the HuggingFace forum, I found out that my mymod.config.max_length was 20. After manually reassigning this value to 512, my problem was solved. However, I still have no idea why it was 20 in the first place (I didn't set this). Feel free to close this issue; I am only leaving it open so that the general issue can be addressed.
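
For anyone who runs into the same thing, the workaround described above is just the following (a minimal sketch; 512 is simply the cap that fits my outputs):

mymod.config.max_length = 512  # override the default generation length of 20 before calling generate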

patil-suraj commented 2 years ago

Glad you found the solution.

The default max_length for generate is 20. The appropriate max_length really depends on the task/problem, so it should be set either in the config or passed to generate.
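
For example, passing it per call works as well (a minimal sketch reusing the variables from the reproduction above; 512 is just an illustrative value):

generated_ids = mymod.generate(
    input_ids=encoded['input_ids'],
    attention_mask=encoded['attention_mask'],
    max_length=512,  # overrides the default of 20 for this call only
)[0]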