huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

seq2seq BertGeneration model failed "ValueError: You have to specify either input_ids or inputs_embeds" #10646

Closed · gyin94 closed this 3 years ago

gyin94 commented 3 years ago

To reproduce:

python examples/seq2seq/run_seq2seq.py \
    --model_name_or_path google/roberta2roberta_L-24_discofuse \
    --do_train \
    --do_eval \
    --task summarization \
    --train_file path_to_csv_or_jsonlines_file \
    --validation_file path_to_csv_or_jsonlines_file \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate \
    --max_train_samples 500 \
    --max_val_samples 500

where path_to_csv_or_jsonlines_file contains:

text,summary
google map, gg map
google translate, gg translate

t5-small works perfectly, but the BertGeneration model fails with the following error:

  File "/Users/gyin/Documents/working/transformers/src/transformers/models/bert_generation/modeling_bert_generation.py", line 361, in forward
    raise ValueError("You have to specify either input_ids or inputs_embeds")
ValueError: You have to specify either input_ids or inputs_embeds
patil-suraj commented 3 years ago

Hi @gyin-ai,

Thank you for reporting the issue. The run_seq2seq.py script currently does not work for encoder-decoder models: these models expect both decoder_input_ids and labels, whereas the script only passes labels, which is what causes the error above.
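To make the failure mode concrete, a minimal sketch (with hypothetical input_ids/labels tensors; in the transformers version from this issue, EncoderDecoderModel does not derive decoder_input_ids from labels):

# hypothetical tensors, for illustration only
outputs = model(input_ids=input_ids, labels=labels)
# -> raises ValueError: the decoder receives neither input_ids nor inputs_embeds

outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)
# -> works: the decoder now has its own inputs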

You could refer to this notebook to see how to use Trainer for encoder-decoder models. You can also easily adapt the run_seq2seq.py script for this; I think you'll only need to change the data collator here to return both labels and decoder_input_ids.
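For illustration, a minimal collator along those lines could look like the sketch below. It assumes the examples were already tokenized with fixed-length padding, that padding positions in labels were replaced with -100 (as in the notebook further down), and that tokenizer is in scope; collate_fn and its exact shape are illustrative, not the script's actual collator:

import torch

def collate_fn(features):
    # stack the already-padded, already-tokenized examples
    input_ids = torch.tensor([f["input_ids"] for f in features])
    attention_mask = torch.tensor([f["attention_mask"] for f in features])
    labels = torch.tensor([f["labels"] for f in features])
    # the decoder reads the target ids themselves (BertGeneration shifts them
    # internally when computing the loss), with -100 mapped back to pad tokens
    decoder_input_ids = labels.masked_fill(labels == -100, tokenizer.pad_token_id)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "decoder_input_ids": decoder_input_ids,
        "labels": labels,
    }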

gyin94 commented 3 years ago

@patil-suraj can I ask whether batch["decoder_input_ids"] should be inputs.input_ids instead of outputs.input_ids?

def process_data_to_model_inputs(batch):
    # Tokenizer will automatically set [BOS] <text> [EOS]
    inputs = tokenizer(batch["document"], padding="max_length", truncation=True, max_length=encoder_max_length)
    outputs = tokenizer(batch["summary"], padding="max_length", truncation=True, max_length=decoder_max_length)

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    batch["decoder_input_ids"] = outputs.input_ids
    batch["labels"] = outputs.input_ids.copy()
    # mask the loss on padding by replacing pad tokens in the labels with -100
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]
    ]
    batch["decoder_attention_mask"] = outputs.attention_mask

    return batch
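
For context, in the notebook this function is applied batched over the dataset splits, roughly like this (batch_size and the column names come from that setup):

train_data = train_data.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["document", "summary"],
)
train_data.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)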

Here is the example from the EncoderDecoderModel documentation:

>>> from transformers import EncoderDecoderModel, BertTokenizer
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert from pre-trained checkpoints

>>> # forward
>>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)

>>> # training
>>> outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, labels=input_ids)
patil-suraj commented 3 years ago

The labels and decoder_input_ids always correspond to the output (target) sequence, so it should be outputs.input_ids. The documentation example above reuses the same input_ids for both the encoder and decoder side only for brevity.
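
As a toy illustration of that correspondence, reusing the tokenizer and model from the snippet above (the strings come from this issue's sample CSV):

# encoder side: the source text
inputs = tokenizer("google map", return_tensors="pt")
# decoder side: the target text supplies both decoder_input_ids and labels
outputs = tokenizer("gg map", return_tensors="pt")

loss = model(
    input_ids=inputs.input_ids,            # read by the encoder
    attention_mask=inputs.attention_mask,
    decoder_input_ids=outputs.input_ids,   # read by the decoder
    labels=outputs.input_ids.clone(),      # predicted by the decoder
).loss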