huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

Train EncoderDecoder Models for question generation #5213

Closed: joachim-dublineau closed this issue 4 years ago

joachim-dublineau commented 4 years ago

❓ Questions & Help

Details

How to train models for text generation

Hi Everyone,

I am trying to finetune an Encoder Decoder model on a question generation task on SQuAD. The inputs are a concatenation of the answer span and the context, and the outputs are the questions.

inputs = tokenizer.encode_plus(example.answer, example.context, add_special_tokens=True, max_length=max_length, truncation='only_second')

label = tokenizer.encode_plus(example.question, add_special_tokens=True, max_length=max_length_label, truncation=True)

decoder_input_ids, label_ids = data_collator.mask_tokens(torch.tensor(label["input_ids"]).unsqueeze(0))

I add padding to all of these arguments if necessary and pass them to the model, which can be either a BERT2BERT EncoderDecoder or a BART model.
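For illustration, here is a minimal padding sketch along those lines (the pad_to helper and the fixed max lengths are placeholders, not the original code):

# Hypothetical padding helper; max_length, max_length_label and tokenizer are
# assumed to match the snippets above.
import torch

def pad_to(ids, length, pad_id):
    return ids + [pad_id] * (length - len(ids))

input_ids = torch.tensor(pad_to(inputs["input_ids"], max_length, tokenizer.pad_token_id)).unsqueeze(0)
attention_mask = torch.tensor(pad_to(inputs["attention_mask"], max_length, 0)).unsqueeze(0)
label_ids = torch.tensor(pad_to(label["input_ids"], max_length_label, tokenizer.pad_token_id)).unsqueeze(0)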

I thought that everything was alright and I started training my two models. As training progressed, the mlm_probability of the data collator object was increased from 0.20 to 0.40 and then to 1. The optimizer and scheduler are as follows (lr around 3e-5):

optimizer = AdamW(model.parameters(), lr=learning_rate, eps=adam_epsilon)
scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total, num_cycles=num_cycles)
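As a side note, here is a hedged sketch of how warmup_steps and t_total are commonly derived (train_dataloader, gradient_accumulation_steps and num_train_epochs are assumed names, not taken from my script):

# Common way to compute the total number of optimization steps and the warmup.
t_total = len(train_dataloader) // gradient_accumulation_steps * num_train_epochs
warmup_steps = int(0.1 * t_total)  # e.g. warm up over the first 10% of steps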

The eval loss was decreasing throughout the 100 epochs for the BERT2BERT, but it didn't look like the questions were improving:

epoch 50:
what country did the french support in libya????? - 2013, 2014??
what country did nasser end to the coup? in 1989, 2007 and 2008 - 2011's
what country did the us state have to use a particular prohibition of fuel in its oil? 2007

epoch 100:
where was the fisafat for? islamic party in libya and al - farabut movement
what did the unfyadi want to end in 1990? - 1991, 2003 and gulf
what country did the oil industry stop its fuel and coaling? in a world, which countries

The observation remains the same for the BART model:

100 steps:
? what was the name of normnormandy in frfrance.
? when did people in the first half of what began to give their
? who were the people that did not to swear fealty oath in

4 epochs:
normnormnaandyanye gave given offered name namesNames to forfor normnormNormansons gave given granted their own original initial ancestral native normnormdonaldansons descended originated originate originating from origins origin?ers

My questions are:

Do you think that something is wrong with my training?
What do you think about the performance?
Do you have any suggestions for the question generation task?
How are the decoder_input_ids supposed to change for a next-word-prediction loss?
Should I use a next-word-prediction loss or a masked LM loss?
How can I use dropout with a pretrained model?

Thank you in advance for your help, and I hope that my post will be useful to others. If need be, I can share a bigger part of my code :)

patil-suraj commented 4 years ago

Hey @joachim-dublineau , not a direct answer to your question, but here's a relevant discussion thread #4399

patil-suraj commented 4 years ago

And can you post the code where you prepare the decoder_input_ids and labels?

joachim-dublineau commented 4 years ago

Hi @patil-suraj ,

Thanks for your quick reply.

I have indeed seen this topic previously without finding answers to my points.

For the code, I use the data collator (https://github.com/huggingface/transformers/blob/5f721ad6e48c9d846de25c3fefa0e50a306cbf10/src/transformers/data/data_collator.py) and its mask_tokens(label_ids) function.

patil-suraj commented 4 years ago

You won't need mask_tokens. mask_tokens is used for masked language modelling: it masks some tokens in the input, so maybe this is why you are seeing the weird output.

For bart

source_ids, source_mask, y = batch["input_ids"], batch["attention_mask"], batch["decoder_input_ids"]
y_ids = y[:, :-1].contiguous()              # decoder inputs: target sequence without its last token
lm_labels = y[:, 1:].clone()                # labels: target sequence without its first token
lm_labels[y[:, 1:] == pad_token_id] = -100  # ignore padding positions in the loss

input_ids will be your tokenized context and decoder_input_ids will be the tokenized question.
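For completeness, a minimal sketch of the forward pass these tensors would feed (assuming model is a BartForConditionalGeneration; newer versions of the library use labels instead of lm_labels):

# Sketch only: pad_token_id comes from the tokenizer; when labels are given,
# the loss is the first element of the output tuple.
outputs = model(
    input_ids=source_ids,
    attention_mask=source_mask,
    decoder_input_ids=y_ids,  # target without its last token
    lm_labels=lm_labels,      # target without its first token, padding set to -100
)
loss = outputs[0]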

For the EncoderDecoder model, you can pass the encoded input to input_ids and the encoded question to both decoder_input_ids and lm_labels:

source_ids, source_mask, y = batch["input_ids"], batch["attention_mask"], batch["decoder_input_ids"]
model(input_ids=source_ids, decoder_input_ids=y, lm_labels=y)

Hope this is clear
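For reference, a hedged end-to-end sketch of the EncoderDecoder variant above (the bert-base-uncased checkpoints are just an example, and newer library versions expect labels rather than lm_labels):

# Sketch only: a bert2bert EncoderDecoder built from two BERT checkpoints.
# source_ids, source_mask and y are the batch tensors from the snippet above.
from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
outputs = model(input_ids=source_ids, attention_mask=source_mask,
                decoder_input_ids=y, lm_labels=y)
loss = outputs[0]  # the loss comes first when labels are provided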

joachim-dublineau commented 4 years ago

So I shouldn't use mask_tokens, ok thank you !

What I don't get is this: if I provide the question in decoder_input_ids, the decoder will already have the ground truth, so why should I also use the labels argument?

And what is y in your first code snippet?

patil-suraj commented 4 years ago

What I don't get is this: if I provide the question in decoder_input_ids, the decoder will already have the ground truth, so why should I also use the labels argument?

The EncoderDecoder model expects the input in this way. Basically, it shifts the lm_labels or labels to the right. @patrickvonplaten, is this correct?

y is the decoder input shifted to the right.
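To illustrate the shift, here is a minimal sketch (the helper is modeled on BART's shift_tokens_right; the library has its own internal implementation):

# The decoder input starts with the decoder start token followed by the target
# sequence moved one position to the right; the labels stay unshifted.
def shift_tokens_right(input_ids, decoder_start_token_id):
    shifted = input_ids.new_zeros(input_ids.shape)
    shifted[:, 1:] = input_ids[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    return shifted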

joachim-dublineau commented 4 years ago

Thank you @patil-suraj ! I will implement this and keep this post updated.

volker42maru commented 4 years ago

I tried the same thing, using an EncoderDecoder model for QG initialized from bert-base-uncased.

The model outputs somewhat readable questions:

what team won the 2015 nfl championship?
what team did the nfl win in the 2015 super bowl?
where was the super bowl held?
what team won the 2015 nfl championship?
what was the name of the team that was the first to be featured on the nfl network?
what was the name of the game that the nfl used to celebrate the 2015 super bowl?
when was the super bowl played?

However, the BLEU-1 score is pretty low, around 0.35.

I wonder if someone got better results with the EncoderDecoder architecture. Otherwise, BART will probably be better for the task.

joachim-dublineau commented 4 years ago

Hi @volker42maru, what parameters do you use for generation (repetition penalty and length penalty)? And for how long did you train your model?

BART seems to be appropriate, but I personally have some difficulty making it work.

volker42maru commented 4 years ago

For generation I am using:

max_length=30, temperature=0.95, num_beams=1, length_penalty=0.25, no_repeat_ngram_size=3

You will get slightly better results with a bigger beam size, but the generation method seems incredibly slow (I wonder why that is?).

I trained for 2 epochs on the SQuAD1 train set.
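For reference, a hedged sketch of how those settings might be passed to generate (input_ids/attention_mask hold the encoded answer and context; the names are placeholders):

# Note: temperature only has an effect when do_sample=True.
generated = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=30,
    temperature=0.95,
    num_beams=1,
    length_penalty=0.25,
    no_repeat_ngram_size=3,
)
questions = [tokenizer.decode(g, skip_special_tokens=True) for g in generated]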

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.