huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Fine tune T5 for paraphrase generation #6007

Closed mengyahuUSTC-PU closed 4 years ago

mengyahuUSTC-PU commented 4 years ago

❓ Questions & Help

Dear all, I am new to NLP and have a lot of questions. Sorry to ask this long list here. I tried asking on Hugging Face's forum, but as a new user I can only put 2 lines there.

My goal is to fine-tune t5-large for paraphrase generation. I found this code, which is based on this code, so I just modified it to further fine-tune on my dataset. My questions (I also asked some of them on the GitHub repo mentioned above, but I feel they may be better addressed here):

  1. I trained for 2 epochs and the generated paraphrases looked good. When I trained for 11 epochs, the model seems overfitted (the generated paraphrases are very similar to the original sentences). Do you have any recommendations for further improving performance besides decreasing the number of epochs?

  2. For paraphrase generation using T5 as a text-to-text task, I don't know how to utilize the negative examples (pairs that are not paraphrases) directly here. Any recommendation?

  3. One idea I have to include the negative examples: first fine-tune T5-large on paraphrase identification with my data set (positive and negative examples), and then use this fine-tuned version to further fine-tune on paraphrase generation. My assumption is that the information learned in the paraphrase identification task will help improve paraphrase generation. Is this correct?

  4. I am also a little confused about the prefix. Hugging Face's docs say: 'T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g.: for translation: translate English to German: …, summarize: …. For more information about which prefix to use, it is easiest to look into Appendix D of the paper.' Thus, I think the prefix tells T5 which task it should be performing. The answer in this thread agrees with my understanding. However, in the first two examples here, the code only seems to add "</s>" at the end of the sentence, with no prefix. Could you tell me why? Does that mean these fine-tuned models will not do T5's pre-training tasks but only their specific trained task, so we don't need a prefix?

  5. Also, MRPC and QQP are both paraphrase identification tasks. If I want to fine-tune, should I use my data set to fine-tune with both of their prefixes, fine-tune with one of their prefixes, or create my own prefix?

  6. The loss function in the code is cross-entropy, which is not the best for this task. I am thinking of using the paraphrase identification result (like the probability of being a paraphrase) as the objective. Is this OK? I feel it may be super slow, and I am not really sure how to implement it.

  7. I have 3 paraphrase datasets (let's call them A, B, C) from different sources. Previously, I first trained the model on A for 2 epochs, then loaded this model as the pretrained model to further train on B for 2 epochs, and then on C. I then combined A, B, and C into one dataset and trained directly for 2 epochs. The two resulting models give different results and the second one is worse. I used the same random seed for both. Any idea?

  8. I set early_stop_callback=True and max_epochs=32, and it stops at epoch 11. But if I set max_epochs=6, it stops at epoch 3. I don't understand this, as I thought it would stop at epoch 6. I used the same random seed.

  9. Another strange thing during training: I saw this on the screen: Epoch 10: 100%............(time, loss et al)... INFO:__main__:avg_train_loss = tensor(..) INFO:__main__:epoch = 8 ........ Why is the epoch number not the same?!

  10. What is the correct way to evaluate on the test set? I saw several different examples.

In this example:

```python
t5 = T5ForConditionalGeneration.from_pretrained('output/')
input_ids = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True)  # Batch size 1
t5.eval()
generated_ids = t5.generate(
    input_ids=input_ids,
    num_beams=1,
    max_length=80,
    repetition_penalty=2.5
).squeeze()
predicted_span = tokenizer.decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
return predicted_span
```

This code has two examples. First, directly:

```python
outs = model.model.generate(input_ids=batch['source_ids'].cuda(),
                            attention_mask=batch['source_mask'].cuda(),
                            max_length=2)
```

He also has:

```python
loader = DataLoader(dataset, batch_size=32, num_workers=4)
model.model.eval()
outputs = []
targets = []
for batch in tqdm(loader):
    outs = model.model.generate(input_ids=batch['source_ids'].cuda(),
                                attention_mask=batch['source_mask'].cuda(),
                                max_length=2)
```

Is eval() necessary? From Hugging Face's docs, it seems unnecessary when the model is loaded with from_pretrained: "The model is set in evaluation mode by default using model.eval() (Dropout modules are deactivated). To train the model, you should first set it back in training mode with model.train()." On the other hand, none of the examples above uses model.train() to set the mode; they directly train the model. I am confused.

Is model.model necessary?

Thanks!!

patil-suraj commented 4 years ago

Phew, I won't be able to answer all the questions in a single comment, but I'll try my best.

  2. IMO, for generating paraphrases you probably won't need negative examples.

  3. Your assumption is correct; knowledge learned from one task can be useful for other similar tasks. You can approach this as a multitask problem:

    1. identify if two sentences are paraphrases of each other
    2. given one sentence, generate its paraphrase

    With T5 you can use task prefixes for multitask learning, so for identification your example could look something like

    input_text: "sent1: sentence 1 text  sent2: sentence two text"
    target_text: "equivalent" if sent2 is paraphrase of sent1 else "not equivalent"

    and for generation

    input_text: "paraphrase: sentence"
    target_text: "paraphrase of sentence"

    (see the code sketch after this list)
  4. Task prefixes are not required for T5 (they are required when doing multitask training, though), but if your task is similar to one of the tasks used in T5's pre-training mixture, then use the prefixes so you can exploit the knowledge already learned by the model. In my notebook I didn't use task prefixes (even though I was doing sentiment classification) because I wanted to see if it makes any difference if we don't use prefixes. Again: use task prefixes when doing multitask learning or when your task is similar to one of the tasks used in T5's pre-training mixture.

  5. Check which dataset is closer to your own task and decide based on that.
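To make the format above concrete, here is a minimal sketch of how such multitask examples could be tokenized for T5 with transformers (the sentences, prefixes, and max_length are placeholder assumptions, not something the model requires):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")

# Identification example: the target is a short label string, mirroring T5's GLUE-style tasks.
ident_input = "sent1: The cat sat on the mat. sent2: A cat was sitting on the mat."
ident_target = "equivalent"

# Generation example: the target is the paraphrase itself.
gen_input = "paraphrase: The cat sat on the mat."
gen_target = "A cat was sitting on the mat."

# Both tasks share the same text-to-text format, so their examples can be tokenized together.
inputs = tokenizer([ident_input, gen_input], padding=True, truncation=True,
                   max_length=64, return_tensors="pt")
targets = tokenizer([ident_target, gen_target], padding=True, truncation=True,
                    max_length=64, return_tensors="pt")

# During fine-tuning, targets.input_ids would be passed as `labels` to T5ForConditionalGeneration.
```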

10) I used model.model because here the outer model is an instance of the Lightning module and the HF model is initialized inside it, hence model.model. But once you save it using .save_pretrained, you can load it with .from_pretrained and call model.generate directly.
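For example, a minimal sketch of that save/load round trip (the directory name and the assumption that the Lightning wrapper also holds the tokenizer are mine):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# `model` here stands for the Lightning wrapper; `model.model` is the HF T5 inside it.
model.model.save_pretrained("t5_paraphrase")      # saves the inner model's weights + config
model.tokenizer.save_pretrained("t5_paraphrase")  # assumes the wrapper also keeps its tokenizer

# Later: load it as a plain transformers model -- no .model indirection needed any more.
t5 = T5ForConditionalGeneration.from_pretrained("t5_paraphrase")
tokenizer = T5Tokenizer.from_pretrained("t5_paraphrase")

input_ids = tokenizer.encode("paraphrase: The cat sat on the mat.", return_tensors="pt")
outs = t5.generate(input_ids=input_ids, num_beams=4, max_length=80)
print(tokenizer.decode(outs[0], skip_special_tokens=True))
```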

And for evaluation you can use BLEU, ROUGE, and METEOR metrics. I usually use nlg-eval for calculating these metrics. Generate predictions on your test data, then give your original reference file and the generated file to nlg-eval and it'll calculate the metrics.
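A rough sketch of that workflow (the file names and the `predictions` list are placeholders; the CLI invocation is from memory of the nlg-eval README, so double-check the exact flags):

```python
# Write one generated paraphrase per line, aligned line-by-line with the reference file.
# `predictions` is assumed to be a list of strings produced by model.generate on the test set.
with open("test_hypotheses.txt", "w") as f:
    for pred in predictions:
        f.write(pred.strip() + "\n")

# Then, from the shell:
#   nlg-eval --hypothesis=test_hypotheses.txt --references=test_references.txt
# which reports BLEU, METEOR, ROUGE-L, CIDEr and several embedding-based scores.
```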

And yes, .from_pretrained sets the model in eval mode by default.
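For instance, a quick check of the mode flag (the model name here is just an example):

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
print(model.training)  # False -> from_pretrained() returns the model in eval mode
model.train()
print(model.training)  # True -> back in training mode (dropout active)
```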

Hope this helps.

mengyahuUSTC-PU commented 4 years ago

Thank you, @patil-suraj! I learned a lot from your answers!

4. I thought T5 is already pretrained on several different tasks, which means T5 is a multitask model. We can use the prefixes in the appendix of the paper to perform the corresponding tasks. Even though we fine-tune it for a new task (like in your notebook), the ability on the pretrained tasks is not lost, right? If so, it is surprising to me that T5 didn't mess things up and knew what to do when you didn't give it any prefix.

5. & 3. If I want to fine-tune on QQP, should I also use MRPC's data set (i.e., my own data + MRPC)? On the other hand, if I train a new prefix, should I use QQP + MRPC + my own data? Will the fine-tuned T5 overfit a little on QQP and MRPC, as the model sees them several times (though during the training of different prefixes)? Similarly, if I use QQP + MRPC + my dataset to fine-tune T5 for paraphrase detection AND then use the positive examples in QQP + MRPC + my data set to fine-tune T5 for paraphrase generation, will this be information leakage? Should I avoid using the same positive examples in the two tasks?

10. None of the examples uses model.train() to set the mode for training. Is this redundant?

11. Thanks for suggesting nlg-eval! However, metrics like BLEU can't really evaluate the quality of the generated paraphrases. (Really good ones should have diverse phrasing and structure but still the same meaning.)

Hope someone can address questions 1, 6, 7, 8, and 9 too.

patil-suraj commented 4 years ago

4. Yes, T5 is a multitask model and you can use the prefixes to perform the corresponding tasks. But note that the results reported on individual tasks in the paper are obtained after fine-tuning the model again on that task specifically. And after fine-tuning, the model can forget about the other tasks.

5. To answer the first question: you probably won't need to use those datasets; by using a task prefix the model can exploit already available knowledge.

10. PyTorch Lightning automatically puts the model in train mode for the training loop and in eval mode for the eval loop.

11. Yes, BLEU won't make much sense; ROUGE seems like a better metric for this task.

mengyahuUSTC-PU commented 4 years ago

Thanks, @patil-suraj .

4. Could you explain what you mean by 'forget'? Here is my understanding: the model is first fine-tuned on task A with data set a using prefix Aa, so now the model has the set of parameters aa and we call it model AA. Then I use the resulting model to further fine-tune on task B with data set b using prefix Bb, so the model's parameters change to bb and we call it model BB. Thus, if we use the final model BB to perform task A, the model may or may not 'recognize' prefix Aa, but the performance will be worse than model AA's.

If what I say above is correct, then my original understanding of transfer learning as 'using the same model (same structure and same set of parameters) for different tasks' is wrong. If so, transfer learning or learning from multiple tasks just gives a better initialization, using the current task's result, for the next task.

5. If the understanding in 4 is correct, I think I may need to reuse the data sets when training a new prefix.

patil-suraj commented 4 years ago

4. 'Forget' here is in the context of multitask learning: if you take a multitask model and then fine-tune it for only one task, there's a chance that it can forget about the other tasks.

mengyahuUSTC-PU commented 4 years ago

Thanks, @patil-suraj! How about just one prefix/task? Will the model forget?

For example, I have paraphrase data set A and paraphrase data set B. Fine-tune 1: I first fine-tune t5-large on data set A using prefix 'para:' for 2 epochs; the resulting model is T5-A. I then fine-tune T5-A on data set B using prefix 'para:' for 2 epochs; the resulting model is T5-B. Fine-tune 2: I first combine data set A and data set B into one file, then fine-tune t5-large on the combined data set using prefix 'para:' for 2 epochs; the resulting model is T5-2.

Will T5-B forget about data set A? I tried the two fine-tuning methods and T5-2 seems worse than T5-B (T5-2 with more epochs seems worse than T5-B too).

My thought: if it were plain gradient descent with a single optimum and both methods had converged, they should show no difference. However, in real life there may be many local optima, so numerically there is no guarantee which method is better, and T5-2 should have a higher chance of being better since it sees a larger data set, which helps prevent overfitting.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.