namisan / mt-dnn

Multi-Task Deep Neural Networks for Natural Language Understanding

Why fine-tune in a single-task setting (not as stated in the paper)? #7

Closed svboeing closed 5 years ago

svboeing commented 5 years ago

In the arxiv paper it is stated:

In the multi-task fine-tuning stage, we use minibatch based stochastic gradient descent (SGD) to learn the parameters of our model (i.e., the parameters of all shared layers and task-specific layers) as shown in Algorithm 1. In each epoch, a mini-batch b_t is selected (e.g., among all 9 GLUE tasks), and the model is updated according to the task-specific objective for the task t. This approximately optimizes the sum of all multi-task objectives.
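As I read it, Algorithm 1 amounts to roughly the following loop (a PyTorch-style sketch; the model.task_loss helper and the other names are mine, not from this repo):

```python
import random

def multi_task_train(model, task_loaders, optimizer, num_epochs):
    """Sketch of Algorithm 1: task_loaders maps task name -> DataLoader."""
    for _ in range(num_epochs):
        # Pack every mini-batch from every task into one list, tagged by task.
        batches = [(task, batch)
                   for task, loader in task_loaders.items()
                   for batch in loader]
        # Shuffle so each step draws a mini-batch b_t from a random task t.
        random.shuffle(batches)
        for task, batch in batches:
            optimizer.zero_grad()
            # Forward through the shared layers plus the head for task t and
            # compute that task's own objective (hypothetical model method).
            loss = model.task_loss(task, batch)
            loss.backward()
            optimizer.step()  # updates shared and task-specific parameters
```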

If I understand it correctly, this multi-task fine-tuning stage is called MTL refinement in your code. Then why do you additionally fine-tune each task in a single-task setting in your fine-tuning stage? There is no such stage in the original paper. Also, run_mt_dnn.sh contains these lines:

train_datasets="mnli,rte,qqp,qnli,mrpc,sst,cola,stsb"
test_datasets="mnli_matched,mnli_mismatched,rte"

Why do you only test on MNLI and RTE and not on all the other tasks? I would also like to ask whether I can switch from BERT large to BERT base, since I only have one GTX 1080 card.

Thank you.

namisan commented 5 years ago

The attached paper is slightly outdated and was evaluated on the old version of GLUE (please refer to the GLUE website for details). If you look at the tasks in GLUE, sentence-pair tasks, e.g., NLI, dominate, so the single-sentence tasks are under-trained. Obtaining SOTA on the leaderboard therefore requires another fine-tuning step. We select models based on MNLI/RTE; of course, you can test on all the tasks. You can switch to BERT base if you don't have powerful GPUs. As mentioned in the README, we will update the paper.
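To make the two stages concrete, here is a rough sketch of the extra per-task stage, reusing the multi_task_train loop sketched above (the checkpoint path, build_model, and task_loaders are illustrative, not the actual mt-dnn code):

```python
import torch

# Stage 1 (MTL refinement) produced one shared checkpoint; the path and the
# build_model()/task_loaders names below are hypothetical, not the repo's API.
shared_state = torch.load("checkpoints/mt_dnn_refined.pt")

# Stage 2: starting from that checkpoint, continue training on one task at a
# time; with a one-task dict this is a degenerate case of the loop above.
for task in ["cola", "sst", "stsb"]:      # e.g. under-trained tasks
    model = build_model()                 # hypothetical constructor
    model.load_state_dict(shared_state)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
    multi_task_train(model, {task: task_loaders[task]}, optimizer, num_epochs=3)
```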

colfire commented 5 years ago

I just wonder whether we can obtain or approach SOTA on the leaderboard using only feature-based fine-tuning on each single task, instead of end-to-end fine-tuning. Note: both the feature-based and the end-to-end fine-tuning are based on MT-DNN.

namisan commented 5 years ago

This is a good question, but I don't know the answer. At an early stage we tried the feature-based approach and found it didn't help, so we haven't explored this direction further. Still, I believe it is worth a shot.

colfire commented 5 years ago

Thanks for your reply. With the feature-based approach, we can significantly reduce the inference cost for multiple tasks, since the shared representations only need to be computed once and can be reused by every task-specific head.

ConanCui commented 5 years ago


Sorry, I am wondering what the definitions of 'feature-based fine-tuning' and 'end-to-end fine-tuning' are, and what the difference between them is?

colfire commented 5 years ago

Conan, you can find the definitions in the original BERT paper for your reference :) https://arxiv.org/pdf/1810.04805.pdf
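Roughly: feature-based means the pre-trained (here, MT-DNN) encoder is frozen and used only as a feature extractor, with a new task-specific head trained on top; end-to-end means all parameters, encoder included, are updated for the task. A minimal sketch of the difference (illustrative names, not this repo's API):

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 1024, 3        # e.g. BERT-large width, 3 NLI labels
encoder = load_mt_dnn_encoder()          # hypothetical loader for shared layers
head = nn.Linear(hidden_size, num_labels)  # new task-specific classifier

# Feature-based: freeze the encoder and train only the head. The encoder's
# features can then be precomputed once and reused by every task's head.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# End-to-end: update everything, encoder included (one full model per task):
# optimizer = torch.optim.Adam(
#     list(encoder.parameters()) + list(head.parameters()), lr=5e-5)
```

That is also why the feature-based variant is cheap at inference time: one frozen encoder can serve many task heads.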


agupta74 commented 5 years ago

Are the MT-DNN results for each GLUE task in the paper (shown in the image below) based on a single multi-task model, or do you fine-tune on each of the GLUE tasks (as specified in this repo) on top of the multi-task model?

[image: per-task GLUE results for MT-DNN from the paper]