BaoshengHeTR opened this issue 4 years ago
Hi! Training fewer parameters offers some benefit in terms of training steps/second. However, the effect is not enormous because, even though only a few parameters are updated, one still needs to run inference through the entire model.
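To make this concrete, here is a rough PyTorch-style sketch of what "training fewer parameters" means in practice. It assumes a hypothetical model whose adapter parameters contain "adapter" in their names (the code released in this repo is TensorFlow, so this is illustrative only):

```python
import torch

def freeze_all_but_adapters(model: torch.nn.Module) -> None:
    """Mark only adapter parameters as trainable (the paper also keeps
    layer norms and the task head trainable; omitted here for brevity)."""
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name   # hypothetical naming convention
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    print(f"trainable: {trainable:,} / total: {total:,}")

# Only the small trainable subset receives gradients and optimizer updates,
# but every forward pass still runs through the full frozen network,
# which is why the per-step wall-clock speedup is limited.
```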
With careful early stopping, one should be able to train the adapters with relatively few training examples (down to 1000 examples). This will very likely work better than stacking an MLP on top of a frozen model (although of course it depends on the task).
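A minimal sketch of the kind of early stopping meant here, assuming a PyTorch-style model (with `state_dict`/`load_state_dict`) and caller-supplied `train_one_epoch` and `evaluate` callables, which are hypothetical placeholders for the real training and evaluation code:

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, evaluate,
                            max_epochs=100, patience=5):
    """Patience-based early stopping; `evaluate` returns a validation loss."""
    best_val, best_state, bad_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)                      # updates only the adapters
        val_loss = evaluate(model)                  # held-out split, even if tiny
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())  # snapshot best weights
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                               # stop before overfitting
    if best_state is not None:
        model.load_state_dict(best_state)           # restore the best checkpoint
    return model
```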
Aside: it may be possible, with careful hyperparameter tuning, to fine-tune the entire BERT model on a small number of examples (<1000). (Shameless self-plug) We have done some work in vision, where a pre-trained model with almost 1B parameters can be fully fine-tuned quite well on 1 image per class! arXiv. I think it is unclear whether this very-low-data regime would work well with BERT.
Thanks for the great work here. I have a question: when I read through the paper, my understanding was that training fewer parameters should bring a speed benefit (please correct me if this is wrong, since otherwise there would be little value in training fewer parameters). If so, I think reporting the training time cost would be very attractive.
My next question is: can adapter-bert also use much less training data for transfer learning? When we have little data (~10k contexts), we would probably want to freeze all of the BERT layers and only train the custom layers on top of them (e.g., an MLP for text categorization). If adapter-bert can achieve good performance when trained on a small dataset, that would be awesome.
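For reference, this is roughly what the frozen-BERT-plus-MLP baseline described above looks like. The sketch assumes the Hugging Face `transformers` PyTorch API rather than the TensorFlow code in this repo, and the 4-class head is hypothetical:

```python
import torch
from transformers import BertModel

class FrozenBertClassifier(torch.nn.Module):
    def __init__(self, num_labels: int = 4):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for param in self.bert.parameters():
            param.requires_grad = False        # freeze every BERT weight
        self.mlp = torch.nn.Sequential(        # only this head is trained
            torch.nn.Linear(self.bert.config.hidden_size, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, num_labels),
        )

    def forward(self, input_ids, attention_mask=None):
        with torch.no_grad():                  # frozen encoder, no gradients needed
            out = self.bert(input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # [CLS] token representation
        return self.mlp(cls)
```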