erikchwang opened this issue 5 years ago
IMO it is more about learning language style than actual data. If the language of the new dataset is similar to Wikipedia's, BERT works exactly as expected, but when the dataset is in a different style, e.g. blogs, BERT's results are not as good.
@chwang85 It's not cheating, because when BERT was pre-trained on Wikipedia, it was trained to predict missing words and to classify whether the second sentence of a pair follows the first.
The task in SQuAD is to pick an answer from an input paragraph given a question. So although the corpus is still Wikipedia, BERT was never trained on this particular task, and there is no cheating involved. A similar argument applies to SWAG.
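To make that distinction concrete, here is a minimal sketch (using the Hugging Face transformers library rather than this repo's code, so treat the exact class names as an assumption) showing that the pretrained masked-LM head and the SQuAD-style span-prediction head are different heads; the span head starts out randomly initialized and is only learned during fine-tuning:

```python
import torch
from transformers import BertForMaskedLM, BertForQuestionAnswering, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Pretraining objective: fill in a masked token. This head was trained on
# Wikipedia + BookCorpus, but it never saw any (question, answer-span) pairs.
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = mlm(**inputs).logits
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
print(tokenizer.convert_ids_to_tokens(logits[0, mask_pos].argmax().item()))

# Fine-tuning objective for SQuAD: predict start/end positions of the answer
# span. Loading this class warns that the qa_outputs head is newly
# initialized, i.e. this part of the task was never pretrained.
qa = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
qa_inputs = tokenizer("Where is the Eiffel Tower?",
                      "The Eiffel Tower is in Paris.",
                      return_tensors="pt")
with torch.no_grad():
    out = qa(**qa_inputs)
print(out.start_logits.shape, out.end_logits.shape)  # one score per token
```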
I just want to point out that BERT is overestimated. An important reason for this overestimation is that BERT significantly outperforms the previous SOTAs on SQuAD and SWAG, which use the same corpora that BERT uses for its pretraining. If BERT had selected corpora other than Wikipedia and BookCorpus for its pretraining and still achieved the same performance on SQuAD and SWAG, that would be much more convincing. But I think that is impossible, because as far as I know, on many tasks that are not based on Wikipedia or BookCorpus, BERT only brings a very limited performance gain...
What do you think of taking a pretrained BERT and further pretraining it on the target corpus before fine-tuning it on the target task, like ULMFiT?
It will be OK as long as your "pretraining" and "further pretraining" do not involve any test data.
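For what it's worth, here is a rough sketch of what that "further pretraining" step could look like, written with the Hugging Face transformers Trainer rather than this repo's scripts (so take the exact classes as an assumption); `domain_corpus.txt` is a placeholder for your target-domain text:

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Continue the masked-LM objective on in-domain text (no test data included!).
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="domain_corpus.txt",  # placeholder: your target-domain corpus
    block_size=128,
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-domain-adapted",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("bert-domain-adapted")

# Then fine-tune on the downstream task starting from "bert-domain-adapted"
# instead of "bert-base-uncased".
```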
It's not any different from using a language model in machine translation to improve translation into that language, or from using word2vec embeddings as features for a task on the same kind of data. What you are referring to is more like transfer learning: transferring knowledge acquired from one domain or distribution and leveraging it for another. That said, I agree with you that BERT may not be the ImageNet of NLP as some have hoped (http://ruder.io/nlp-imagenet/). This remains to be shown, maybe by pretraining BERT on even more corpora...
You did not get my point. SQuAD (both train and test) comes entirely from Wikipedia, and SWAG comes from BookCorpus. BERT is now so popular mainly because it achieved SOTA on these two datasets. But why is BERT pretrained on Wikipedia and BookCorpus? Why not use other corpora? Just explaining it away as "transfer learning" is not convincing at all...
Are you assuming that the SQuAD questions are in the Wikipedia data used for training BERT? I don't think so, though I could be wrong. If they are not, then I don't think training on Wikipedia is any more cheating than, for example, using a language model as a feature.
Wikipedia has no "questions"; the questions in SQuAD are written by human beings. But all passages in SQuAD (both train and test) are from Wikipedia. The thing is, when a model has somehow "seen" the test data, it is much easier for it to achieve a good test result. Suppose a student has seen all the materials used in a reading test (the passages alone, without the questions, are enough!): he will definitely get a much better mark than the others.
I really want to see if BERT can achieve the same performance on SQuAD even if it is pretrained on other corpora, such as Google News...
By the way, BERT is also too intrusive for task models. You have to use the tokenization method that BERT requires, but what if your text has already been tokenized in a different way?
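If it helps, one common workaround (sketched here with the Hugging Face fast tokenizer rather than this repo's tokenization.py, so treat the API as an assumption) is to pass your existing tokens in as pre-split words and keep the word-to-subword alignment, so WordPiece-level predictions can be mapped back onto your own tokenization:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Text that was already tokenized by some other pipeline.
my_tokens = ["Johanson", "lives", "in", "Reykjavik", "."]

encoding = tokenizer(my_tokens, is_split_into_words=True)

# The WordPiece view of the same input (rare words get split into sub-pieces).
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))

# word_ids() maps each WordPiece back to the index of the original token
# (None for [CLS]/[SEP]), so token-level labels or predictions can be
# realigned with your own tokenization.
print(encoding.word_ids())
```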
I know BERT has achieved SOTA on many NLP tasks, such as SQuAD and SWAG. But note that the data (both training and test) of SQuAD is from Wikipedia, and that of SWAG is from BookCorpus, and BERT is pre-trained on exactly these two corpora! In other words, BERT has somehow "seen" the test data before the test. Is this a kind of cheating?
A valuable question!
@erikchwang The train and test data can come from the same type of data (Wikipedia), but they are not the exact same passages, questions, or answers. The samples in the test data should not be present in the training data, but they should come from a similar data domain. The idea that data should be i.i.d. explains it all.
That is my answer to your question. Maybe I am missing something very subtle in your question. Enlighten me.
@chunduri11 Yes, you missed almost everything. BERT's pre-training is on the FULL Wikipedia and the FULL BookCorpus. But for SQuAD, the fine-tuning of BERT is on a small subset of passages from Wikipedia, and the test is on another small subset of passages from Wikipedia. So the pre-training data covers both the fine-tuning data and the test data, just in different formats (LM vs. MRC).
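As a rough way to see the scale of that coverage, here is a small sketch of my own (using the Hugging Face `datasets` package, which is an assumption on my part) that just counts how many distinct Wikipedia passages and articles the SQuAD dev set is built from; checking them against a specific pretraining dump is left out:

```python
from datasets import load_dataset

# SQuAD v1.1 dev set: every "context" field is a paragraph lifted from Wikipedia.
squad_dev = load_dataset("squad", split="validation")

contexts = {example["context"] for example in squad_dev}
articles = {example["title"] for example in squad_dev}

print(f"{len(contexts)} distinct dev passages drawn from {len(articles)} Wikipedia articles")
# A Wikipedia-pretrained model has (in LM form) already seen these passages,
# which is exactly the coverage being discussed above.
```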