google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

can the pre-trained model be used as a language model? #35

Closed · wangwang110 closed this issue 5 years ago

wangwang110 commented 5 years ago

How can we use the pre-trained model to get the probability of a sentence?

jacobdevlin-google commented 5 years ago

It can't. You can only use it to get the probability of a single missing word in a sentence (or a small number of missing words). This is one of the fundamental trade-offs: masked LMs give you deep bidirectionality, but you no longer have a well-formed probability distribution over the sentence (which, in general, we don't care about).
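
For reference, a minimal sketch of the part that does work, scoring a single masked position. This is not from the thread; it assumes the later Hugging Face `transformers` API (the thread itself links the older pytorch-pretrained-BERT), and the checkpoint name, sentence, and target word are illustrative.

```python
# Sketch: probability of one masked word given the rest of the sentence.
# Assumes the `transformers` package and the `bert-base-uncased` checkpoint.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The cat sat on the [MASK] ."
inputs = tokenizer(text, return_tensors="pt")
# Find the position of the [MASK] token.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]

probs = torch.softmax(logits[0, mask_index], dim=-1)
word_id = tokenizer.convert_tokens_to_ids("mat")
print("P(mat | context) =", probs[0, word_id].item())
```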

xu-song commented 5 years ago

What about masking each word sequentially, then scoring the sentence by the sum of the per-word scores?
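
A rough sketch of that idea (mask each token in turn and sum the log-probabilities of the original tokens, sometimes called a pseudo-log-likelihood). As discussed below, the result is not a proper sentence probability, only a relative score. The function name and the use of the Hugging Face `transformers` API are illustrative assumptions, not part of this repo.

```python
# Sketch: pseudo-log-likelihood of a sentence under a masked LM.
# Assumes the `transformers` package and the `bert-base-uncased` checkpoint.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


def pseudo_log_likelihood(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Skip the [CLS] (first) and [SEP] (last) positions.
    for i in range(1, len(ids) - 1):
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[ids[i]].item()
    return total


print(pseudo_log_likelihood("The cat sat on the mat ."))
```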

hscspring commented 5 years ago

using BERT as a language Model · Issue #37 · huggingface/pytorch-pretrained-BERT

It's actually as @jacobdevlin-google said: BERT is really not a language model.

WolfNiu commented 5 years ago

> What about masking each word sequentially, then scoring the sentence by the sum of the per-word scores?

That way your calculation won't be correct.

Say the sentence has only two tokens, x1 and x2. By the chain rule, the true probability is P(x1) * P(x2 | x1), but your calculation will give P(x1 | x2) * P(x2 | x1), which is not the probability of the whole sentence. Note that this is not to say what you intended isn't doable; it's just that this particular way probably won't work.

Bachstelze commented 5 years ago

In "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model", Alex Wang and Kyunghyun Cho use the unnormalized log-probabilities from BERT to rank a set of sentences. For that purpose it seems to work.
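
If all you need is a relative ranking of candidate sentences, as in their setup, the `pseudo_log_likelihood` sketch earlier in this thread could be reused. This is hypothetical usage of that sketch; the higher score only indicates relative fluency, not a probability.

```python
# Hypothetical: rank candidate sentences by their unnormalized BERT score,
# reusing the pseudo_log_likelihood sketch defined above in this thread.
candidates = [
    "The cat sat on the mat .",
    "The cat sat on the the mat .",
]
ranked = sorted(candidates, key=pseudo_log_likelihood, reverse=True)
print(ranked[0])  # the higher-scoring (more fluent) candidate
```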

Shujian2015 commented 5 years ago

You can fine-tune BERT to be an LM: https://arxiv.org/abs/1904.09408