google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

How to get masked word prediction probabilities #608

Open Oxi84 opened 5 years ago

Oxi84 commented 5 years ago

Original sentence: i love apples. there are a lot of fruits in the world that i like, but apples would be my favorite fruit. Masked sentence: i love apples . there are a lot of fruits in the world that i [MASK] , but apples would be my favorite fruit .

When I run this through the PyTorch version of BERT, I get the following scores for the masked position:

Best predicted word: ['love'] tensor(12.7276, grad_fn=)

Other words along with their probabilities:
['like'] tensor(10.2872, grad_fn=)
['miss'] tensor(8.8226, grad_fn=)
['know'] tensor(8.5971, grad_fn=)
['am'] tensor(7.9407, grad_fn=)
['hate'] tensor(7.9209, grad_fn=)
['mean'] tensor(7.8873, grad_fn=)
['enjoy'] tensor(7.8813, grad_fn=)
['want'] tensor(7.6885, grad_fn=)
['prefer'] tensor(7.5712, grad_fn=)

I am quite sure this does not mean that the probability of the word "love" is proportional to 12.7276 and that of "like" to 10.2872. I also know that the sum of func(score) over the whole vocabulary is 1, but I do not know what that func is.

Thanks
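For reference, here is a minimal sketch of how such masked-position scores can be obtained, assuming the current Hugging Face transformers API (BertTokenizer / BertForMaskedLM) rather than the original pytorch-pretrained-bert package; the exact numbers will depend on the checkpoint.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = ("i love apples . there are a lot of fruits in the world that i [MASK] , "
        "but apples would be my favorite fruit .")
inputs = tokenizer(text, return_tensors="pt")

# Locate the [MASK] token in the input ids.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

# Raw (unnormalized) scores for every vocabulary word at the masked position.
mask_logits = logits[0, mask_pos[0]]
top = torch.topk(mask_logits, 10)
for score, idx in zip(top.values, top.indices):
    print(tokenizer.convert_ids_to_tokens(int(idx)), float(score))
```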

Oxi84 commented 5 years ago

I see that the output function for BERT should be log_softmax, but IMO then all the values should be less than zero, because softmax is less than 1.

nellymin commented 5 years ago

I think the values you have shown are the logits; they need to be passed through the softmax function to produce the actual probabilities (log_softmax gives the log of those probabilities). As the Wikipedia page states: "In mathematics, the softmax function, also known as softargmax or the normalized exponential function, is a function that takes as input a vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities. That is, prior to applying softmax, some vector components could be negative, or greater than one, and might not sum to 1; but after applying softmax, each component will be in the interval (0, 1), and the components will add up to 1, so that they can be interpreted as probabilities." So it is no wonder they don't add up to 1.
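Concretely, if mask_logits is the (vocab_size,) vector of raw scores for the masked position, applying softmax over the whole vocabulary turns it into a proper distribution, and log_softmax gives the (always non-positive) log of that distribution. A small self-contained check, using a random stand-in for the real logits:

```python
import torch
import torch.nn.functional as F

# Random stand-in for the (vocab_size,) vector of raw masked-LM scores;
# with the real model this would be logits[0, mask_position].
vocab_size = 30522                 # bert-base-uncased vocabulary size
mask_logits = torch.randn(vocab_size)

probs = F.softmax(mask_logits, dim=-1)          # each entry in (0, 1)
log_probs = F.log_softmax(mask_logits, dim=-1)  # log of the above, every entry <= 0

print(probs.sum())      # tensor(1.0000): normalized over the full vocabulary
print(log_probs.max())  # never above 0, unlike the raw scores (e.g. 12.7276)
```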

Oxi84 commented 5 years ago

Thanks. Do you mean softmax, which is softmax(x1) = e^x1 / (e^x1 + e^x2 + ... + e^xn), or log(softmax(x1))? Thanks
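For what it's worth, the two agree up to the log: softmax(x)_i = e^{x_i} / sum_j e^{x_j}, and log_softmax is simply its logarithm. A quick check with the top scores quoted above as toy values (not a full vocabulary):

```python
import torch
import torch.nn.functional as F

# Using the three largest raw scores quoted above as toy values.
x = torch.tensor([12.7276, 10.2872, 8.8226])

# softmax(x)_1 = e^x1 / (e^x1 + e^x2 + ... + e^xn)
manual = torch.exp(x) / torch.exp(x).sum()

print(torch.allclose(manual, F.softmax(x, dim=-1)))                 # True
print(torch.allclose(torch.log(manual), F.log_softmax(x, dim=-1)))  # True
print(manual)   # roughly tensor([0.903, 0.079, 0.018]) over these three scores only
```

Note that real probabilities come from normalizing over the full vocabulary, not just the top-scoring words, so the three values above are only illustrative.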