Closed lvjiujin closed 2 years ago
Hi,
Every character is matched with a different number of words, up to max_word_num. We use input_word_mask to mark which words are valid; that line of code makes sure only valid words contribute to the model. Hope it is helpful.
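To make the masking concrete, here is a minimal NumPy sketch of the idea (not the repo's actual code; the array values are illustrative): adding -10000 to the padded positions before the softmax drives their attention weights to effectively zero.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Attention scores for one character against max_word_num = 4 candidate words.
alpha = np.array([2.0, 1.0, 0.5, 3.0])
# Only the first two matched words are real; the last two are padding.
input_word_mask = np.array([1.0, 1.0, 0.0, 0.0])

# The line from the question: padded positions get a score of about -10000,
# so after softmax their weight is essentially 0.
alpha = alpha + (1.0 - input_word_mask) * (-10000.0)
weights = softmax(alpha)
# weights at the two padded positions are ~0; the two valid positions
# share nearly all of the probability mass.
```

The trick works because softmax only cares about relative scores: a score 10000 below the valid ones contributes exp(-10000) ≈ 0 to the normalization.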
Thank you, I see: after the bi-linear transform, word_outputs becomes alpha, which blurs the distinction between matched and unmatched words, so this line of code is used to suppress the unmatched words' information and attention.
When you compute the bi-linear attention in the class BertLayer, you have the following code:
I don't understand this line:
alpha = alpha + (1 - input_word_mask.float()) * (-10000.0)
Can you give me an explanation?