google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Splitting context and question for BERT #495

Open vivek1410patel opened 5 years ago

vivek1410patel commented 5 years ago

The paper suggests that the softmax for the start and end logits is computed only over the context hidden states. However, I could not find any place in the code where the hidden states are split into question and context hidden states. Does that mean the code computes the softmax over both the question and context hidden states, and expects the model to learn that the answer lies in the context? Or can you point me to the place where the hidden states are split, so that the final linear layer is applied only to the context?
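For reference, a minimal sketch (toy sizes and weights, not the repo's actual code) of how `run_squad.py`-style models produce span logits: a single dense layer is applied to every position of the packed `[CLS] question [SEP] context [SEP]` sequence, question tokens included, so the softmax ranges over the whole sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 12, 16                   # toy sizes (assumptions)
sequence_output = rng.standard_normal((seq_len, hidden))
w = rng.standard_normal((hidden, 2))       # one output column each for start/end

# One projection over ALL positions -- question tokens are not masked out here.
logits = sequence_output @ w               # shape (seq_len, 2)
start_logits, end_logits = logits[:, 0], logits[:, 1]

# The start softmax therefore covers every position in the packed sequence.
start_probs = np.exp(start_logits - start_logits.max())
start_probs /= start_probs.sum()
```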

libertatis commented 5 years ago

You can refer to the function `write_predictions` in `run_squad.py`. Note the comment around line #L787: "We could hypothetically create invalid predictions, e.g., predict that the start of the span is in the question. We throw out all invalid predictions." So the author first gets the n best start and end indices, and then throws out any index that falls inside the question. There is no split operation; the logits are computed over the full sequence and invalid spans are filtered afterwards.
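The filtering described above can be sketched as follows. This is a hypothetical simplification, not the actual `write_predictions` code: the function name, the `context_start`/`context_end` boundary arguments, and the default limits are illustrative assumptions.

```python
import numpy as np

def best_valid_span(start_logits, end_logits, context_start, context_end,
                    n_best=20, max_answer_len=30):
    """Return the (start, end) span with the highest combined logit score
    that lies entirely inside the context portion of the sequence."""
    # Take the n best candidate indices for start and end independently.
    start_idx = np.argsort(start_logits)[::-1][:n_best]
    end_idx = np.argsort(end_logits)[::-1][:n_best]

    best, best_score = None, -np.inf
    for s in start_idx:
        for e in end_idx:
            # Throw out invalid predictions: the span must lie within the
            # context, end must not precede start, and length is bounded.
            if s < context_start or e > context_end:
                continue
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (int(s), int(e)), score
    return best

# Toy example: tokens 0-3 are the question, 4-7 are the context. The highest
# raw start logit (position 1) is in the question, so it gets discarded.
start_logits = np.array([0., 5., 0., 0., 1., 4., 0., 0.])
end_logits   = np.array([0., 0., 3., 0., 0., 1., 4., 0.])
span = best_valid_span(start_logits, end_logits, context_start=4, context_end=7)
# span is (5, 6): the best start/end pair that lies inside the context
```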