kentonl / e2e-coref

End-to-end Neural Coreference Resolution
Apache License 2.0

How exactly does head embedding get used? #30

Closed kalpitdixit closed 5 years ago

kalpitdixit commented 5 years ago
  1. glove_50_300_2.txt is downloaded as head_embeddings. What exactly does the 50 refer to?
  2. How are these used in coref_model.py? It seems that context_outputs is computed by an LSTM over the ELMo embeddings. Then, for each span, the context_outputs give a distribution over the span's tokens, and this distribution is used to take a weighted sum over the head_embeddings: https://github.com/kentonl/e2e-coref/blob/a24d1070c2b7e50bc71cfeb6881c8abfc870451c/coref_model.py#L379. What I want to confirm is that the head_embeddings are not used to compute that distribution itself, only to be summed over? (Rough sketch of my reading just after this list.)
  3. Also, from the paper's "Experimental Setup", I didn't follow the meaning of "window size" for the word embeddings and the LSTM inputs: "using GloVe word embeddings (Pennington et al., 2014) with a window size of 2 for the head word embeddings and a window size of 10 for the LSTM inputs."
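
To make sure I'm reading that block correctly, here is a rough NumPy sketch of my understanding for a single span, with the batching, span enumeration, and masking stripped out (the variable names are mine, not the ones in coref_model.py):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy example: one span covering 3 tokens.
num_tokens, context_dim, head_dim = 3, 400, 300
context_outputs = np.random.randn(num_tokens, context_dim)  # LSTM outputs over the ELMo inputs
head_emb = np.random.randn(num_tokens, head_dim)            # GloVe head embeddings for the same tokens

# The per-token attention scores come only from context_outputs
# (w is a stand-in for the learned projection in the repo)...
w = np.random.randn(context_dim)
head_scores = context_outputs @ w          # [num_tokens]
attention = softmax(head_scores)           # distribution over the span's tokens

# ...but the weighted sum is taken over head_emb.
span_head_emb = attention @ head_emb       # [head_dim]
```

So the head_embeddings only ever get multiplied by attention weights that were already computed from context_outputs; they never feed into the scores themselves. Is that right?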
kentonl commented 5 years ago
  1. The 50 refers to the minimum-frequency threshold used to build the vocabulary.
  2. Yes, your interpretation is exactly right. You can think of it as separating the keys and values of the attention mechanism: here, the head_embeddings are the values, and the keys are determined by the context_embeddings. In hindsight, I probably should have removed this separation for simplicity in the final version, since the improvement wasn't very large.
  3. The window size is referring to the hyperparameter on the x-axis in Figure 2b of the GloVe paper (https://nlp.stanford.edu/pubs/glove.pdf). The wording in the paper is a bit confusing. We were just trying to say that the context_embeddings have a window size of 10 and the head_embeddings have a window size of 2.
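
If it helps, "window size" is just how far to either side of a word the co-occurrence counts look when the GloVe vectors are trained. Here is a toy, unweighted illustration (the real GloVe pipeline also down-weights distant pairs by 1/distance, but the window itself works the same way):

```python
from collections import Counter

def cooccurrence_counts(tokens, window_size):
    """Count (word, context word) pairs within +/- window_size tokens."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window_size), min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "the quick brown fox jumped over the lazy dog".split()
print(cooccurrence_counts(tokens, window_size=2))   # narrow window, like the head_embeddings
print(cooccurrence_counts(tokens, window_size=10))  # wide window, like the LSTM-input embeddings
```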

Hope that helps!

kalpitdixit commented 5 years ago

Thanks for the fast and complete answers!

For "3." above, I see how using a smaller window size for the head_embeddings compared to the context_embeddings makes sense. Because the head_embeddings are used to represent a span which is typically a few tokens vs context_embeddings which are used to represent entire sentences. Nice idea.