sujitpal / eeap-examples

Code for Document Similarity on Reuters dataset using Encode, Embed, Attend, Predict recipe

Some suggestions on attention and document similarity #2

Open honnibal opened 7 years ago

honnibal commented 7 years ago

Hey,

First, thanks for the kind words in various places :). I came across your posts, which led me here.

I also spent quite some time working on similarity models. I think they're surprisingly difficult to implement correctly in most deep learning toolkits. There are two problems:

  1. It's pretty hard to maintain the symmetry. We'd like to guarantee that the Siamese network always maps a pair of identical sentences into identical vectors. Intuitively, identical inputs should give 1.0 similarity, right? But lots of things can go wrong to prevent this. Here are some of the problems I've had in different implementations:

i. Dropout needs to be synchronised across the two 'halves' of the network. If we draw different dropout masks for the two sentences, we'll end up with different vectors for the same input. This makes the model converge very slowly.

ii. Batch normalization. I don't remember exact results, but I do remember I ended up not wanting to use batch norm in Siamese networks, because I found it too difficult to reason about.

iii. In one of my models, I assigned random vectors to OOV words, without ensuring the same word always mapped to the same vector.
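The dropout synchronisation in (i) can be sketched very simply: draw one mask and apply it to both halves, instead of calling dropout twice. This is a minimal NumPy illustration (the function name and shapes are my own, not from any particular toolkit):

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_dropout(left, right, rate=0.5):
    """Apply the SAME dropout mask to both halves of a Siamese pair.

    left, right: arrays of shape (n_tokens, width) for the two sentences.
    Drawing a single mask guarantees f(x, x) gives identical vectors,
    which two independent dropout calls would not.
    """
    mask = (rng.random(left.shape[1]) >= rate) / (1.0 - rate)
    return left * mask, right * mask

x = rng.normal(size=(4, 8))
a, b = shared_dropout(x, x)
assert np.array_equal(a, b)  # identical inputs stay identical
```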

  2. Most libraries make you pad your inputs with zeros. Most attention layers then do some sort of pooling operation. If you're averaging, you need to normalize by only the input tokens, and exclude the padding tokens. We also want to make sure we're not sending gradient through the padding tokens. I could never get this correct in Keras. You can find my effort to replicate Parikh et al.'s decomposable attention model here: https://github.com/explosion/spaCy/tree/master/examples/keras_parikh_entailment . Other people have worked on the code since, but as far as I know it's still not correct.
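The masked-averaging point is easy to see in isolation. A minimal sketch (my own helper, not from Keras or spaCy), showing how padding tokens would otherwise dilute the mean:

```python
import numpy as np

def masked_mean(embedded, mask):
    """Average only over real tokens, ignoring zero padding.

    embedded: (batch, max_len, width) padded token vectors
    mask:     (batch, max_len), 1.0 for real tokens, 0.0 for padding
    """
    summed = (embedded * mask[:, :, None]).sum(axis=1)
    lengths = mask.sum(axis=1, keepdims=True)
    return summed / np.maximum(lengths, 1.0)  # guard against all-pad rows

# A length-2 sentence padded to length 4: divide by 2, not by 4.
emb = np.array([[[2.0], [4.0], [0.0], [0.0]]])
mask = np.array([[1.0, 1.0, 0.0, 0.0]])
print(masked_mean(emb, mask))  # [[3.]] -- a plain mean would give [[1.5]]
```

Because the padding positions are zeroed by the mask before the sum, no gradient flows through them either.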

Finally, an extra tip that should help your similarity models :). I notice you're using pre-trained embeddings, and are using a fixed-size vocabulary. This means that all words outside your vocabulary will be mapped to the same representation. If you think about it, this is pretty bad: if our input sentences match on some rare word, that's a great feature! I think the best solution is to augment the static vectors with a learned component. Here's an example network that does this: https://github.com/explosion/thinc/blob/master/examples/text-pair/glove_mwe_multipool_siamese.py#L162

The network in that example uses a trickier "Embed" step: the sum of the static vectors and learned vectors from my HashEmbed class, which uses the "hashing trick" that has been popular in sparse linear models. The insight is similar to Bloom filters, so a recent paper has called this "Bloom embeddings". Basically you just mod the key into a fixed-size table, and compute multiple conditionally independent keys per word. This allows the table to map a very large number of vocabulary items to (mostly) distinct representations, with relatively few rows.
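To make the idea concrete, here is a toy version of that scheme. This is my own sketch, not thinc's actual HashEmbed implementation: I use CRC32 with different seeds as a stand-in for the real hash functions, and a deliberately small table:

```python
import numpy as np
import zlib

rng = np.random.default_rng(0)
N_ROWS, WIDTH, N_KEYS = 1000, 16, 3   # small table, several keys per word
table = rng.normal(scale=0.1, size=(N_ROWS, WIDTH))

def hash_embed(word):
    """Sum N_KEYS rows chosen by seeded hashes of the word, mod table size.

    Two words may collide on one row, but colliding on all N_KEYS rows
    is very unlikely, so most words get distinct summed vectors even
    though the table has far fewer rows than the vocabulary.
    """
    keys = [zlib.crc32(f"{seed}:{word}".encode()) % N_ROWS
            for seed in range(N_KEYS)]
    return table[keys].sum(axis=0)

# Same word -> same vector; different words -> (almost always) different.
assert np.array_equal(hash_embed("protease"), hash_embed("protease"))
assert not np.array_equal(hash_embed("protease"), hash_embed("kinase"))
```

In the full recipe you would then add this learned vector to the static pre-trained vector (or a zero vector for OOV words), so rare words still get usable, mostly unique representations.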

This hash embedding trick isn't the only solution to the OOV problem. Using a character LSTM to create the OOV word features would probably work well too --- but much more slowly.

sujitpal commented 7 years ago

Thank you for the detailed advice and the pointers. I will probably need some time to digest all this. The Bloom filter embedding sounds like a very cool trick; I recently came across a similar idea in the collaborative filtering world, where the number of items is very high, so they reduce it to a lower-dimensional space using MurmurHash.
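That collaborative-filtering trick is the same idea at its simplest: hash a huge item-id space down to a fixed number of buckets. A tiny sketch (using CRC32 as a stand-in for MurmurHash, and a bucket count I made up):

```python
import zlib

N_BUCKETS = 10_000  # far smaller than the raw item-id space

def item_bucket(item_id: str) -> int:
    """Map an arbitrary item id into a fixed, lower-dimensional index space."""
    return zlib.crc32(item_id.encode()) % N_BUCKETS

bucket = item_bucket("item:9f8a7c2e")
assert 0 <= bucket < N_BUCKETS
```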