honnibal opened 7 years ago
Thank you for the detailed advice and the pointers. I will probably need some time to digest all this. The Bloom filter embedding sounds like a very cool trick; I recently came across a similar idea in the collaborative filtering world, where the number of items is very high, so they reduce it to a lower-dimensional space using MurmurHash.
Hey,
First, thanks for the kind words in various places :). I came across your posts, which led me here.
I also spent quite some time working on similarity models. I think they're surprisingly difficult to implement correctly in most deep learning toolkits. There are a few problems:
i. Dropout needs to be synchronised across the two 'halves' of the network. If we redraw the dropout mask separately for the two sentences, we'll end up with different vectors for the same input. This makes the model converge very slowly (see the sketch after this list).
ii. Batch normalization. I don't remember the exact results, but I do remember that I ended up not wanting to use batch norm in Siamese networks, because I found it too difficult to reason about.
iii. In one of my models, I assigned random vectors to OOV words, without ensuring the same word always mapped to the same vector.
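To make (i) concrete, here's a minimal NumPy sketch of sharing a single dropout mask across both halves. The shapes, the drop rate and the inverted-dropout scaling are just illustrative assumptions, not how any particular toolkit implements it:

```python
import numpy as np

def shared_dropout(left, right, drop=0.5, rng=None):
    """Apply the *same* dropout mask to both halves of a Siamese pair.

    left and right are (n_dims,) activation vectors for the two sentences.
    Drawing one mask keeps the two halves computing the same function,
    which is what the similarity objective assumes during training.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Inverted dropout: keep units with probability (1 - drop), rescale so
    # the expected activation is unchanged.
    mask = (rng.random(left.shape) >= drop) / (1.0 - drop)
    return left * mask, right * mask

# With two independently drawn masks, identical sentences would get
# different vectors, and the distance loss would be chasing noise.
left = np.ones(8)
right = np.ones(8)
l, r = shared_dropout(left, right, drop=0.5)
assert np.allclose(l, r)  # same mask + same input -> same output
```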
Finally, an extra tip that should help your similarity models :). I notice you're using pre-trained embeddings with a fixed-size vocabulary. This means that all words outside your vocabulary get mapped to the same representation. If you think about it, this is pretty bad: if our input sentences match on some rare word, that's a great feature! I think the best solution is to augment the static vectors with a learned component. Here's an example network that does this: https://github.com/explosion/thinc/blob/master/examples/text-pair/glove_mwe_multipool_siamese.py#L162
The network in that example uses a trickier "Embed" step: it sums the static vectors with learned vectors from my HashEmbed class, which uses the "hashing trick" that has been popular in sparse linear models. The insight is similar to Bloom filters, so a recent paper has called these "Bloom embeddings". Basically, you mod the key into a fixed-size table, and compute multiple conditionally independent keys per word. This lets the table map a very large number of vocabulary items to (mostly) distinct representations, with relatively few rows.

This hash embedding trick isn't the only solution to the OOV problem. Using a character LSTM to create the OOV word features would probably work well too --- but much more slowly.
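In case a concrete sketch helps, here's a toy NumPy version of that Embed step: a frozen static table summed with a hash-based learned table. It's not the actual HashEmbed code; the table size, the number of hash keys, the tiny stand-in GloVe dict, and Python's built-in hash (standing in for MurmurHash) are all illustrative assumptions:

```python
import numpy as np

class BloomEmbed:
    """Toy 'Bloom embedding': map an unbounded vocabulary into a small table.

    Each word is hashed with several keys, each hash is taken modulo the
    number of rows, and the selected rows are summed. Two words may collide
    on one key, but they're unlikely to collide on *all* keys, so most words
    still get distinct representations despite the small table.
    """
    def __init__(self, n_rows=5000, n_dims=64, n_keys=4, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(scale=0.1, size=(n_rows, n_dims))
        self.n_rows = n_rows
        self.n_keys = n_keys

    def __call__(self, word):
        # hash() stands in for MurmurHash here; any seedable hash works.
        rows = [hash((word, k)) % self.n_rows for k in range(self.n_keys)]
        return self.table[rows].sum(axis=0)


def embed(word, static_vectors, learned, n_dims=64):
    """Sum a frozen pre-trained vector with the learned hash embedding,
    so OOV words still get useful, word-specific features."""
    static = static_vectors.get(word, np.zeros(n_dims))
    return static + learned(word)


learned = BloomEmbed()
static_vectors = {"cat": np.ones(64)}                 # stand-in for a GloVe table
iv = embed("cat", static_vectors, learned)            # in-vocabulary: static + learned
oov = embed("tachycardia", static_vectors, learned)   # OOV: still gets its own vector
assert np.allclose(oov, embed("tachycardia", static_vectors, learned))
```

The point is just that the same word always hashes to the same rows, so even rare words keep (mostly) distinct, deterministic features without a huge embedding table.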