piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Support for OOV words using only word-vectors #1953

Open · menshikh-iv opened this issue 6 years ago

menshikh-iv commented 6 years ago

Intro

A heavy limitation of word-vector models (Word2Vec, GloVe, etc.) is their fixed vocabulary: if the word "someword" is not in the model's vocab, you have no vector for it. Character n-gram embedding models like FastText fix this, but we already have many pretty nice word-embedding models that were trained on large corpora and show exciting results on benchmarks.

It would be really nice to construct vectors for OOV words too (based only on some pre-trained word-vectors).

Algorithm

A pretty nice algorithm is presented in the https://github.com/plasticityai/magnitude package (more information about the feature: https://github.com/plasticityai/magnitude#basic-out-of-vocabulary-keys and https://github.com/plasticityai/magnitude#advanced-out-of-vocabulary-keys).

A piece of pseudo-code demonstrating how this works:

import numpy as np

emb_dim = 300
oov_word = 'catsq'
random_vectors = []

for ngram in char_ngrams(oov_word):  # char_ngrams: character n-grams of the word (pseudo-code helper)
    # seed derived from the ngram, clamped into NumPy's valid seed range
    np.random.seed(seed=hash(ngram) % 2**32)
    # why * 2.0) - 1.0? some "precision" reasons?
    random_vectors.append((np.random.rand(emb_dim) * 2.0) - 1.0)

random_vector = np.mean(random_vectors, axis=0)
random_vector = random_vector / np.linalg.norm(random_vector)

# _db_query_similar_keys_vector: some kind of "text search" between ngrams & existing words in the model.
# This is the hardest part here: how to make it fast enough?

final_vector = (random_vector * 0.3 + self._db_query_similar_keys_vector(key, oov_word) * 0.7)
final_vector = final_vector / np.linalg.norm(final_vector)

Discussion

Can you answer my questions in the comments (# in the code) and describe in detail how _db_query_similar_keys_vector works, @AjayP13?

AjayP13 commented 6 years ago
    np.random.seed(seed=hash(ngram) % 2**32)
    random_vectors.append((np.random.rand(emb_dim) * 2.0) - 1.0)

Seeding the RNG with hash(ngram) creates a random vector that is deterministic: with this method, the random vector generated for the same OOV word will be the same across different computers.
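
A minimal sketch of that determinism, assuming a stable hash (Python's built-in hash() is randomized per process for strings, so zlib.crc32 stands in here; Magnitude's actual hash function may differ):

import numpy as np
import zlib

def ngram_random_vector(ngram, dim=300):
    # the seed is a pure function of the ngram, so the same ngram always
    # produces the same vector, on any machine and in any process
    seed = zlib.crc32(ngram.encode('utf-8'))  # stable across runs, unlike hash()
    rng = np.random.RandomState(seed)
    return rng.rand(dim) * 2.0 - 1.0

assert np.allclose(ngram_random_vector('cats'), ngram_random_vector('cats'))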

Explanation for _db_query_similar_keys_vector:

1. shrink collapses any run of 3 or more repeated characters down to just 2 characters. For example, hiiiiiii becomes hii.

2. Search the database for the keys whose character n-grams of length 6 have the highest number of matches with the out-of-vocabulary word's character n-grams of length 6. If there are ties, prefer shorter keys among those with the highest number of matches. If this doesn't yield at least 3 results, try n-grams of length 5, then 4, then 3. Searching incrementally from length 6 down to 3 makes the SQLite search faster, because you can quit early once you have at least 3 results (a sketch of this search is given below).

3. Mean the results, unit-norm the mean, and return it.
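
For illustration, a minimal in-memory sketch of this search, assuming a plain Python list as the vocabulary (the real Magnitude implementation does this with SQLite queries; shrink, char_ngrams and query_similar_keys here are illustrative helpers, not Magnitude's actual code):

import re
from collections import Counter

def shrink(word):
    # collapse runs of 3+ repeated characters to 2: "hiiiiiii" -> "hii"
    return re.sub(r'(.)\1{2,}', r'\1\1', word)

def char_ngrams(word, n):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def query_similar_keys(oov_word, vocab, min_results=3):
    oov_word = shrink(oov_word.lower())
    for n in (6, 5, 4, 3):  # quit early if a longer n already gives enough matches
        oov_grams = char_ngrams(oov_word, n)
        if not oov_grams:
            continue
        matches = Counter()
        for key in vocab:
            overlap = len(oov_grams & char_ngrams(key.lower(), n))
            if overlap:
                matches[key] = overlap
        # most shared n-grams first; shorter keys win ties
        ranked = sorted(matches, key=lambda k: (-matches[k], len(k)))
        if len(ranked) >= min_results:
            return ranked[:min_results]
    return []

print(query_similar_keys('catsq', ['cats', 'catsup', 'catsuit', 'dog']))
# -> ['cats', 'catsup', 'catsuit']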

menshikh-iv commented 6 years ago

@AjayP13 about random_vectors.append((np.random.rand(emb_dim) * 2.0) - 1.0): I wasn't clear, sorry. The question is why * 2.0) - 1.0; I don't get this part.

Thanks for the detailed description of the "ngram" search. I think it's possible to make it fast & efficient if we store only the "hashes" & lengths of the ngrams and perform the search by hashes only (or use some data structure such as a trie for the string matching).
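
A rough sketch of that hash-based idea (the function name and flat structure are hypothetical, and hash collisions plus the longest-n-first search order are glossed over here):

from collections import defaultdict

def build_ngram_hash_index(vocab, min_n=3, max_n=6):
    # map hash-of-ngram -> word ids, so the search compares integers
    # instead of strings and never stores the ngrams themselves
    index = defaultdict(set)
    for word_id, word in enumerate(vocab):
        for n in range(min_n, max_n + 1):
            for i in range(len(word) - n + 1):
                index[hash(word[i:i + n])].add(word_id)
    return index

index = build_ngram_hash_index(['cats', 'catsup', 'dog'])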

AjayP13 commented 6 years ago

Ah, the * 2.0 - 1.0 just rescales the randomly generated vectors from values in the range 0.0 to 1.0 into the range -1.0 to 1.0.
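
A quick illustration of that rescaling:

import numpy as np

v = np.random.rand(4)   # uniform values in [0.0, 1.0)
print(v * 2.0 - 1.0)    # the same values rescaled into [-1.0, 1.0)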

Right, I thought about using a tree structure, but had trouble finding a way to memory-map it efficiently (there don't seem to be many Python libraries for this sort of thing), and I still didn't want to load the entire tree into memory.

sharanry commented 6 years ago

@menshikh-iv Can I take this up?

menshikh-iv commented 6 years ago

@sharanry yes (but remember that this isn't easy). Besides the feature "as is", we also expect benchmarks that show the quality of the result (time/memory/algorithm performance).

sharanry commented 6 years ago

> @sharanry yes (but remember that this isn't easy). Besides the feature "as is", we also expect benchmarks that show the quality of the result (time/memory/algorithm performance).

@menshikh-iv I didn't get you. Should I wait for the benchmarks, or should I make the benchmarks? Are we going to use the algorithm as-is?

menshikh-iv commented 6 years ago

@sharanry if you pick this one, we are waiting for a benchmark from you :) To start with, as-is; but we'll probably change some details (like the search part or the coefficients) based on the benchmark, of course (not by picking parameters randomly).

sharanry commented 6 years ago

@menshikh-iv I have benchmarked the time, and I am working on the memory benchmarking. Here is the link to the notebook: https://colab.research.google.com/drive/1p9UhVrCZFmIxiqyvOXEzGpqXayeblupl

menshikh-iv commented 6 years ago

@sharanry no need to benchmark Magnitude now (that isn't our goal).

First, you should implement a very similar technique (algorithm) and check that it works correctly (tests). Second, compare it by time against the Magnitude implementation, and optimize your implementation until its performance is good enough.

sharanry commented 6 years ago

@menshikh-iv Are we trying to use this algorithm directly on the vocabulary file, instead of first making a .magnitude file? The bottleneck in Magnitude, from my understanding, is mainly the SQL queries.

menshikh-iv commented 6 years ago

@sharanry yes, we want to try to add this out-of-vocab feature to https://github.com/RaRe-Technologies/gensim/blob/122dad657688b51f0176a81a20bd1fa6d0986b8b/gensim/models/keyedvectors.py#L1051 (without the .magnitude SQL backend).
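
In other words, something like the following would eventually work. This is a sketch of the intended behaviour, not an existing API, and 'vectors.bin' is a placeholder path:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

vec = kv['cats']   # in-vocab lookup already works today

# with the proposed feature, an OOV key would return a vector constructed
# from its character ngrams; currently this raises a KeyError instead
vec = kv['catsq']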

sharanry commented 6 years ago

@menshikh-iv So basically the input is an out-of-vocab key and the output is the out-of-vocab vector?

The latter can then be used for other features like similarity, most_similar, etc.?

menshikh-iv commented 6 years ago

@sharanry exactly

jtlz2 commented 6 years ago

@menshikh-iv This is exactly what I have been searching for today; it would be amazing. Any idea of timescales? :)

menshikh-iv commented 6 years ago

@jtlz2 no, sorry :(