menshikh-iv opened 6 years ago
```python
np.random.seed(seed=hash(ngram))
random_vectors.append((np.random.rand(emb_dim) * 2.0) - 1.0)
```
Seeding the RNG with `hash(ngram)` makes the random vector deterministic: the same OOV word yields the same vector, even across different computers.
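A minimal sketch of this idea. The stable hash (`zlib.adler32`) is an assumption on my part: Python's built-in `hash()` is salted per process, so it alone would not be reproducible across machines, and `np.random.seed()` needs an integer anyway.

```python
import zlib

import numpy as np

emb_dim = 300
ngram = "examp"  # hypothetical OOV character n-gram

# A stable 32-bit hash keeps the seed identical across machines and runs.
np.random.seed(seed=zlib.adler32(ngram.encode("utf-8")))
vector = (np.random.rand(emb_dim) * 2.0) - 1.0  # uniform in [-1.0, 1.0)
```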
Explanation for `_db_query_similar_keys_vector`:
```python
BOW = EOW = u"\uF000"
key = BOW + shrink(key) + EOW
```
BOW and EOW are private-use Unicode codepoints, which should be fairly rare characters. They make the string-similarity search more robust: when scoring candidates by the number of matching character n-grams, words that preserve some sense of character ordering are preferred.
`shrink` collapses any run of 3 or more repeated characters down to 2. For example, `hiiiiiii` becomes `hii`.
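A minimal sketch of such a `shrink` helper, assuming a regex-based implementation (the actual Magnitude code may differ):

```python
import re

def shrink(key):
    # Collapse any run of 3 or more repeated characters down to 2,
    # e.g. "hiiiiiii" -> "hii".
    return re.sub(r"(.)\1{2,}", r"\1\1", key)
```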
Search the database for the keys whose 6-character n-grams have the highest number of matches with the OOV key's 6-character n-grams. If there are ties, prefer shorter keys with the highest number of matches. If this doesn't yield at least 3 results, retry with 5-character n-grams, then 4, then 3. Backing off from length 6 down to 3 makes the SQLite search faster, because you can quit early once the 6-character n-grams already return at least 3 results.
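A sketch of that back-off loop. `query_keys_by_ngrams` is hypothetical and stands in for the actual SQLite FTS query:

```python
def char_ngrams(key, n):
    return [key[i:i + n] for i in range(len(key) - n + 1)]

def find_similar_keys(oov_key, query_keys_by_ngrams, min_results=3):
    # Try 6-character n-grams first; back off to 5, 4, then 3 only if
    # fewer than `min_results` candidates were found, so the common
    # case can quit early after a single query.
    for n in (6, 5, 4, 3):
        # Expected to return candidates ordered by most matching
        # n-grams, with ties broken by shorter key.
        results = query_keys_by_ngrams(char_ngrams(oov_key, n))
        if len(results) >= min_results:
            return results
    return results
```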
Average the resulting vectors, unit-normalize the mean, and return it.
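In NumPy terms, that final step might look like this (a sketch, not Magnitude's actual code):

```python
import numpy as np

def combine(neighbor_vectors):
    # Average the vectors of the matched keys, then unit-normalize
    # the mean so the returned OOV vector has length 1.
    mean = np.mean(neighbor_vectors, axis=0)
    return mean / np.linalg.norm(mean)
```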
Full Algorithm - Edge Case: if the key is <= 6 characters long, sort results first by whether they start and end with the same characters as the key, then by highest number of character n-gram matches, then by shortest length. Empirically, this performs better for shorter words.
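That tie-breaking could be expressed as a sort key like the following (hypothetical shapes: each result is a `(candidate_key, matches)` pair, where `matches` counts shared character n-grams):

```python
def sort_short_key_results(oov_key, results):
    # Prefer candidates sharing the first and last character with the
    # OOV key, then more n-gram matches, then shorter candidates.
    return sorted(
        results,
        key=lambda r: (
            not (r[0][0] == oov_key[0] and r[0][-1] == oov_key[-1]),
            -r[1],
            len(r[0]),
        ),
    )
```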
Performance / Efficiency: We achieve performance by pre-computing character n-grams and storing them in an SQLite FTS-indexed table. This gives us a time trade-off during conversion (converting from `.txt`/`.bin` to `.magnitude` takes longer, which is usually fine since conversion only has to be done once) and a space trade-off on the `.magnitude` file size (adds around ~1GB to Google's word2vec).
@AjayP13 about `random_vectors.append((np.random.rand(emb_dim) * 2.0) - 1.0)`: I wasn't clear, sorry. The question is about the `* 2.0) - 1.0` part, which I don't understand.
Thanks for the detailed description of the n-gram search. I think it's possible to make it fast and efficient if we store only hashes and lengths for the n-grams and perform the search by hashes only (or use a data structure such as a trie for the string matching).
Ah, the `* 2.0 - 1.0` just shifts the randomly generated values from the range [0.0, 1.0) to [-1.0, 1.0).
Right, I thought about using a tree structure, but had trouble finding a way to memory-map it efficiently (there don't seem to be many Python libraries for this sort of thing), and I still didn't want to load the entire tree into memory.
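A toy sketch of the hash-only idea suggested above, purely illustrative (an in-memory dict stands in for whatever storage would actually be used, and `zlib.adler32` is an arbitrary stable hash):

```python
from collections import defaultdict
import zlib

ngram_index = defaultdict(set)  # 32-bit n-gram hash -> vocab keys

def index_vocab(vocab, n=3):
    for key in vocab:
        for i in range(len(key) - n + 1):
            ngram_index[zlib.adler32(key[i:i + n].encode())].add(key)

def candidates(oov_key, n=3):
    # Count how many hashed n-grams each vocab key shares with the
    # OOV key; more matches first, shorter keys break ties.
    hits = defaultdict(int)
    for i in range(len(oov_key) - n + 1):
        for key in ngram_index[zlib.adler32(oov_key[i:i + n].encode())]:
            hits[key] += 1
    return sorted(hits, key=lambda k: (-hits[k], len(k)))
```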
@menshikh-iv Can I take this up?
@sharanry yes (but remember that this isn't easy); besides the feature "as is", we are also waiting for benchmarks that show the quality of the result (time / memory / algorithm performance).
@menshikh-iv I didn't quite follow. Should I wait for the benchmarks, or should I make the benchmarks? Are we going to use the algorithm as is?
@sharanry if you pick this one, we are waiting for a benchmark from you :) To start, implement it as is; we'll probably change some details later (like the search part or the coefficients), but based on the benchmark of course (not by picking parameters at random).
@menshikh-iv I have benchmarked the time and am working on the memory benchmarking. Here is the link to the notebook: https://colab.research.google.com/drive/1p9UhVrCZFmIxiqyvOXEzGpqXayeblupl
@sharanry no need to benchmark magnitude now (this isn't our goal).
First of all, you should implement a very similar technique (algorithm) and check that it works correctly (tests). Second, compare it by time with the Magnitude implementation and optimize your implementation until the performance is good enough.
@menshikh-iv Are we trying to use this algorithm directly on the vocabulary file, instead of first making a `.magnitude` file? The bottleneck of Magnitude, from my understanding, is mainly the SQL queries.
@sharanry yes, we want to try to add this out-of-vocab feature to https://github.com/RaRe-Technologies/gensim/blob/122dad657688b51f0176a81a20bd1fa6d0986b8b/gensim/models/keyedvectors.py#L1051 (without the `.magnitude` SQL backend)
@menshikh-iv So basically the input is an out-of-vocab key and the output is the out-of-vocab vector? And this vector can then be used for other features like similarity, most_similar, etc.?
@sharanry exactly
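Hypothetical usage once such a feature landed in `KeyedVectors` (the OOV behavior shown here does not exist yet; today an unknown key raises `KeyError`, and the file name is just an example):

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors.bin", binary=True)

# Hypothetical: an OOV key would return a deterministic, interpolated
# vector instead of raising KeyError...
vec = kv["worde2vec"]

# ...so downstream features like most_similar would work on OOV keys.
print(kv.most_similar("worde2vec", topn=3))
```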
@menshikh-iv This is exactly what I have been searching for today - would be amazing - any idea of timescales? :)
@jtlz2 no, sorry :(
Intro
The heavy limitation of word vectors (Word2Vec, GloVe, etc.) is the fixed vocab: if the word "someword" is not in the model's vocab, you have no vector for it. This is fixed by character n-gram embedding models like FastText, but we already have many pretty nice word-embedding models that were trained on large corpora and show exciting results on benchmarks. It would be really nice to construct a vector for OOV words too (based on some pre-trained word vectors).
Algorithm
A pretty nice algorithm is presented in the https://github.com/plasticityai/magnitude package (more information about the feature: https://github.com/plasticityai/magnitude#basic-out-of-vocabulary-keys and https://github.com/plasticityai/magnitude#advanced-out-of-vocabulary-keys).
Some pseudo-code to demonstrate how this works:
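A condensed sketch assembled from the explanation earlier in this thread (hypothetical, not Magnitude's actual code; `query_similar_keys_vector` stands in for `_db_query_similar_keys_vector`, and the 0.3/0.7 mixing weights are a guess):

```python
import zlib

import numpy as np

def oov_vector(word, query_similar_keys_vector, emb_dim=300):
    # 1. Deterministic random vectors for the word's character n-grams
    #    (a stable hash such as zlib.adler32 keeps this reproducible).
    random_vectors = []
    for n in (3, 4, 5, 6):
        for i in range(len(word) - n + 1):
            np.random.seed(seed=zlib.adler32(word[i:i + n].encode()))
            random_vectors.append((np.random.rand(emb_dim) * 2.0) - 1.0)
    random_part = np.mean(random_vectors, axis=0)

    # 2. Vector interpolated from in-vocabulary keys with similar
    #    character n-grams (see the search description above).
    similar_part = query_similar_keys_vector(word)

    # 3. Combine and unit-normalize.
    vec = 0.3 * random_part + 0.7 * similar_part
    return vec / np.linalg.norm(vec)
```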
Discussion
Can you answer my questions in the comments (`#` in the code) and describe in detail how `_db_query_similar_keys_vector` works, @AjayP13?