piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Support for OOV words using only word-vectors #1953

Open · menshikh-iv opened this issue 6 years ago

menshikh-iv commented 6 years ago

Intro

A heavy limitation of word-vector models (Word2Vec, GloVe, etc.) is their fixed vocabulary: if the word "someword" is not in the model's vocab, you have no vector for it. Character n-gram embedding models like FastText fix this, but we already have many pretty nice word-embedding models that were trained on large corpora and show exciting results on benchmarks.

It would be really nice to construct vectors for OOV words too (based only on some pre-trained word-vectors).

Algorithm

A pretty nice algorithm is presented in the https://github.com/plasticityai/magnitude package (more information about the feature: https://github.com/plasticityai/magnitude#basic-out-of-vocabulary-keys and https://github.com/plasticityai/magnitude#advanced-out-of-vocabulary-keys).

A piece of pseudo-code demonstrating how this works:

import numpy as np

emb_dim = 300
oov_word = 'catsq'
random_vectors = []

for ngram in char_ngrams(oov_word):  # char_ngrams: character n-grams of the word (pseudo-code helper)
    # seed derived from the ngram, clamped into NumPy's valid seed range
    np.random.seed(seed=hash(ngram) % 2**32)
    # why * 2.0) - 1.0? some "precision" reasons?
    random_vectors.append((np.random.rand(emb_dim) * 2.0) - 1.0)

random_vector = np.mean(random_vectors, axis=0)
random_vector = random_vector / np.linalg.norm(random_vector)

# _db_query_similar_keys_vector: some kind of "text search" between ngrams & existing words in the model.
# This is the hardest part here: how to make it fast enough?

final_vector = (random_vector * 0.3 + self._db_query_similar_keys_vector(key, oov_word) * 0.7)
final_vector = final_vector / np.linalg.norm(final_vector)

Discussion

Can you answer my questions in the comments (# in the code) and describe in detail how _db_query_similar_keys_vector works, @AjayP13?

AjayP13 commented 6 years ago
    np.random.seed(seed=hash(ngram) % 2**32)
    random_vectors.append((np.random.rand(emb_dim) * 2.0) - 1.0)

Seeding the RNG with hash(ngram) creates a random vector that is deterministic: with this method, the random vector generated for the same OOV word will be the same across different computers.
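
A minimal sketch of that determinism, assuming a stable hash (Python's built-in hash() is randomized per process for strings, so zlib.crc32 stands in here; Magnitude's actual hash function may differ):

import numpy as np
import zlib

def ngram_random_vector(ngram, dim=300):
    # the seed is a pure function of the ngram, so the same ngram always
    # produces the same vector, on any machine and in any process
    seed = zlib.crc32(ngram.encode('utf-8'))  # stable across runs, unlike hash()
    rng = np.random.RandomState(seed)
    return rng.rand(dim) * 2.0 - 1.0

assert np.allclose(ngram_random_vector('cats'), ngram_random_vector('cats'))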

Explanation for _db_query_similar_keys_vector:

1. shrink collapses any run of 3 or more repeated characters down to just 2 characters. For example, hiiiiiii becomes hii.

2. Search the database for the keys whose character n-grams of length 6 have the highest number of matches with the out-of-vocabulary word's character n-grams of length 6. If there are ties, prefer shorter keys among those with the highest number of matches. If this doesn't yield at least 3 results, try n-grams of length 5, then 4, then 3. Searching incrementally from length 6 down to 3 makes the SQLite search faster, because you can quit early once you have at least 3 results (a sketch of this search is given below).

3. Mean the results, unit-norm the mean, and return it.
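
For illustration, a minimal in-memory sketch of this search, assuming a plain Python list as the vocabulary (the real Magnitude implementation does this with SQLite queries; shrink, char_ngrams and query_similar_keys here are illustrative helpers, not Magnitude's actual code):

import re
from collections import Counter

def shrink(word):
    # collapse runs of 3+ repeated characters to 2: "hiiiiiii" -> "hii"
    return re.sub(r'(.)\1{2,}', r'\1\1', word)

def char_ngrams(word, n):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def query_similar_keys(oov_word, vocab, min_results=3):
    oov_word = shrink(oov_word.lower())
    for n in (6, 5, 4, 3):  # quit early if a longer n already gives enough matches
        oov_grams = char_ngrams(oov_word, n)
        if not oov_grams:
            continue
        matches = Counter()
        for key in vocab:
            overlap = len(oov_grams & char_ngrams(key.lower(), n))
            if overlap:
                matches[key] = overlap
        # most shared n-grams first; shorter keys win ties
        ranked = sorted(matches, key=lambda k: (-matches[k], len(k)))
        if len(ranked) >= min_results:
            return ranked[:min_results]
    return []

print(query_similar_keys('catsq', ['cats', 'catsup', 'catsuit', 'dog']))
# -> ['cats', 'catsup', 'catsuit']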

menshikh-iv commented 6 years ago

@AjayP13 about random_vectors.append((np.random.rand(emb_dim) * 2.0) - 1.0): I wasn't clear, sorry. The question is why * 2.0) - 1.0; I don't get this part.

Thanks for the detailed description of the "ngram" search. I think it's possible to make it fast & efficient if we store only the "hashes" & lengths of the ngrams and perform the search by hashes only (or use some data structure such as a trie for the string matching).
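
A rough sketch of that hash-based idea (the function name and flat structure are hypothetical, and hash collisions plus the longest-n-first search order are glossed over here):

from collections import defaultdict

def build_ngram_hash_index(vocab, min_n=3, max_n=6):
    # map hash-of-ngram -> word ids, so the search compares integers
    # instead of strings and never stores the ngrams themselves
    index = defaultdict(set)
    for word_id, word in enumerate(vocab):
        for n in range(min_n, max_n + 1):
            for i in range(len(word) - n + 1):
                index[hash(word[i:i + n])].add(word_id)
    return index

index = build_ngram_hash_index(['cats', 'catsup', 'dog'])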

AjayP13 commented 6 years ago

Ah, the * 2.0 - 1.0 just rescales the randomly generated vectors from values in the range 0.0 to 1.0 into the range -1.0 to 1.0.
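
A quick illustration of that rescaling:

import numpy as np

v = np.random.rand(4)   # uniform values in [0.0, 1.0)
print(v * 2.0 - 1.0)    # the same values rescaled into [-1.0, 1.0)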

Right, I thought about using a tree structure, but had trouble finding a way to memory-map it efficiently (there don't seem to be many Python libraries for this sort of thing), and I still didn't want to load the entire tree into memory.

sharanry commented 6 years ago

@menshikh-iv Can I take this up?

menshikh-iv commented 6 years ago

@sharanry yes (but remember that this isn't easy). Besides the feature "as is", we also expect benchmarks that show the quality of the result (time/memory/algorithm performance).

sharanry commented 6 years ago

> @sharanry yes (but remember that this isn't easy). Besides the feature "as is", we also expect benchmarks that show the quality of the result (time/memory/algorithm performance).

@menshikh-iv I didn't get you. Should I wait for the benchmarks, or should I make the benchmarks? Are we going to use the algorithm as-is?

menshikh-iv commented 6 years ago

@sharanry if you pick this one, we are waiting for a benchmark from you :) To start with, as-is; but we'll probably change some details (like the search part or the coefficients) based on the benchmark, of course (not by picking parameters randomly).

sharanry commented 6 years ago

@menshikh-iv I have benchmarked the time, and I am working on the memory benchmarking. Here is the link to the notebook: https://colab.research.google.com/drive/1p9UhVrCZFmIxiqyvOXEzGpqXayeblupl

menshikh-iv commented 6 years ago

@sharanry no need to benchmark Magnitude now (that isn't our goal).

First, you should implement a very similar technique (algorithm) and check that it works correctly (tests). Second, compare it by time against the Magnitude implementation, and optimize your implementation until its performance is good enough.

sharanry commented 6 years ago

@menshikh-iv Are we trying to use this algorithm directly on the vocabulary file, instead of first making a .magnitude file? The bottleneck in Magnitude, from my understanding, is mainly the SQL queries.

menshikh-iv commented 6 years ago

@sharanry yes, we want to try to add this out-of-vocab feature to https://github.com/RaRe-Technologies/gensim/blob/122dad657688b51f0176a81a20bd1fa6d0986b8b/gensim/models/keyedvectors.py#L1051 (without the .magnitude SQL backend).
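
In other words, something like the following would eventually work. This is a sketch of the intended behaviour, not an existing API, and 'vectors.bin' is a placeholder path:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

vec = kv['cats']   # in-vocab lookup already works today

# with the proposed feature, an OOV key would return a vector constructed
# from its character ngrams; currently this raises a KeyError instead
vec = kv['catsq']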

sharanry commented 6 years ago

@menshikh-iv So basically the input is an out-of-vocab key and the output is the out-of-vocab vector?

The latter can then be used for other features like similarity, most_similar, etc.?

menshikh-iv commented 6 years ago

@sharanry exactly

jtlz2 commented 6 years ago

@menshikh-iv This is exactly what I have been searching for today; it would be amazing. Any idea of timescales? :)

menshikh-iv commented 6 years ago

@jtlz2 no, sorry :(