piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Strange embedding from FastText #2659

Open quoctinphan opened 4 years ago

quoctinphan commented 4 years ago

I am struggling to understand FastText word embeddings. According to the paper Enriching Word Vectors with Subword Information, the embedding of a word is the mean (or sum) of the embeddings of its subwords.

I failed to verify this. On common_texts imported from gensim.test.utils, the embedding of user is [-0.03062156 -0.02879291 -0.01737508 -0.02839565]. The mean of the embeddings of ['<us', 'use', 'ser', 'er>'] (setting min_n=max_n=3) is [-0.047664 -0.01677518 0.02312234 0.03452689]. The sum of the embeddings also results in a different vector.

Is this a mismatch between the Gensim implementation and the original FastText, or am I missing something?
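For reference, this is the composition rule as I read it in the paper, where the n-gram set $\mathcal{G}_w$ of a word $w$ also contains the whole word <w> itself as one element:

$$v_w = \sum_{g \in \mathcal{G}_w} z_g$$

(The paper uses the sum; taking the mean only rescales it by $1/|\mathcal{G}_w|$, so either way the result should be a multiple of the same vector.)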

Below is my code:

import numpy as np
from gensim.models import FastText
from gensim.models.utils_any2vec import compute_ngrams
from gensim.test.utils import common_texts

# min_n/max_n are model parameters, so they go in the constructor, not in train()
model = FastText(size=4, window=3, min_count=1, min_n=3, max_n=3)
model.build_vocab(sentences=common_texts)
model.train(sentences=common_texts, total_examples=len(common_texts), epochs=10)

print('survey' in model.wv.vocab)
print('ser' in model.wv.vocab)
print('ree' in model.wv.vocab)
ngrams = compute_ngrams('user', 3, 3)
print('num vector of "user": ', model.wv['user'])
print('ngrams of "user": ', ngrams)
print('mean of num vectors of {}: \n{}'.format(ngrams, np.mean([model.wv[c] for c in ngrams], axis=0)))
arti32lehtonen commented 4 years ago

I am not a developer, but I think the behavior is different for known and unknown words.

"user" token was in the training corpus and its word vector was cached. You need to check some tokens that were not presented in the corpus, for example "users".

Also, hashing is used to compute the ngram indices. You need to write something like this to get a fully compatible result:

import numpy as np
from gensim.models.utils_any2vec import ft_ngram_hashes

print('num vector of "users": ', model.wv['users'])

# hash each ngram of "users" to its index in wv.vectors_ngrams (same hashing as at training time)
ngram_hashes = ft_ngram_hashes('users', model.min_n, model.max_n,
                               model.bucket, model.wv.compatible_hash)
print('ngram hashes of "users": ', ngram_hashes)
print(np.mean([model.wv.vectors_ngrams[i] for i in ngram_hashes], axis=0))
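
To make the known-word case concrete: as far as I understand the gensim 3.x internals (the attribute names below come from that version and may differ in others), the cached vector of an in-vocabulary word is the average of its whole-word vector stored in wv.vectors_vocab and the vectors of its hashed ngrams. A sketch along these lines should come close to model.wv['user']:

import numpy as np
from gensim.models.utils_any2vec import ft_ngram_hashes

word = 'user'
# hashed ngram indices into wv.vectors_ngrams (same hashing as at training time)
hashes = ft_ngram_hashes(word, model.min_n, model.max_n,
                         model.bucket, model.wv.compatible_hash)

# assumption: the whole-word vector of an in-vocabulary word lives in wv.vectors_vocab
word_index = model.wv.vocab[word].index
parts = [model.wv.vectors_vocab[word_index]]
# plus one vector per hashed ngram
parts.extend(model.wv.vectors_ngrams[h] for h in hashes)

print('cached vector:       ', model.wv[word])
print('reconstructed vector:', np.mean(parts, axis=0))

If the two still differ, my understanding of the averaging is probably off, but the main point stands: the lookup for a known word is not just the mean of its ngram vectors.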
currylym commented 4 years ago


Good solution! I have another question: if a subword is not in FastText's subword vocabulary, how can I figure that out?