piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Strange embedding from FastText #2659

Open quoctinphan opened 4 years ago

quoctinphan commented 4 years ago

I am struggling to understand FastText word embeddings. According to the paper Enriching Word Vectors with Subword Information, the embedding of a word is the mean (or sum) of the embeddings of its subwords.

I failed to verify this. On common_texts imported from gensim.test.utils, the embedding of user is [-0.03062156 -0.02879291 -0.01737508 -0.02839565]. The mean of the embeddings of ['<us', 'use', 'ser', 'er>'] (setting min_n=max_n=3) is [-0.047664 -0.01677518 0.02312234 0.03452689]. The sum of the embeddings also results in a different vector.

Is this a mismatch between the Gensim implementation and the original FastText, or am I missing something?
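For reference, this is the composition rule as I read it in the paper, where the n-gram set $\mathcal{G}_w$ of a word $w$ also contains the whole word <w> itself as one element:

$$v_w = \sum_{g \in \mathcal{G}_w} z_g$$

(The paper uses the sum; taking the mean only rescales it by $1/|\mathcal{G}_w|$, so either way the result should be a multiple of the same vector.)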

Below is my code:

import numpy as np
from gensim.models import FastText
from gensim.models.utils_any2vec import compute_ngrams
from gensim.test.utils import common_texts

# min_n/max_n are model parameters, so they go in the constructor, not in train()
model = FastText(size=4, window=3, min_count=1, min_n=3, max_n=3)
model.build_vocab(sentences=common_texts)
model.train(sentences=common_texts, total_examples=len(common_texts), epochs=10)

print('survey' in model.wv.vocab)
print('ser' in model.wv.vocab)
print('ree' in model.wv.vocab)
ngrams = compute_ngrams('user', 3, 3)
print('num vector of "user": ', model.wv['user'])
print('ngrams of "user": ', ngrams)
print('mean of num vectors of {}: \n{}'.format(ngrams, np.mean([model.wv[c] for c in ngrams], axis=0)))
arti32lehtonen commented 4 years ago

I am not a developer, but I think the behavior is different for known and unknown words.

"user" token was in the training corpus and its word vector was cached. You need to check some tokens that were not presented in the corpus, for example "users".

Also, hashing is used to compute the ngram indices. You need to write something like this to get a fully compatible result:

import numpy as np
from gensim.models.utils_any2vec import ft_ngram_hashes

print('num vector of "users": ', model.wv['users'])

# hash each ngram of "users" to its index in wv.vectors_ngrams (same hashing as at training time)
ngram_hashes = ft_ngram_hashes('users', model.min_n, model.max_n,
                               model.bucket, model.wv.compatible_hash)
print('ngram hashes of "users": ', ngram_hashes)
print(np.mean([model.wv.vectors_ngrams[i] for i in ngram_hashes], axis=0))
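
To make the known-word case concrete: as far as I understand the gensim 3.x internals (the attribute names below come from that version and may differ in others), the cached vector of an in-vocabulary word is the average of its whole-word vector stored in wv.vectors_vocab and the vectors of its hashed ngrams. A sketch along these lines should come close to model.wv['user']:

import numpy as np
from gensim.models.utils_any2vec import ft_ngram_hashes

word = 'user'
# hashed ngram indices into wv.vectors_ngrams (same hashing as at training time)
hashes = ft_ngram_hashes(word, model.min_n, model.max_n,
                         model.bucket, model.wv.compatible_hash)

# assumption: the whole-word vector of an in-vocabulary word lives in wv.vectors_vocab
word_index = model.wv.vocab[word].index
parts = [model.wv.vectors_vocab[word_index]]
# plus one vector per hashed ngram
parts.extend(model.wv.vectors_ngrams[h] for h in hashes)

print('cached vector:       ', model.wv[word])
print('reconstructed vector:', np.mean(parts, axis=0))

If the two still differ, my understanding of the averaging is probably off, but the main point stands: the lookup for a known word is not just the mean of its ngram vectors.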
currylym commented 4 years ago


Good solution! I have another question: if a subword is not in FastText's subword vocabulary, how can I figure that out?