quoctinphan opened this issue 4 years ago
I am not a developer, but I think the behavior differs for known and unknown words.
The "user" token was in the training corpus, so its word vector was cached. You need to check a token that was not present in the corpus, for example "users".
Also, hashing is used to compute the ngram indexes. You need to write something like this to get a fully compatible result:
import numpy as np
from gensim.models.utils_any2vec import ft_ngram_hashes  # gensim 3.x location

print('vector of "users": ', model.wv['users'])
ngram_hashes = ft_ngram_hashes('users', model.min_n, model.max_n,
                               model.bucket, model.wv.compatible_hash)
print('ngram hashes of "users": ', ngram_hashes)
print(np.mean([model.wv.vectors_ngrams[i] for i in ngram_hashes], axis=0))
Good solution! There is another question: if one subword is not in the subword vocab of fastText, how can I figure that out?
I am struggling to understand FastText word embeddings. According to the paper Enriching Word Vectors with Subword Information, the embedding of a word is the mean (or sum) of the embeddings of its subwords.
I failed to verify this. On common_text imported from gensim.test.utils, the embedding of user is [-0.03062156 -0.02879291 -0.01737508 -0.02839565]. The mean of the embeddings of ['<us', 'use', 'ser', 'er>'] (setting min_n=max_n=3) is [-0.047664 -0.01677518 0.02312234 0.03452689]. Summing the embeddings also yields a different vector. Is this a mismatch between the Gensim implementation and the original FastText, or am I missing something?
Below is my code: