Closed DennisCologne closed 5 years ago
@Witiko can you have a look?
Hey @DennisCologne,
sorry to say I am the author of the code that is giving you trouble. Which Gensim and Python versions are you using? I can run the above code without issue using the PyPI version of Gensim (3.4.0) and Python 3.5.
>>> sims
[(6, 0.8305764039419705),
(7, 0.7257781024707816),
(5, 0.5584027708699971),
(0, 0.43455470767273646),
(8, 0.4082457402348116),
(1, 0.3028528215099456),
(3, 0.09251811314306692),
(4, 0.07636744554253587),
(2, 0.04509321490371689)]
Hi @Witiko,
thank you for your answer.
Actually, it is Python 2.7.14 with Gensim 3.4.0. After further investigation, I found that the matrix-vector multiplication returns a negative value even though all of the values in both the matrix and the vector are positive.
But you are right, I just tried it in my Python 3.6 environment and it works fine there. I guess I will use that environment then. But this problem might still be interesting for you.
Thanks again for the quick reply.
Best, Dennis
Hey @DennisCologne,
this is definitely interesting, but I can't seem to reproduce your problem even with Python 2.7 and Gensim 3.4.0. Can you find a pair of document vectors vec1 and vec2 that trigger the issue, call softcossim(vec1, vec2), and share the content of vec1, vec2, dense_matrix, vec1len, and vec2len just before the failing assertion?
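For readers following along, here is a minimal sketch of the soft cosine measure under its usual definition, x^T S y / sqrt((x^T S x)(y^T S y)). It is not Gensim's implementation; the names vec1len and vec2len are borrowed from the question above and are assumed here to denote the quadratic forms x^T S x and y^T S y, and the helper names are hypothetical.

```python
import math


def quad_form(x, S, y):
    """Return x^T S y for dense Python lists (S is a list of rows)."""
    return sum(x[i] * S[i][j] * y[j]
               for i in range(len(x))
               for j in range(len(y)))


def soft_cosine(vec1, vec2, S):
    """Soft cosine similarity of two dense vectors under term-similarity matrix S."""
    vec1len = quad_form(vec1, S, vec1)  # assumed meaning of vec1len
    vec2len = quad_form(vec2, S, vec2)  # assumed meaning of vec2len
    if vec1len <= 0.0 or vec2len <= 0.0:
        raise ValueError("x^T S x must be positive for both vectors")
    return quad_form(vec1, S, vec2) / math.sqrt(vec1len * vec2len)


# Two terms that are 50% similar to each other.
S = [[1.0, 0.5],
     [0.5, 1.0]]
identical = soft_cosine([1.0, 0.0], [1.0, 0.0], S)
related = soft_cosine([1.0, 0.0], [0.0, 1.0], S)
```

With non-negative vectors and a non-negative S, every term in these sums is non-negative, which is why a negative intermediate value would be surprising.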
ping @DennisCologne, please provide the information needed to reproduce the error (as requested in https://github.com/RaRe-Technologies/gensim/issues/2105#issuecomment-401103021)
ping @DennisCologne
I have a similar issue with SoftCosineSimilarity. Please see https://groups.google.com/forum/#!topic/gensim/WVTRdZONtrc (Python 2.7, Gensim 3.7).
ping @Witiko
I fail to see how this is related to the current issue, which should have been long closed due to the original poster's inactivity and the migration of the related code in Gensim 3.7.
Assertion Error + SoftCosineSimilarity = Not related? I will post the full code if you are willing to help resolve the issue. Would you prefer that I open a new issue?
The assertion error in this issue is supposed to come from the code in the pre-3.7 softcossim method, which used to reside in gensim.matutils and has since moved to the gensim.similarities.termsim module. Your issue is with the gensim.models.keyedvectors module.
def softcosinesim(texts):
    model = Word2Vec(texts, size=20, min_count=1)  # train word-vectors
    termsim_index = WordEmbeddingSimilarityIndex(model)
    dictionary = Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(document) for document in texts]
    similarity_matrix = TermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix
    docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
    sims = docsim_index[bow_corpus]  # calculate similarity of query to each doc from bow_corpus
    return sims
Traceback (most recent call last):
    termsim_index = WordEmbeddingSimilarityIndex(model)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 1389, in __init__
    assert isinstance(keyedvectors, WordEmbeddingsKeyedVectors)
AssertionError
What is wrong with this code that SoftCosineSimilarity doesn't like it? I tried to follow the tutorial...
For some reason, your word embeddings do not have the WordEmbeddingsKeyedVectors type. What type do they have?
I am using gensim Word2Vec to generate w2v_model.
Your issue above can be resolved by calling WordEmbeddingSimilarityIndex(model.wv) instead of WordEmbeddingSimilarityIndex(model). I will update the code, so that it is more aware of the distinction between BaseAny2VecModel (model) and WordEmbeddingsKeyedVectors (model.wv).
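Until such an update lands, a caller could normalize the argument themselves. The sketch below is only an illustration: unwrap_keyed_vectors is a hypothetical helper, not part of Gensim, and the two stand-in classes exist only so that the snippet runs without Gensim installed. The only assumption about real models is the one stated above, namely that a trained model exposes its keyed vectors as the wv attribute.

```python
class KeyedVectorsStandIn:
    """Hypothetical stand-in for the object a model exposes as model.wv."""


class ModelStandIn:
    """Hypothetical stand-in for a trained model wrapping its vectors in .wv."""

    def __init__(self):
        self.wv = KeyedVectorsStandIn()


def unwrap_keyed_vectors(obj):
    """Return obj.wv if obj looks like a full model, otherwise obj itself."""
    return getattr(obj, "wv", obj)


model = ModelStandIn()
from_model = unwrap_keyed_vectors(model)       # full model -> its keyed vectors
from_vectors = unwrap_keyed_vectors(model.wv)  # keyed vectors -> unchanged
```

With Gensim installed, the result of unwrap_keyed_vectors(model) would be what gets passed to WordEmbeddingSimilarityIndex.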
I cannot reproduce your other issue, i.e. model.wv.similarity_matrix throwing a TypeError:
>>> from gensim.corpora import Dictionary
>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.test.utils import common_texts
>>>
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> dictionary = Dictionary(common_texts)
>>> model.wv.similarity_matrix(dictionary)
<12x12 sparse matrix of type '<type 'numpy.float32'>'
with 68 stored elements in Compressed Sparse Column format>
Can you run the above code without issue?
> Can you run the above code without issue?
Yes, I can.
Now, a few steps further, for:
similarity_matrix = TermSimilarityMatrix(termsim_index, dictionary)
I got:
NameError: global name 'TermSimilarityMatrix' is not defined
Please, try the following:
>>> from gensim.similarities import SparseTermSimilarityMatrix
>>>
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary) # construct similarity matrix
  File "/usr/local/lib/python2.7/dist-packages/gensim/similarities/termsim.py", line 234, in __init__
    for term, similarity in index.most_similar(t1, num_rows)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 1401, in most_similar
    for t2, similarity in most_similar:
TypeError: 'numpy.float32' object is not iterable
Maybe the problem is caused by terms that contain underscores, like 'chemical_element' or 'cabinet_minister'?
I cannot reproduce your issue with new embeddings:
>>> from gensim.corpora import Dictionary
>>> from gensim.models.keyedvectors import WordEmbeddingSimilarityIndex
>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.similarities import SparseTermSimilarityMatrix
>>> from gensim.test.utils import common_texts
>>>
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> model.wv.most_similar(positive=['computer'], topn=2)
[('response', 0.38100379705429077), ('minors', 0.3752439618110657)]
>>>
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv)
>>> dictionary = Dictionary(common_texts)
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
>>> similarity_matrix
<gensim.similarities.termsim.SparseTermSimilarityMatrix object at 0x7f822abc3d10>
Judging by the error message, model.wv.most_similar returns a number, not an iterable. Can you print the result of model.wv.most_similar(positive=['chemical_element'], topn=2), please?
For common_texts, output is:
model.wv.most_similar(positive=['chemical_element'], topn=2)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-12-13a509737ea2> in <module>()
----> 1 model.wv.most_similar(positive=['chemical_element'], topn=2)
/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.pyc in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
541 mean.append(weight * word)
542 else:
--> 543 mean.append(weight * self.word_vec(word, use_norm=True))
544 if word in self.vocab:
545 all_words.add(self.vocab[word].index)
/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.pyc in word_vec(self, word, use_norm)
462 return result
463 else:
--> 464 raise KeyError("word '%s' not in vocabulary" % word)
465
466 def get_vector(self, word):
KeyError: "word 'chemical_element' not in vocabulary"
Can you please try with the embeddings that throw the TypeError: 'numpy.float32' object is not iterable exception? I understand that these should contain an embedding for the word chemical_element.
For my text it stops even before:
In [12]: similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-abfb8b1569f4> in <module>()
----> 1 similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
/usr/local/lib/python2.7/dist-packages/gensim/similarities/termsim.pyc in __init__(self, source, dictionary, tfidf, symmetric, positive_definite, nonzero_limit, dtype)
232 most_similar = [
233 (dictionary.token2id[term], similarity)
--> 234 for term, similarity in index.most_similar(t1, num_rows)
235 if term in dictionary.token2id]
236
/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.pyc in most_similar(self, t1, topn)
1399 else:
1400 most_similar = self.keyedvectors.most_similar(positive=[t1], topn=topn, **self.kwargs)
-> 1401 for t2, similarity in most_similar:
1402 if similarity > self.threshold:
1403 yield (t2, similarity**self.exponent)
TypeError: 'numpy.float32' object is not iterable
As you can see on line 1400 in the error message above, SparseTermSimilarityMatrix calls model.wv.most_similar internally. According to the error message, the result of calling model.wv.most_similar is a float, not an iterable. This is highly suspect.
Therefore, can you please print the result of model.wv.most_similar(positive=['chemical_element'], topn=2) instead of calling the SparseTermSimilarityMatrix constructor? As you noted, there is no issue when you construct the model using common_texts, so this seems to be an issue with your embeddings.
Thank you for your patience. :)
In [13]: model.wv.most_similar(positive=['chemical_element'], topn=2)
Out[13]: [('inhabitant', 0.93882817029953), ('the', 0.9326512813568115)]
This seems pretty iterable to me.
Does my text cause an error on your computer?
Let's try to closely imitate the call on line 1400. Can you please print the result of the following:
>>> termsim_index.kwargs
>>> termsim_index.keyedvectors
>>> most_similar = termsim_index.keyedvectors.most_similar(positive=['chemical_element'], topn=100)
>>> most_similar
>>> type(most_similar)
>>> '__iter__' in most_similar
Does my text not cause an error on your computer?
What is your text? Nevermind, I see it now.
In [15]: termsim_index.kwargs
Out[15]: {}
In [16]: termsim_index.keyedvectors
Out[16]: <gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7f18fca001d0>
In [17]: most_similar = termsim_index.keyedvectors.most_similar(positive=['chemical_element'], topn=100)
In [18]: model.wv.most_similar(positive=['chemical_element'], topn=2)
Out[18]: [('inhabitant', 0.93882817029953), ('the', 0.9326512813568115)]
In [19]: most_similar
Out[19]:
[('inhabitant', 0.93882817029953),
('the', 0.9326512813568115),
('give', 0.93118816614151),
('have', 0.9303354620933533),
('act', 0.928438663482666),
('one', 0.9224538803100586),
('to', 0.9192750453948975),
('china', 0.9141452312469482),
('associate_degree', 0.9119006395339966),
('with', 0.9078292846679688),
('be', 0.9045330286026001),
('statement', 0.898809552192688),
('which', 0.8987339735031128),
('vote', 0.89339280128479),
('time_period', 0.89242023229599),
('of', 0.8907227516174316),
('playing_card', 0.8895689249038696),
('first', 0.8888296484947205),
('oregon', 0.8866477608680725),
('merkel', 0.8860795497894287),
('person', 0.8851599097251892),
('from', 0.8846421241760254),
('in', 0.8816125988960266),
('and', 0.8810371160507202),
('this', 0.8801096081733704),
('make', 0.8777452707290649),
('meet', 0.8769802451133728),
('besides', 0.8752848505973816),
('angular_distance', 0.873124361038208),
('that', 0.8714672327041626),
('on', 0.8699379563331604),
('other', 0.8691580891609192),
('change', 0.8684202432632446),
('obama', 0.8667253851890564),
('communication', 0.8621015548706055),
('engineering', 0.8615524172782898),
('some', 0.8598195314407349),
('now', 0.8572754859924316),
('exchange', 0.8560868501663208),
('for', 0.8554658889770508),
('title', 0.8532639741897583),
('express', 0.8532208204269409),
('right', 0.8518909811973572),
('head_of_state', 0.847177267074585),
('free', 0.846038281917572),
('remove', 0.8458209037780762),
('germany', 0.8454596996307373),
('union', 0.8446109294891357),
('would', 0.8416316509246826),
('faculty', 0.8411930799484253),
('weekday', 0.8399801850318909),
('merely', 0.8379250764846802),
('we', 0.8371882438659668),
('political_unit', 0.8370255827903748),
('work', 0.8348655104637146),
('take', 0.8348475694656372),
('administrative_district', 0.8343826532363892),
('tpp', 0.833882749080658),
('administrator', 0.8318067789077759),
('united_nations_agency', 0.8316440582275391),
('washington', 0.8313312530517578),
('politician', 0.8289576768875122),
('legislature', 0.8287457227706909),
('plan_of_action', 0.8201491832733154),
('management', 0.8187181949615479),
('federal', 0.8167140483856201),
('new', 0.8154265880584717),
('travel', 0.8148607015609741),
('not', 0.8135936856269836),
('about', 0.8135201334953308),
('republican', 0.8131340742111206),
('him', 0.8047671318054199),
('by', 0.8038091659545898),
('associate', 0.8037841320037842),
('activity', 0.8029162287712097),
('structure', 0.8025172352790833),
('pacific', 0.799057126045227),
('point', 0.7987416982650757),
('more', 0.7969338893890381),
('message', 0.7965559959411621),
('organization', 0.7899693250656128),
('digit', 0.7889872789382935),
('connect', 0.7889586687088013),
('when', 0.7868154048919678),
('result', 0.7862980961799622),
('his', 0.7852383852005005),
('they', 0.783265233039856),
('schulz', 0.7814303636550903),
('group_action', 0.7772569060325623),
('european', 0.7769173979759216),
('large_integer', 0.775283932685852),
('under', 0.7743880748748779),
('inform', 0.771774172782898),
('mexico', 0.7684292793273926),
('against', 0.7668302059173584),
('steinmeier', 0.7626404762268066),
('supply', 0.7593228816986084),
('better', 0.7585717439651489),
('support', 0.7579919695854187),
('change_state', 0.7550258636474609)]
In [20]: type(most_similar)
Out[20]: list
In [21]: '__iter__' in most_similar
Out[21]: False
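A side note on the last check: '__iter__' in most_similar tests whether the string '__iter__' is an element of the list, so False is expected here even though a list is iterable. A minimal sketch of checks that do test iterability, using the first two results from above as sample data (the variable names are illustrative only):

```python
from collections.abc import Iterable

most_similar = [('inhabitant', 0.93882817029953), ('the', 0.9326512813568115)]

# Membership test: is the string '__iter__' one of the list's elements?
membership = '__iter__' in most_similar

# Attribute test: does the object define the iteration protocol?
has_iter = hasattr(most_similar, '__iter__')

# Abstract base class test, the more idiomatic spelling.
is_iterable = isinstance(most_similar, Iterable)

# A bare float, like the value named in the TypeError, is not iterable.
float_is_iterable = isinstance(0.5, Iterable)
```

Under Python 2.7, as used in this thread, the class would be imported from collections rather than collections.abc.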
I can reproduce this with your text and I am investigating.
The issue is that the most_similar method returns weird results with topn=0:
>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.test.utils import common_texts
>>>
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> model.wv.most_similar(positive=['computer'], topn=2)
[('response', 0.38100379705429077), ('minors', 0.3752439618110657)]
>>> model.wv.most_similar(positive=['computer'], topn=0)
array([-0.1180886 , 0.32174808, -0.02938104, -0.21145007, 0.37524396,
-0.23777878, 0.99999994, -0.01436211, 0.36708638, -0.09770551,
0.05963777, 0.3810038 ], dtype=float32)
This is an undocumented behavior, which can be fixed by removing lines 554 and 555 in keyedvectors.py. Sadly, I don't see how a caller can easily patch this up without changing the package code. Afterwards, you will get the expected result and, more importantly, SparseTermSimilarityMatrix should now work.
>>> model.wv.most_similar(positive=['computer'], topn=0)
[]
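A small stand-in can illustrate why the unpatched return value breaks callers that expect (term, similarity) pairs. The function fake_most_similar below is hypothetical and only mimics the behavior shown above (a bare sequence of similarities when topn is 0); it is not Gensim's code, and the scores are sample values.

```python
def fake_most_similar(scores, topn):
    """Hypothetical stand-in: return raw similarities when topn is falsy."""
    if not topn:
        return list(scores.values())  # bare floats, no terms attached
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:topn]


def collect(results, threshold=0.0):
    """Consume results the way a caller expecting (term, similarity) pairs would."""
    return [(term, sim) for term, sim in results if sim > threshold]


scores = {'response': 0.381, 'minors': 0.375, 'graph': -0.118}

ok = collect(fake_most_similar(scores, topn=2))

try:
    collect(fake_most_similar(scores, topn=0))
    error = None
except TypeError as exc:  # a bare float cannot be unpacked into two names
    error = exc
```

Returning an empty list for topn=0, as in the patched output above, lets the same loop run without special cases.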
The patches are now available in #2356. Thank you for your patience in helping discover the bug and sorry for the trouble. 😉
I have the following code:
model = KeyedVectors.load_word2vec_format('/home/vineet/Downloads/lemmatized-legal/no replacement/legal_lemmatized_no_replacement.bin', binary=True)
bow_corpus, doc_dict = corpora.MmCorpus('./bow_corpus.mm'), corpora.Dictionary.load('./doc_dict.dict')
# compute cosine similarity between word embeddings
termsim_index = WordEmbeddingSimilarityIndex(model)
# construct term similarity matrix
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, doc_dict)
It gives me the following error:
  File "word2vec.py", line 25, in <module>
    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, doc_dict)
  File "/home/vineet/.local/lib/python3.6/site-packages/gensim/similarities/termsim.py", line 264, in __init__
    100.0 * matrix.getnnz() / matrix_order**2)
ZeroDivisionError: float division by zero
What are the probable reasons for this, and how can I resolve it?
It seems as though your matrix_order is zero, which would indicate that your doc_dict dictionary is empty. Can you verify?
We should check for this and raise a ValueError with a user-friendly message earlier in the constructor.
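A minimal sketch of what such an early check might look like, assuming only that the dictionary supports len(). The function names and the error message are hypothetical and not part of Gensim; a plain dict stands in for the dictionary so that the snippet runs on its own.

```python
def check_dictionary_not_empty(dictionary):
    """Raise a descriptive ValueError instead of a later ZeroDivisionError."""
    matrix_order = len(dictionary)
    if matrix_order == 0:
        raise ValueError(
            "the dictionary is empty, so no term similarity matrix can be built")
    return matrix_order


def nonzero_percentage(nnz, matrix_order):
    """Percentage of stored elements, the expression from the traceback above."""
    return 100.0 * nnz / matrix_order ** 2


order = check_dictionary_not_empty({'human': 0, 'computer': 1})
percentage = nonzero_percentage(nnz=2, matrix_order=order)

try:
    check_dictionary_not_empty({})
    message = None
except ValueError as exc:
    message = str(exc)
```

On the caller's side, printing len(doc_dict) right after loading it is a quick way to verify the diagnosis.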
Hello there,
Maybe you can help me out with this real quick. I cannot run any of your examples: neither the one from https://radimrehurek.com/gensim/similarities/docsim.html nor the one from this repo. All of them give me the following assertion error.
This is not working (other similarity measures of this module work fine):
Neither is this from the repo (I followed all previous steps):
Thanks in advance. I have been trying to run this for two days now, but nothing works.
Best, Dennis