piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

AssertionError: sparse documents must not contain any explicit zero entries and the similarity matrix S must satisfy x^T * S * x > 0 for any nonzero bag-of-words vector x. #2105

Closed DennisCologne closed 5 years ago

DennisCologne commented 6 years ago

Hello there,

Maybe you can help me out with this real quick. I cannot run any of your examples: neither the one from https://radimrehurek.com/gensim/similarities/docsim.html nor the one from this repo. All of them raise the following assertion error.

AssertionError: sparse documents must not contain any explicit zero entries and the similarity matrix S must satisfy x^T * S * x > 0 for any nonzero bag-of-words vector x.

This is not working (other similarity measures of this module work fine):

from gensim.test.utils import common_texts
from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from gensim.similarities import SoftCosineSimilarity

model = Word2Vec(common_texts, size=20, min_count=1)  # train word-vectors
dictionary = Dictionary(common_texts)
bow_corpus = [dictionary.doc2bow(document) for document in common_texts]

similarity_matrix = model.wv.similarity_matrix(dictionary)  # construct similarity matrix
index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)

# Make a query.
query = 'graph trees computer'.split()
# calculate similarity between query and each doc from bow_corpus
sims = index[dictionary.doc2bow(query)]

Neither does this one from the repo (I followed all the previous steps):

similarity = softcossim(sentence_obama, sentence_orange, similarity_matrix)
print('similarity = %.4f' % similarity)

Thanks in advance. I have been trying to run this for two days now, but nothing works.

Best, Dennis

piskvorky commented 6 years ago

@Witiko can you have a look?

Witiko commented 6 years ago

Hey @DennisCologne,

sorry to say I am the author of the code that gives you trouble. What Gensim and Python versions are you using? I can run the above code without issue with the PyPI version of Gensim (3.4.0) and Python 3.5.

>>> sims
[(6, 0.8305764039419705),
 (7, 0.7257781024707816),
 (5, 0.5584027708699971),
 (0, 0.43455470767273646),
 (8, 0.4082457402348116),
 (1, 0.3028528215099456),
 (3, 0.09251811314306692),
 (4, 0.07636744554253587),
 (2, 0.04509321490371689)]

DennisCologne commented 6 years ago

Hi @Witiko,

thank you for your answer.

Actually, I am using Python 2.7.14 with Gensim 3.4.0. After further investigation, I found that the matrix-vector multiplication returns a negative value even though all of the values in both the matrix and the vector are positive.

But you are right, I just tried it in my Python 3.6 environment and there it works fine. I guess I will use this environment then. But this problem might still be interesting for you.

Thanks again for the quick reply.

Best, Dennis

Witiko commented 6 years ago

Hey @DennisCologne,

this is definitely interesting, but I can't seem to reproduce your problem even with Python 2.7 and Gensim 3.4.0. Can you find a pair of document vectors vec1, and vec2 that trigger the issue, call softcossim(vec1, vec2), and share what the content of vec1, vec2, dense_matrix, vec1len, and vec2len is just before the failing assertion?
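
For readers following along, the condition named in the assertion can be checked by hand on small inputs. The sketch below is plain Python, not Gensim's implementation; the helper names `quadratic_form` and `soft_cosine` and the toy matrix values are assumptions made for illustration. It computes x^T * S * y for bag-of-words vectors given as lists of (term_id, weight) pairs and refuses to normalize when x^T * S * x is not positive.

```python
import math


def quadratic_form(vec1, vec2, matrix):
    """Return vec1^T * S * vec2 for sparse vectors given as (term_id, weight) pairs."""
    return sum(w1 * matrix[i][j] * w2 for i, w1 in vec1 for j, w2 in vec2)


def soft_cosine(vec1, vec2, matrix):
    """Soft cosine similarity: x^T S y / (sqrt(x^T S x) * sqrt(y^T S y))."""
    vec1len = quadratic_form(vec1, vec1, matrix)
    vec2len = quadratic_form(vec2, vec2, matrix)
    # Precondition described by the assertion message: both values must be positive.
    if vec1len <= 0.0 or vec2len <= 0.0:
        raise ValueError("x^T * S * x must be positive for every nonzero vector x")
    return quadratic_form(vec1, vec2, matrix) / (math.sqrt(vec1len) * math.sqrt(vec2len))


# Toy 3x3 term similarity matrix (hypothetical values): terms 0 and 1 are related.
S = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
doc_a = [(0, 1.0)]  # contains only term 0
doc_b = [(1, 1.0)]  # contains only term 1
doc_c = [(2, 2.0)]  # contains only term 2

sim_ab = soft_cosine(doc_a, doc_b, S)  # related terms give a nonzero similarity
sim_ac = soft_cosine(doc_a, doc_c, S)  # unrelated terms give zero
```

Printing the two intermediate values for a failing pair of documents shows directly which vector violates the condition.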

menshikh-iv commented 6 years ago

ping @DennisCologne, please provide the information needed to reproduce the error (as requested in https://github.com/RaRe-Technologies/gensim/issues/2105#issuecomment-401103021)

menshikh-iv commented 6 years ago

ping @DennisCologne

tvrbanec commented 5 years ago

I have a similar issue with SoftCosineSimilarity. Please see https://groups.google.com/forum/#!topic/gensim/WVTRdZONtrc (Python 2.7, Gensim 3.7).

piskvorky commented 5 years ago

ping @Witiko

Witiko commented 5 years ago

I fail to see how this is related to the current issue, which should have been long closed due to the original poster's inactivity and the migration of the related code in Gensim 3.7.

tvrbanec commented 5 years ago

Assertion Error + SoftCosineSimilarity = Not related? I will post the full code if you are willing to try to resolve the issue. Do you prefer that I open a new issue?

Witiko commented 5 years ago

The assertion error in this issue is supposed to come from the code in the pre-3.7 softcossim method, which used to reside in gensim.matutils and has since moved to the gensim.similarities.termsim module. Your issue is with the gensim.models.keyedvectors module.

tvrbanec commented 5 years ago

def softcosinesim(texts):
    model = Word2Vec(texts, size=20, min_count=1)  # train word-vectors
    termsim_index = WordEmbeddingSimilarityIndex(model)
    dictionary = Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(document) for document in texts]
    similarity_matrix = TermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix
    docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
    sims = docsim_index[bow_corpus]  # calculate similarity of query to each doc from bow_corpus
    return sims

Traceback (most recent call last):
    termsim_index = WordEmbeddingSimilarityIndex(model)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 1389, in __init__
    assert isinstance(keyedvectors, WordEmbeddingsKeyedVectors)
AssertionError

What is wrong with this code that SoftCosineSimilarity doesn't like it? I tried to follow the tutorial...

Witiko commented 5 years ago

For some reason, your word embeddings do not have the WordEmbeddingsKeyedVectors type. What type do they have?

tvrbanec commented 5 years ago

I am using gensim Word2Vec to generate w2v_model.

Witiko commented 5 years ago

Your issue above can be resolved by calling WordEmbeddingSimilarityIndex(model.wv) instead of WordEmbeddingSimilarityIndex(model). I will update the code, so that it is more aware of the distinction between BaseAny2VecModel (model) and WordEmbeddingsKeyedVectors (model.wv).
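
The distinction can also be handled on the caller's side. This is a minimal sketch, assuming only that a full model exposes its vectors under a `wv` attribute, as `Word2Vec` does in the snippets in this thread; the helper name `as_keyed_vectors` is hypothetical and not part of Gensim, and the stand-in objects exist only so that the sketch runs without Gensim installed.

```python
from types import SimpleNamespace


def as_keyed_vectors(model_or_vectors):
    """Return the keyed vectors whether given a full model or the vectors themselves.

    Hypothetical helper: it relies only on the convention that a trained
    model stores its word vectors in a `wv` attribute.
    """
    return getattr(model_or_vectors, "wv", model_or_vectors)


# Stand-ins for a trained model and its vectors (not real Gensim objects).
fake_vectors = SimpleNamespace(name="keyed vectors")
fake_model = SimpleNamespace(wv=fake_vectors)

assert as_keyed_vectors(fake_model) is fake_vectors    # model -> model.wv
assert as_keyed_vectors(fake_vectors) is fake_vectors  # vectors pass through unchanged
```

With Gensim installed, the call from the comment above would then read `WordEmbeddingSimilarityIndex(as_keyed_vectors(model))`.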

Witiko commented 5 years ago

I cannot reproduce your other issue, i.e. model.wv.similarity_matrix throwing a TypeError:

>>> from gensim.corpora import Dictionary
>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.test.utils import common_texts
>>> 
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> dictionary = Dictionary(common_texts)
>>> model.wv.similarity_matrix(dictionary)
<12x12 sparse matrix of type '<type 'numpy.float32'>'
        with 68 stored elements in Compressed Sparse Column format>

Can you run the above code without issue?

tvrbanec commented 5 years ago

Can you run the above code without issue?

Yes, I can.

tvrbanec commented 5 years ago

Now, a few steps further, for: similarity_matrix = TermSimilarityMatrix(termsim_index, dictionary) I get: NameError: global name 'TermSimilarityMatrix' is not defined

Witiko commented 5 years ago

Please, try the following:

>>> from gensim.similarities import SparseTermSimilarityMatrix
>>> 
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
tvrbanec commented 5 years ago

    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary) # construct similarity matrix
  File "/usr/local/lib/python2.7/dist-packages/gensim/similarities/termsim.py", line 234, in __init__
    for term, similarity in index.most_similar(t1, num_rows)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 1401, in most_similar
    for t2, similarity in most_similar:
TypeError: 'numpy.float32' object is not iterable

tvrbanec commented 5 years ago

Maybe the problem is caused by terms like 'chemical_element' or 'cabinet_minister' that contain underscores?

Witiko commented 5 years ago

I cannot reproduce your issue with new embeddings:

>>> from gensim.corpora import Dictionary
>>> from gensim.models.keyedvectors import WordEmbeddingSimilarityIndex
>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.similarities import SparseTermSimilarityMatrix
>>> from gensim.test.utils import common_texts
>>> 
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> model.wv.most_similar(positive=['computer'], topn=2)
[('response', 0.38100379705429077), ('minors', 0.3752439618110657)]
>>> 
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv)
>>> dictionary = Dictionary(common_texts)
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
>>> similarity_matrix
<gensim.similarities.termsim.SparseTermSimilarityMatrix object at 0x7f822abc3d10>

Judging by the error message, model.wv.most_similar returns a number, not an iterable. Can you print the result of model.wv.most_similar(positive=['chemical_element'], topn=2), please?

tvrbanec commented 5 years ago

For common_texts, output is:

model.wv.most_similar(positive=['chemical_element'], topn=2)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-12-13a509737ea2> in <module>()
----> 1 model.wv.most_similar(positive=['chemical_element'], topn=2)

/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.pyc in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
    541                 mean.append(weight * word)
    542             else:
--> 543                 mean.append(weight * self.word_vec(word, use_norm=True))
    544                 if word in self.vocab:
    545                     all_words.add(self.vocab[word].index)

/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.pyc in word_vec(self, word, use_norm)
    462             return result
    463         else:
--> 464             raise KeyError("word '%s' not in vocabulary" % word)
    465 
    466     def get_vector(self, word):

KeyError: "word 'chemical_element' not in vocabulary"

Witiko commented 5 years ago

Can you please try with the embeddings that throw the TypeError: 'numpy.float32' object is not iterable exception? I understand that these should contain an embedding for the word chemical_element.

tvrbanec commented 5 years ago

With my text, it fails even earlier:

In [12]: similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-abfb8b1569f4> in <module>()
----> 1 similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

/usr/local/lib/python2.7/dist-packages/gensim/similarities/termsim.pyc in __init__(self, source, dictionary, tfidf, symmetric, positive_definite, nonzero_limit, dtype)
    232             most_similar = [
    233                 (dictionary.token2id[term], similarity)
--> 234                 for term, similarity in index.most_similar(t1, num_rows)
    235                 if term in dictionary.token2id]
    236 

/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.pyc in most_similar(self, t1, topn)
   1399         else:
   1400             most_similar = self.keyedvectors.most_similar(positive=[t1], topn=topn, **self.kwargs)
-> 1401             for t2, similarity in most_similar:
   1402                 if similarity > self.threshold:
   1403                     yield (t2, similarity**self.exponent)

TypeError: 'numpy.float32' object is not iterable

Witiko commented 5 years ago

As you can see on line 1400 in the error message above, SparseTermSimilarityMatrix calls model.wv.most_similar internally. According to the error message, the result of calling model.wv.most_similar is a float, not an iterable. This is highly suspect.

Therefore, can you please print the result of model.wv.most_similar(positive=['chemical_element'], topn=2) instead of calling the SparseTermSimilarityMatrix constructor? As you noted, there is no issue when you construct the model using common_texts, so this seems to be an issue with your embeddings.

tvrbanec commented 5 years ago

Thank you for your patience. :)

In [13]: model.wv.most_similar(positive=['chemical_element'], topn=2)
Out[13]: [('inhabitant', 0.93882817029953), ('the', 0.9326512813568115)]

Witiko commented 5 years ago

This seems pretty iterable to me.

tvrbanec commented 5 years ago

Does my text cause an error on your computer?

Witiko commented 5 years ago

Let's try to closely imitate the call on line 1400. Can you please print the result of the following:

>>> termsim_index.kwargs
>>> termsim_index.keyedvectors
>>> most_similar = termsim_index.keyedvectors.most_similar(positive=['chemical_element'], topn=100)
>>> most_similar
>>> type(most_similar)
>>> '__iter__' in most_similar
Witiko commented 5 years ago

My text does not make an error at your computer?

What is your text? Nevermind, I see it now.

tvrbanec commented 5 years ago

In [15]: termsim_index.kwargs
Out[15]: {}

In [16]: termsim_index.keyedvectors
Out[16]: <gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7f18fca001d0>

In [17]: most_similar = termsim_index.keyedvectors.most_similar(positive=['chemical_element'], topn=100)

In [18]: model.wv.most_similar(positive=['chemical_element'], topn=2)
Out[18]: [('inhabitant', 0.93882817029953), ('the', 0.9326512813568115)]

In [19]: most_similar
Out[19]: 
[('inhabitant', 0.93882817029953),
 ('the', 0.9326512813568115),
 ('give', 0.93118816614151),
 ('have', 0.9303354620933533),
 ('act', 0.928438663482666),
 ('one', 0.9224538803100586),
 ('to', 0.9192750453948975),
 ('china', 0.9141452312469482),
 ('associate_degree', 0.9119006395339966),
 ('with', 0.9078292846679688),
 ('be', 0.9045330286026001),
 ('statement', 0.898809552192688),
 ('which', 0.8987339735031128),
 ('vote', 0.89339280128479),
 ('time_period', 0.89242023229599),
 ('of', 0.8907227516174316),
 ('playing_card', 0.8895689249038696),
 ('first', 0.8888296484947205),
 ('oregon', 0.8866477608680725),
 ('merkel', 0.8860795497894287),
 ('person', 0.8851599097251892),
 ('from', 0.8846421241760254),
 ('in', 0.8816125988960266),
 ('and', 0.8810371160507202),
 ('this', 0.8801096081733704),
 ('make', 0.8777452707290649),
 ('meet', 0.8769802451133728),
 ('besides', 0.8752848505973816),
 ('angular_distance', 0.873124361038208),
 ('that', 0.8714672327041626),
 ('on', 0.8699379563331604),
 ('other', 0.8691580891609192),
 ('change', 0.8684202432632446),
 ('obama', 0.8667253851890564),
 ('communication', 0.8621015548706055),
 ('engineering', 0.8615524172782898),
 ('some', 0.8598195314407349),
 ('now', 0.8572754859924316),
 ('exchange', 0.8560868501663208),
 ('for', 0.8554658889770508),
 ('title', 0.8532639741897583),
 ('express', 0.8532208204269409),
 ('right', 0.8518909811973572),
 ('head_of_state', 0.847177267074585),
 ('free', 0.846038281917572),
 ('remove', 0.8458209037780762),
 ('germany', 0.8454596996307373),
 ('union', 0.8446109294891357),
 ('would', 0.8416316509246826),
 ('faculty', 0.8411930799484253),
 ('weekday', 0.8399801850318909),
 ('merely', 0.8379250764846802),
 ('we', 0.8371882438659668),
 ('political_unit', 0.8370255827903748),
 ('work', 0.8348655104637146),
 ('take', 0.8348475694656372),
 ('administrative_district', 0.8343826532363892),
 ('tpp', 0.833882749080658),
 ('administrator', 0.8318067789077759),
 ('united_nations_agency', 0.8316440582275391),
 ('washington', 0.8313312530517578),
 ('politician', 0.8289576768875122),
 ('legislature', 0.8287457227706909),
 ('plan_of_action', 0.8201491832733154),
 ('management', 0.8187181949615479),
 ('federal', 0.8167140483856201),
 ('new', 0.8154265880584717),
 ('travel', 0.8148607015609741),
 ('not', 0.8135936856269836),
 ('about', 0.8135201334953308),
 ('republican', 0.8131340742111206),
 ('him', 0.8047671318054199),
 ('by', 0.8038091659545898),
 ('associate', 0.8037841320037842),
 ('activity', 0.8029162287712097),
 ('structure', 0.8025172352790833),
 ('pacific', 0.799057126045227),
 ('point', 0.7987416982650757),
 ('more', 0.7969338893890381),
 ('message', 0.7965559959411621),
 ('organization', 0.7899693250656128),
 ('digit', 0.7889872789382935),
 ('connect', 0.7889586687088013),
 ('when', 0.7868154048919678),
 ('result', 0.7862980961799622),
 ('his', 0.7852383852005005),
 ('they', 0.783265233039856),
 ('schulz', 0.7814303636550903),
 ('group_action', 0.7772569060325623),
 ('european', 0.7769173979759216),
 ('large_integer', 0.775283932685852),
 ('under', 0.7743880748748779),
 ('inform', 0.771774172782898),
 ('mexico', 0.7684292793273926),
 ('against', 0.7668302059173584),
 ('steinmeier', 0.7626404762268066),
 ('supply', 0.7593228816986084),
 ('better', 0.7585717439651489),
 ('support', 0.7579919695854187),
 ('change_state', 0.7550258636474609)]

In [20]: type(most_similar)
Out[20]: list

In [21]: '__iter__' in most_similar
Out[21]: False
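
Note that the last check tests membership, not iterability: `'__iter__' in most_similar` asks whether the string `'__iter__'` is an element of the list, so it is `False` even though the list can be iterated. A short sketch of checks that do test iterability, using only the standard library (the helper name `is_iterable` is an assumption for illustration):

```python
def is_iterable(obj):
    """Return True if iter(obj) succeeds, which is what a for loop needs."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True


most_similar = [('inhabitant', 0.93882817029953), ('the', 0.9326512813568115)]

# Membership test: no element of the list equals the string '__iter__'.
membership = '__iter__' in most_similar
# Attribute test: lists define __iter__.
has_attribute = hasattr(most_similar, '__iter__')
# Behavioral tests: a list can be looped over, a bare float cannot.
list_ok = is_iterable(most_similar)
float_ok = is_iterable(0.93882817029953)
```

The output above therefore confirms that this particular call returned an ordinary list.
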
Witiko commented 5 years ago

I can reproduce this with your text and I am investigating.

Witiko commented 5 years ago

The issue is that the most_similar method returns weird results with topn=0:

>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.test.utils import common_texts
>>> 
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> model.wv.most_similar(positive=['computer'], topn=2)
[('response', 0.38100379705429077), ('minors', 0.3752439618110657)]
>>> model.wv.most_similar(positive=['computer'], topn=0)
array([-0.1180886 ,  0.32174808, -0.02938104, -0.21145007,  0.37524396,
       -0.23777878,  0.99999994, -0.01436211,  0.36708638, -0.09770551,
        0.05963777,  0.3810038 ], dtype=float32)

This is undocumented behavior, which can be fixed by removing lines 554 and 555 in keyedvectors.py. Sadly, I don't see how a caller can easily work around this without changing the package code. Once those lines are removed, you will get the expected result and, more importantly, SparseTermSimilarityMatrix should work.

>>> model.wv.most_similar(positive=['computer'], topn=0)
[]
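
To see why the raw array breaks the loop shown in the earlier traceback, here is a toy model of the behavior, not Gensim's code. The functions `toy_most_similar` and `unpack_pairs` and the scores are hypothetical; the first only mimics the reported pattern of returning bare similarity scores when `topn` is zero and (term, similarity) pairs otherwise.

```python
def toy_most_similar(similarities, topn):
    """Mimic the reported behavior: raw scores for topn=0, (term, score) pairs otherwise."""
    if not topn:
        return list(similarities.values())
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:topn]


def unpack_pairs(result):
    """Iterate the way the loop in the traceback does: expect (term, similarity) pairs."""
    return [(term, similarity) for term, similarity in result]


scores = {'response': 0.381, 'minors': 0.375, 'graph': -0.118}

pairs = unpack_pairs(toy_most_similar(scores, topn=2))

try:
    unpack_pairs(toy_most_similar(scores, topn=0))
    failed = False
except TypeError:
    # Each element is a bare float, so it cannot be unpacked into (term, similarity).
    failed = True
```

Returning an empty list for `topn=0`, as in the output just above, turns the same loop into a no-op instead of an error.
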
Witiko commented 5 years ago

The patches are now available in #2356. Thank you for your patience in helping discover the bug and sorry for the trouble. 😉

Vineet-Sharma29 commented 4 years ago

I have the following code:

model = KeyedVectors.load_word2vec_format('/home/vineet/Downloads/lemmatized-legal/no replacement/legal_lemmatized_no_replacement.bin', binary=True)

bow_corpus, doc_dict = corpora.MmCorpus('./bow_corpus.mm'), corpora.Dictionary.load('./doc_dict.dict')

# compute cosine similarity between word embeddings
termsim_index = WordEmbeddingSimilarityIndex(model)

# construct term similarity matrix
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, doc_dict)

It gives me the following error:

File "word2vec.py", line 25, in <module>
    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, doc_dict)
  File "/home/vineet/.local/lib/python3.6/site-packages/gensim/similarities/termsim.py", line 264, in __init__
    100.0 * matrix.getnnz() / matrix_order**2)
ZeroDivisionError: float division by zero

What are the probable reasons for this, and how can I resolve it?

Witiko commented 4 years ago

It seems as though your matrix_order is zero, which would indicate that your doc_dict dictionary is empty. Can you verify? We should check for this and raise a ValueError with a user-friendly message earlier in the constructor.
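
A sketch of the kind of early check described here, in plain Python. The function name `matrix_density_percent` is hypothetical; the expression inside it mirrors the one in the traceback above, and the guard is one possible shape of the proposed `ValueError`, not the code that Gensim ships.

```python
def matrix_density_percent(nonzero_count, matrix_order):
    """Return the share of nonzero entries in a square matrix, in percent."""
    if matrix_order == 0:
        # An empty dictionary yields a 0x0 matrix; fail early with a readable message.
        raise ValueError("the dictionary is empty, so the similarity matrix would have no rows")
    return 100.0 * nonzero_count / matrix_order**2


# 68 stored elements in a 12x12 matrix, as in the common_texts example earlier in the thread.
density = matrix_density_percent(68, 12)

try:
    matrix_density_percent(0, 0)
    message = None
except ValueError as error:
    message = str(error)
```

On the caller's side, printing `len(doc_dict)` before constructing the matrix confirms whether the dictionary is empty.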