piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Error “too many values to unpack” when trying to get similiraties in Gensim using LDA model #2644

Open brandatastudio opened 4 years ago

brandatastudio commented 4 years ago

Problem description

I'm trying to use a trained LDA model to compare similarity between the model's documents, stored in corpus, and new documents unseen by the model.

Steps/code/corpus to reproduce

I'm using an Anaconda environment with Python 3.7 and gensim 3.8.0. I have my data in a dataframe that I separated into a test set and a training set; they both have this structure:

X_test and Xtrain dataframe format :

 id                                            alltext  
1710  3264537  [exmodelo, karen, mcdougal, asegura, mantuvo, ...   
8211  3272079  [grupo, socialista, pionero, supone, apoyar, n...   
1885  3263933  [parte, entrenador, zaragoza, javier, aguirre,...   
2481  3263744  [fans, hielo, fuego, saga, literaria, dio, pie...   
2975  3265302  [actividad, busca, repetir, tres, ediciones, a... 

already preprocessed.

This is the code I use for creating my model

id2word = corpora.Dictionary(X_train["alltext"])   
texts = X_train["alltext"]
corpus = [id2word.doc2bow(text) for text in texts]

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
    id2word=id2word,
    num_topics=20,
    random_state=100, 
    update_every=1, 
    chunksize=400, 
    passes=10, 
    alpha='auto',
    per_word_topics=True)

Until here, everything works fine. I can effectively use

pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

to get my topics.

The problem comes, when I try to compare similarity between a new document and the corpus. Here is the code I'm using

newddoc = X_test["alltext"][2730] #I get a particular instance of the test_set
new_doc_freq_vector = id2word.doc2bow(newddoc)  #vectorize its list of words
model_vec= lda_model[new_doc_freq_vector] #run the trained model on it
index = similarities.MatrixSimilarity(lda_model[corpus]) # error
sims = index[model_vec] #error

In the last two lines, I get this error:

-------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-110-352248c464f8> in <module>
      4 
      5 #index = Similarity('model/indexes/similarity_index_01', lda_model[corpus], num_features=len(id2word)) #the first argument, the place where the
----> 6 index = similarities.MatrixSimilarity(lda_model[corpus]) # works if we use just corpus instead of lda_model[corpus]
      7 index = similarities.MatrixSimilarity(model_vec)
      8 #sims = index[model_vec] #works if we use index[new_doc_freq_vector] instead of model_vec

~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\similarities\docsim.py in __init__(self, corpus, num_best, dtype, num_features, chunksize, corpus_len)
    776                 "scanning corpus to determine the number of features (consider setting `num_features` explicitly)"
    777             )
--> 778             num_features = 1 + utils.get_max_id(corpus)
    779 
    780         self.num_features = num_features

~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\utils.py in get_max_id(corpus)
    734     for document in corpus:
    735         if document:
--> 736             maxid = max(maxid, max(fieldid for fieldid, _ in document))
    737     return maxid
    738 

~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\utils.py in <genexpr>(.0)
    734     for document in corpus:
    735         if document:
--> 736             maxid = max(maxid, max(fieldid for fieldid, _ in document))
    737     return maxid
    738 

ValueError: too many values to unpack (expected 2)

Versions

Anaconda environment, Python 3.7, gensim 3.8.0. Output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
brandatastudio commented 4 years ago

Things I have tried to solve this:

1) Using

Similarity('model/indexes/similarity_index_01', lda_model[corpus], num_features=len(id2word)).

But it did not work; I got the same error.

2) If I replace lda_model[corpus] with corpus, and index[model_vec] with index[new_doc_freq_vector], similarities.MatrixSimilarity() works. But I believe it does not give the proper result, because the model information is not in there. The fact that it works tells me this has something to do with data types (?). If I print lda_model[corpus] I get

<gensim.interfaces.TransformedCorpus object at 0x00000221ECA8E148>, but I have no idea what this means.

piskvorky commented 4 years ago

Can you show the output of:

print(texts[:10])
print(corpus[:10])
print(newddoc)
print(model_vec)
print(index)

Pandas has a pretty bizarre indexing and iterating/slicing system, with many un-Pythonic gotchas, so I'm not sure what your code is actually doing.

In general it's always a good idea to:

  1. post your logs at INFO level, and
  2. eyeball samples of data going in / coming out of your pipeline at various points (the log will show some of that too).
CelesteM commented 4 years ago

Have you resolved the problem? I'm facing something similar here.

mpenkov commented 4 years ago

@brandatastudio Ping. Could you please provide the info requested in this comment?

brandatastudio commented 4 years ago

Hello everyone, sorry for the delay; I have been so immersed in the project that I haven't found time earlier. Yes, I think I found the cause, though I have not yet developed a way to overcome it. I answered my own Stack Overflow post, https://stackoverflow.com/questions/58522356/error-too-many-values-to-unpack-when-trying-to-get-similiraties-in-gensim-usin/58566190#58566190, with the help of other Stack Overflow users. Basically, you need to transform the output of the LDA model before applying similarity, because its output format is different from LSI's (check the post for more details). That's my hypothesis about the origin of the problem (if someone else has more insight or better-validated conclusions, please feel free to post or correct me). As for the solution, I don't have one implemented yet, but when I eventually tackle the problem I will start my search from here: https://www.kaggle.com/ktattan/lda-and-document-similarity. Right now I am focused on other aspects of my recommender system using LSI, but I will eventually try to build one with LDA; when I do, I will be sure to update here. Hope this helps; I don't have much more to offer at the moment.

piskvorky commented 4 years ago

You need to transform the output of the lda before applying similarity, because the output of the function is different from lsi's

The output of LDA is in the exact same format as LSI: Gensim's sparse vector format.

If you're seeing something different (and you're reasonably certain it's not some user error on your side), please open a bug report.

brandatastudio commented 4 years ago

You need to transform the output of the lda before applying similarity, because the output of the function is different from lsi's

The output of LDA is in the exact same format as LSI: Gensim's sparse vector format.

If you're seeing something different (and you're reasonably certain it's not some user error on your side), please open a bug report.

For a vectorized document like this one: [screenshot: vectorizedlsi]

when I apply a trained LDA model, the output is this:

[screenshot: outputlda]

and when I apply a trained LSI model, the output is this:

[screenshot: outputlsi]

Both use the same number of topics, the same corpus, and the same vectorized input document. I thought this was justified because the mathematical output of each should be different, as explained on page 4 of this paper, which describes the processing applied to each model's output to perform recommendation: http://ceur-ws.org/Vol-1815/paper4.pdf.

If you are telling me that these lines of code should have the same output, which is then used to calculate similarity like this:

[screenshot: similarity calculus]

then, obviously, there is a bug, because the LDA output format is not the same as LSI's.

Just for completeness, here is the code used for training the LSI and LDA models respectively.

lsi_model = gensim.models.LsiModel(corpus=vectorized_corpus, id2word=id2word, num_topics=20, chunksize=400, power_iters = 10)

lda_model = gensim.models.ldamodel.LdaModel(corpus=vectorized_corpus,
                                           id2word=id2word, 
                                           num_topics=20,  
                                           random_state=100, 
                                           update_every=1, 
                                           chunksize=400,
                                           passes=10, 
                                           alpha='auto', 
                                           per_word_topics=True)

Please confirm that you are sure their output should be the same format. If that is the case, I need to report a bug, or share my code with someone who can determine for sure that there is no user error on my part, although I think I already shared it in this question.

piskvorky commented 4 years ago

You're explicitly requesting per_word_topics=True, which changes the output format. By default, LSI and LDA have the same output format.

See also the documentation at https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics

Although that docstring shows two contradictory return value types for per_word_topics=True – which one is correct? And the listing by @brandatastudio above actually matches neither. @brandatastudio can you open a PR to fix that docstring? CC @mpenkov @menshikh-iv .

brandatastudio commented 4 years ago

You're explicitly requesting per_word_topics=True, which changes the output format. By default, LSI and LDA have the same output format.

See also the documentation at https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics

Although that docstring shows two contradictory return value types for per_word_topics=True – which one is correct? And the listing by @brandatastudio above actually matches neither. CC @mpenkov @menshikh-iv .

I don't understand what you mean by "Although that docstring shows two contradictory return value types for per_word_topics=True – which one is correct? And the listing by @brandatastudio above actually matches neither."

I will try per_word_topics=False, then calculate similarity, and get back to you; hopefully that solves this entirely.

piskvorky commented 4 years ago

Don't just try parameters at random. Choose the parameters that match your goal. Why did you set the (non-default) per_word_topics=True in the first place?

In any case, a PR fixing the docstring will be welcome.

brandatastudio commented 4 years ago

Sorry, I closed it by accident. I need to revisit this problem, but not right now; I'm busy and will get back to you ASAP. PS: what do you mean by the docstring?

piskvorky commented 4 years ago

I mean the docstring of https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics shows two contradictory return values for the case when per_word_topics=True, neither of which matches your output. So the docstring should be improved: made correct and clear.

brandatastudio commented 4 years ago

I mean the docstring of https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics shows two contradictory return values for the case when per_word_topics=True, neither of which matches your output. So the docstring should be improved: made correct and clear.

So if I understand correctly, you are not asking me to fix it? I'm not sure whether you are talking to me there or not.

The reason I chose True was that this was my first experiment with the library; I had never used it before. Honestly, I don't think the documentation explains how that argument affects the output and the similarity calculation (in the documentation, similarity is always demonstrated with LSI as the example model; I have never seen an example of similarity calculation with gensim LDA). I used True because I thought the argument just added more information to the model, but I was never truly sure what it did (I don't see a clear explanation of the argument anywhere; the one at https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics is not dumbed down enough for me, I guess).

I did not fully understand it and still don't, so I thought I would just try it out. Seeing that the experiment failed, I posted this question here and on Stack Overflow to get feedback from people who know better than me, because I thought they would know whether that sort of thing was the problem. Sorry if it seems like random testing.

Seeing how you mentioned that argument affecting the format: could it be that, instead of giving me the topic probabilities for a document as LSI did, that argument causes the model to output the topic probability of each word in the document, and that is why the output is different? If so, using False should solve the problem, right? If not, an explanation of what the argument does would indeed be helpful, both for me and for the documentation in general.

PS: I will soon provide the output requested before; I thought the problem was simpler, so I did not prioritize it. Sorry for the delay.

Can you show the output of:

print(texts[:10])
print(corpus[:10])
print(newddoc)
print(model_vec)
print(index)

Pandas has a pretty bizarre indexing and iterating/slicing system, with many un-Pythonic gotchas, so I'm not sure what your code is actually doing.

In general it's always a good idea to:

  1. post your logs at INFO level, and
  2. eyeball samples of data going in / coming out of your pipeline at various points (the log will show some of that too).
brandatastudio commented 4 years ago

Output requested:

texts output: [screenshot: texts output]

Vectorized corpus (corpus in the original question; I have renamed things in my code since then): [screenshot: vectorizedcorpusoutput]

here showing up to three objects: [screenshot: vectorized corpus to three]

newddoc output: [screenshot: newdoc output]

new_doc_freq_vector: [screenshot: new_doc_frq_vector_output]

and new_model_vec (model_vec in the original question): [screenshot: outputlda]

index gives me no output; it raises the error mentioned in the original question.

brandatastudio commented 4 years ago

Looks like that was the cause of the problem: similarity was effectively calculated using LDA after changing that argument to False.

Here is what the code looks like now:

# Build LDA model -- per_word_topics is no longer set, so it defaults to False
lda_model = gensim.models.ldamodel.LdaModel(corpus=vectorized_corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=400,
                                            passes=10,
                                            alpha='auto')

new_doc_freq_vector again: [screenshot: output_solution]

the output for new_model_vec again: [screenshot: new_model_vec output]

the index output now: [screenshot: index output]

and the similarity calculation output: [screenshot: sim and sims sorted output]

Thank you for your help.

piskvorky commented 4 years ago

No problem.

so if I understand correctly, you are not asking me to fix it. Not sure if you are talking to me there or not.

Yeah I meant for you to fix the docstring for LdaModel.get_document_topics, if you can. The current phrasing is too opaque, and fixing the docstring should be fairly trivial. It would help others looking at the documentation in the future.

mpenkov commented 4 years ago

@piskvorky Sounds like the most important part of this issue is:

I mean the docstring of https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics shows two contradictory return values for the case when per_word_topics=True, neither of which matches your output. So the docstring should be improved: made correct and clear.

Right? Can we gloss over everything else?

piskvorky commented 4 years ago

Yes, the documentation is weird and confusing.