Open brandatastudio opened 4 years ago
Things I have tried to solve this:
1) Using
Similarity('model/indexes/similarity_index_01', lda_model[corpus], num_features=len(id2word)).
But it did not work; the same error was raised.
2) If I replace lda_model[corpus] with corpus, and index[model_vec] with index[new_doc_freq_vector], similarities.MatrixSimilarity() works. But I believe it does not give the proper result, because the model's transformation is never applied. The fact that it works suggests the issue has something to do with data types(?). If I print lda_model[corpus] I get
<gensim.interfaces.TransformedCorpus object at 0x00000221ECA8E148> ; no idea what this means, though.
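(Editorial aside: the `TransformedCorpus` object is just a lazy wrapper. Applying a model to a whole corpus does not produce a list of vectors; it produces an object that transforms each document on the fly as you iterate over it, which is why `print` only shows the object's address. A rough pure-Python sketch of the idea, not gensim's actual implementation:

```python
class TransformedCorpus:
    """Lazy wrapper: applies a transformation to each document on iteration."""
    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.transform(doc)

# Toy "model": doubles every weight in a sparse (id, weight) vector.
double = lambda doc: [(i, 2 * w) for i, w in doc]
corpus = [[(0, 1.0), (2, 0.5)], [(1, 3.0)]]

wrapped = TransformedCorpus(double, corpus)
print(wrapped)        # prints something like <__main__.TransformedCorpus object at 0x...>
print(list(wrapped))  # materializes: [[(0, 2.0), (2, 1.0)], [(1, 6.0)]]
```

So to inspect what the model actually produces, iterate over it, e.g. `list(lda_model[corpus])` or a `for` loop.)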
Can you show the output of:
print(texts[:10])
print(corpus[:10])
print(newddoc)
print(model_vec)
print(index)
Pandas has a pretty bizarre indexing and iterating/slicing system, with many un-Pythonic gotchas, so I'm not sure what your code is actually doing.
In general it's always a good idea to:
- post your logs at INFO level, and
- eyeball samples of data going in / coming out of your pipeline at various points (the log will show some of that too).
Have you resolved the problem? I'm facing something similar here.
@brandatastudio Ping. Could you please provide the info requested in this comment?
Hello everyone, sorry for the delay; I've been so immersed in the project that I haven't had time until now. Yes, I think I found the cause, though I have not yet developed a way to overcome it. I answered this same question in my own Stack Overflow post, with the help of other users: https://stackoverflow.com/questions/58522356/error-too-many-values-to-unpack-when-trying-to-get-similiraties-in-gensim-usin/58566190#58566190

Basically, you need to transform the output of the LDA model before applying similarity, because its output format differs from LSI's (check the post for more details). That's my hypothesis about the origin of the problem (if someone has more insight or better-validated conclusions, please feel free to post or correct me). As for a solution, I don't have one implemented yet, but when I eventually tackle the problem I will start my search from here: https://www.kaggle.com/ktattan/lda-and-document-similarity

Right now I'm focused on other aspects of my recommender system using LSI, but I will eventually try to build one with LDA; when I do, I'll be sure to post an update here. Hope this helps; I don't have much more to offer at the moment.
You need to transform the output of the lda before applying similarity, because the output of the function is different from lsi's
The output of LDA is in the exact same format as LSI: Gensim's sparse vector format.
If you're seeing something different (and you're reasonably certain it's not some user error on your side), please open a bug report.
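(Editorial aside: gensim's sparse vector format is simply a list of `(id, weight)` tuples. To make concrete what a similarity index computes over such vectors, here is a hedged pure-Python sketch of cosine similarity between two sparse topic vectors; `MatrixSimilarity` does the equivalent over densified vectors:

```python
import math

def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors given as [(id, weight), ...]."""
    da, db = dict(a), dict(b)
    dot = sum(w * db.get(i, 0.0) for i, w in da.items())
    na = math.sqrt(sum(w * w for w in da.values()))
    nb = math.sqrt(sum(w * w for w in db.values()))
    return dot / (na * nb) if na and nb else 0.0

# Topic distributions in gensim's sparse format: [(topic_id, probability), ...]
doc_a = [(0, 0.8), (1, 0.2)]
doc_b = [(0, 0.6), (2, 0.4)]
print(sparse_cosine(doc_a, doc_a))  # ~1.0 (identical vectors)
print(sparse_cosine(doc_a, doc_b))
```

Both LSI and LDA emit this same `[(id, weight), ...]` shape by default, which is why the same indexing code works for either.)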
For a vectorized document like this one,
when you apply a trained LDA model the output is this,
and when you apply a trained LSI model the output is this,
both using the same number of topics for each of the trained models, the same corpus, and the same vectorized input document. I thought this was to be expected, since the mathematical output of each should differ, as explained on page 4 of this paper, which discusses the processing applied to each model's output to perform recommendation: http://ceur-ws.org/Vol-1815/paper4.pdf
If you are telling me that these lines of code should have the same output, to then calculate similarity like this,
then there is obviously a bug, because the LDA output does not match the LSI output in format.
Just for completeness, here is the code used for training the LSI and LDA models respectively:

```python
lsi_model = gensim.models.LsiModel(corpus=vectorized_corpus, id2word=id2word,
                                   num_topics=20, chunksize=400, power_iters=10)

lda_model = gensim.models.ldamodel.LdaModel(corpus=vectorized_corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=400,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
```
Please confirm that you are sure their output should be in the same format. If that is the case, I need to go ahead and report a bug, or share my code with someone who can determine for sure that there is no user error on my part, although I think I already shared it in this question's post.
You're explicitly requesting per_word_topics=True
, which changes the output format. By default, LSI and LDA have the same output format.
See also the documentation at https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics
Although that docstring shows two contradictory return value types for per_word_topics=True
– which one is correct? And the listing by @brandatastudio above actually matches neither. @brandatastudio can you open a PR to fix that docstring? CC @mpenkov @menshikh-iv .
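(Editorial aside: to illustrate the difference in shapes being discussed. By default the model yields a plain topic distribution in gensim's sparse vector format; with `per_word_topics=True`, `get_document_topics` instead returns a triple of document topics, per-word topics, and per-word phi values. The values below are illustrative mock data, not real model output:

```python
# Default output: gensim's sparse vector format, directly usable for similarity.
default_output = [(0, 0.7), (3, 0.3)]                         # [(topic_id, prob), ...]

# With per_word_topics=True: a 3-tuple (mock values, for illustration only).
doc_topics  = [(0, 0.7), (3, 0.3)]                            # same distribution as above
word_topics = [(12, [0, 3]), (45, [3])]                       # (word_id, [topic_ids])
word_phis   = [(12, [(0, 0.9), (3, 0.4)]), (45, [(3, 1.2)])]  # (word_id, [(topic, phi)])
pwt_output  = (doc_topics, word_topics, word_phis)

# Feeding the 3-tuple where a sparse vector is expected triggers errors like
# "too many values to unpack"; only the first element is the topic distribution.
topic_dist = pwt_output[0]
print(topic_dist == default_output)  # True
```

This would explain why the same indexing code works for LSI but fails for an LDA model trained with `per_word_topics=True`.)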
I don't understand what you mean by "Although that docstring shows two contradictory return value types for per_word_topics=True – which one is correct? And the listing by @brandatastudio above actually matches neither."
I will try using per_word_topics=False and then calculating similarity, and will get back to you; hopefully that solves this issue entirely.
Don't just try parameters at random. Choose the parameters that match your goal. Why did you set the (non-default) per_word_topics=True
in the first place?
In any case, a PR fixing the docstring will be welcome.
Sorry, I closed it by accident. I need to revisit this problem, but not right now; I'm busy and will get back to you ASAP. PS: what do you mean by "the docstring"?
I mean the docstring of https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics shows two contradictory return values for the case when per_word_topics=True
, neither of which matches your output. So the docstring should be improved: made correct and clear.
So if I understand correctly, you are not asking me to fix it? I'm not sure whether you are talking to me there or not.
The reason I set that value to True was because this was my first experiment with the library; I'd never used it before. Honestly, I don't think the documentation explains how that argument affects the output or the similarity-calculation process (in the documentation, similarity is always demonstrated with LSI as the example model; I have never seen an example of similarity calculation with LDA in gensim). I used True because I thought the argument just added more information to the model, but I was never truly sure what it did (I don't see a clear explanation of it anywhere; the one at https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics is not dumbed down enough for me, I guess).
I did not fully understand it, and still don't, so I thought I would just try it out. Seeing that the experiment failed, I posted this question here and on Stack Overflow to get feedback from people who know better than me, because I thought they would know if that sort of thing was the problem. Sorry if it seemed like random testing.
Seeing how you mentioned that argument affecting the format: could it be that, instead of giving me the topic probabilities for a document as LSI did, that argument causes the model to also output the topic probability of each of the words in the document, and that is why the output is different? If so, using False should solve the problem, right? If not, an explanation of what that argument does would indeed be helpful, both for me and for the documentation in general.
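(Editorial aside: if one did need `per_word_topics=True` for other purposes, a possible workaround, an assumption on my part rather than an official gensim recipe, is to strip the extra per-word information before indexing and keep only the topic distribution:

```python
# Hypothetical helper, not part of gensim: normalize model output to the
# [(topic_id, prob), ...] shape that a similarity index expects.
def to_topic_dist(model_output):
    # With per_word_topics=True the output is a 3-tuple
    # (doc_topics, word_topics, phi_values); plain output is already
    # a list of (topic_id, prob) tuples.
    if isinstance(model_output, tuple):
        return model_output[0]
    return model_output

print(to_topic_dist(([(0, 0.9), (1, 0.1)], [], [])))  # [(0, 0.9), (1, 0.1)]
print(to_topic_dist([(0, 0.9), (1, 0.1)]))            # unchanged
```

With such a helper, `index[to_topic_dist(lda_model[new_doc_bow])]` would work regardless of the `per_word_topics` setting.)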
PS: I will soon provide the output requested earlier; I thought the problem was simpler, so I didn't prioritize it. Sorry for the delay.
Can you show the output of:
print(texts[:10])
print(corpus[:10])
print(newddoc)
print(model_vec)
print(index)
Pandas has a pretty bizarre indexing and iterating/slicing system, with many un-Pythonic gotchas, so I'm not sure what your code is actually doing.
In general it's always a good idea to:
- post your logs at INFO level, and
- eyeball samples of data going in / coming out of your pipeline at various points (the log will show some of that too).
Output requested
texts output
Vectorized corpus output (corpus in the original question; I have renamed things in my code since then),
showing up to three objects here.
Newdoc output
new_doc_freq_vec
and new_model_vec (model_vec in the original question).
Index gives me no output; it raises the error mentioned in the original question.
Looks like that was the cause of the problem: similarity was effectively calculated with LDA after changing that argument to False.
Here is what the code looks like now:
```python
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=vectorized_corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=400,
                                            passes=10,
                                            alpha='auto')
```
The new_doc_freq_vec output again,
the new_model_vec output and the index output (which now works),
and the similarity calculation output.
Thank you for your help.
No problem.
so if I understand correctly, you are not asking me to fix it. Not sure if you are talking to me there or not.
Yeah I meant for you to fix the docstring for LdaModel.get_document_topics
, if you can. The current phrasing is too opaque, and fixing the docstring should be fairly trivial. It would help others looking at the documentation in the future.
@piskvorky Sounds like the most important part of this issue is:
I mean the docstring of https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics shows two contradictory return values for the case when
per_word_topics=True
, neither of which matches your output. So the docstring should be improved: made correct and clear.
Right? Can we gloss over everything else?
Yes, the documentation is weird and confusing.
Problem description
I'm trying to use a trained LDA model to compare similarity between the model's documents, stored in corpus, and new documents unseen by the model.
Steps/code/corpus to reproduce
I'm using an Anaconda environment with Python 3.7 and gensim 3.8.0. I have my data as a dataframe that I split into a test and a training set; they both have this structure:
X_test and X_train dataframe format:
already preprocessed.
This is the code I use for creating my model
Up to here, everything works fine. I can use
to get my topics.
The problem comes when I try to compare similarity between a new document and the corpus. Here is the code I'm using:
In the last two lines, I get this error:
Versions
Anaconda environment, Python 3.7, gensim 3.8.0.