Closed: BeenKim closed this issue 8 years ago.
Yep, that is expected behavior. Each time you call get_document_topics, it will infer that document's topic distribution again. It has no functionality for remembering the topic compositions of documents it has seen in the past. However, the results themselves should be similar if the trained model is good. I'd suggest building an index of the results (maybe using a dict or list), and referring to those as needed.
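A minimal sketch of that index idea, assuming a generic `infer` callable (a hypothetical stand-in for something like `lambda bow: lda.get_document_topics(bow)`):

```python
class TopicIndex:
    """Cache each document's inferred topic distribution so repeated
    lookups return the stored result instead of re-running (stochastic)
    inference. `infer` is any callable that maps a bag-of-words to topics."""

    def __init__(self, infer):
        self._infer = infer   # e.g. lambda bow: lda.get_document_topics(bow)
        self._cache = {}

    def topics(self, doc_id, bow):
        # Only run inference the first time we see this document.
        if doc_id not in self._cache:
            self._cache[doc_id] = self._infer(bow)
        return self._cache[doc_id]
```

The second call for the same `doc_id` returns the cached list unchanged, so lookups are deterministic even though the underlying inference is not.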
It's been suggested that we should look at ways of producing deterministic results (#447), but that might be a while off.
Hi @cscorley, could you please give a little more detail on why the results are different each time I run the whole code? As I understand it, each time the computed probability of each word should be the same, and the data (doc_set) has not changed, so I would expect the result to be exactly the same as the previous one, as a matter of mathematical calculation. Is there some random variable in the calculation that makes the result slightly different?
@nasacj When LDA initializes, and even during inference, it uses randomized matrices that introduce noise into the model. This noise is tiny and, given enough data, will generally not affect the end result much. You can control for this in two ways:

1. Set the `random_state` parameter to the same state (any number of your choosing) each time you create a new LDA model. This is one way to get the same training/model results.
2. Build an index mapping documents => inference_results and refer to it instead of re-running inference. This has a memory cost, but will also reduce computation time by avoiding repeated inferences.

Hope that helps!
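The effect of fixing the seed can be illustrated with a plain-Python stand-in (gensim's actual initialization draws gamma-distributed noise from a numpy RandomState; `noisy_init` below is just an illustrative sketch of the same principle):

```python
import random

def noisy_init(num_values, seed):
    # Stand-in for LdaModel's random initialization: the model draws
    # gamma-distributed noise from a seeded generator, so the same seed
    # reproduces the same "random" starting point on every run.
    rng = random.Random(seed)
    return [rng.gammavariate(100.0, 1.0 / 100.0) for _ in range(num_values)]

run_a = noisy_init(5, seed=1)
run_b = noisy_init(5, seed=1)
run_c = noisy_init(5, seed=2)

assert run_a == run_b   # same seed -> bit-identical initialization
assert run_a != run_c   # different seed -> different noise
```

With an unseeded generator, each run starts from different noise, which is exactly the small run-to-run variation described above.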
Hi @cscorley, this makes sense to me: it is the randomized noise that makes the result slightly different. Thank you for your explanation and suggestion.
I've been having this extremely weird issue: when I run the "sklearn_wrapper" notebook, it always produces the same result when I set random_state to a fixed value (1). However, this does not work in a regular Python script: sometimes the next result is the same as a previous one, and sometimes it is not. Any ideas? Is there something that would explain the difference in execution between notebooks and command-line? Thanks.
@yakov-suplari Could you please share the Python code that you are using?
Sure. Please see my code below as well as some bits of info about my Python, numpy, scipy, and gensim.
```python
from gensim.sklearn_integration import SklLdaModel
from gensim.corpora import Dictionary

texts = [
    ['complier', 'system', 'computer'],
    ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
    ['graph', 'flow', 'network', 'graph'],
    ['loading', 'computer', 'system'],
    ['user', 'server', 'system'],
    ['tree', 'hamiltonian'],
    ['graph', 'trees'],
    ['computer', 'kernel', 'malfunction', 'computer'],
    ['server', 'system', 'computer'],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
model = SklLdaModel(num_topics=2, id2word=dictionary, iterations=20, random_state=1)
model.fit(corpus)
print(model.transform(corpus))
```
```
Python 3.5.3 |Anaconda 4.4.0 (64-bit)| (default, Mar 6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
numpy.__version__  '1.12.1'
scipy.__version__  '0.19.1'
import gensim      -> Using TensorFlow backend.
gensim.__version__ '2.2.0'
```
@yakov-suplari My apologies for the delayed response. For the source code you shared above, I am getting the same output every time on Gensim 2.2.0 (with both Python 2 and 3), and from my cursory initial investigation I can't find a reason why you are getting different outputs across runs. I was curious whether this issue is resolved for you; if not, I can look into it further.
Thanks for looking into this, Chinmaya. The issue has not gone away, but I can live with it. I have not seen such behavior before. It could be a very special set of parameters. Let's not worry about it. Thanks.
Hi guys! First of all thanks for the great library! I have also run into the issue that subsequent inferences on the same document are not deterministic and yield slightly different results every time.
Like @cscorley pointed out, that's due to the variational Bayesian inference done in LdaModel.inference: the document topic assignments gamma are randomly initialized from self.random_state, and inference runs until the mean change of the gammas drops below self.gamma_threshold (which can be set when instantiating the LDA model). I am using a twofold fix for now to get reproducible results:
@cscorley also suggested to cache the assignments of already processed documents in an index. I was actually wondering if this could be added as a feature to the LdaModel? After all, the topic word probabilities are stored in self.lambda, why not have a similar field that stores the gammas for all documents (if requested)? This would be extremely handy!
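Schematically, the convergence loop described above looks roughly like this. This is only a sketch, not gensim's actual `LdaModel.inference` code, and `update_gamma` is a hypothetical per-iteration variational update:

```python
def infer_gamma(gamma, update_gamma, gamma_threshold=0.001, max_iters=50):
    # Iterate the variational update until the mean absolute change of the
    # gamma vector drops below gamma_threshold. A looser threshold stops
    # earlier, leaving more of the random initialization in the result --
    # which is why results vary slightly between runs.
    for _ in range(max_iters):
        new_gamma = update_gamma(gamma)
        delta = sum(abs(n - g) for n, g in zip(new_gamma, gamma)) / len(gamma)
        gamma = new_gamma
        if delta < gamma_threshold:
            break
    return gamma
```

Tightening `gamma_threshold` (or raising the iteration cap) pushes successive runs closer to the same fixed point, at the cost of more computation per document.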
If anyone comes across this thread: I've found the correct solution in a Japanese forum post (translate it in the browser if you can't read Japanese): https://qiita.com/suzuki_sh/items/8ad58bd962dfd8879cb3
To get reproducible results, specify the following before each run:
```python
from gensim.utils import get_random_state

model = LdaModel(corpus, num_topics=2, random_state=0, id2word=dictionary)

model.random_state = get_random_state(0)
print(model.get_document_topics(dictionary.doc2bow(["computer"])))

model.random_state = get_random_state(0)
print(model.get_document_topics(dictionary.doc2bow(["computer"])))

model.random_state = get_random_state(0)
print(model.get_document_topics(dictionary.doc2bow(["computer"])))
```
Hi,
I learned that ldamodel.get_document_topics gives slightly different values every time it is called, even when the ldamodel itself is fixed.
Here is code (a modified version of this tutorial: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html) with which you can reproduce this behavior: