piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
15.59k stars 4.38k forks

get_document_topics gives slightly different values every time it is called #591

Closed BeenKim closed 8 years ago

BeenKim commented 8 years ago

Hi,

I noticed that ldamodel.get_document_topics gives slightly different values every time it is called, even when the ldamodel itself is fixed.

Here is code (a modified version of this tutorial: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html) with which you can reproduce this behavior:


from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = get_stop_words('en')

# create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."

# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

# list for tokenized documents
texts = []

# loop through document list
for doc in doc_set:
    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [t for t in tokens if t not in en_stop]

    # stem tokens and collect them
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]
    texts.append(stemmed_tokens)

# turn the tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# generate the LDA model (once, after the loop)
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
print('The following two must have the same value (but they do not)')
print(ldamodel.get_document_topics(corpus[0]))
print(ldamodel.get_document_topics(corpus[0]))
cscorley commented 8 years ago

Yep, that is expected behavior. Each time you call get_document_topics, it will infer the given document's topic distribution again; it has no functionality for remembering the results for documents it has seen in the past. However, the results themselves should be similar if the trained model is good. I'd suggest building an index of the results (maybe using a dict or list) and referring to those as needed.

It's been suggested that we should look at ways for producing deterministic results (#447), but that might be awhile off.

nasacj commented 8 years ago

Hi @cscorley Could you please give a little more detail on why the results are different each time I run the whole code? To my understanding, the computed probability of each word should be the same each time, and the data (doc_set) has not changed, so the result should be exactly the same as the previous one, as a matter of mathematical calculation. Is there some random variable in the calculation that makes the result slightly different?

cscorley commented 8 years ago

@nasacj When LDA initializes, and even during inference, it uses randomized matrices that introduce noise into the model. This noise is tiny, and in general will not affect the end result much -- provided there is enough data. You can control for this in two ways:

  1. By setting the random_state parameter to the same value (any number of your choosing) each time you create a new LDA model. This is one way to get the same training/model results.
  2. Subsequent inferences, however, will get a new gamma matrix each time (as noted above) and will always produce slightly different results. I recommend caching documents' inference results using an index or a simple map of document => inference_result. This has a memory cost, but also reduces computation time by avoiding repeated inference.

Hope that helps!

nasacj commented 8 years ago

Hi @cscorley, this makes sense to me; it is the randomized noise that makes the result slightly different. Thank you for your explanation and suggestion.

yakov-suplari commented 7 years ago

I've been having this extremely weird issue: when I run the "sklearn_wrapper" notebook, it always produces the same result when I set random_state to a fixed value (1). However, this does not work in a regular Python script: sometimes the next result is the same as the previous one, and sometimes it is not. Any ideas? Is there something that would explain the difference in execution between notebooks and the command line? Thanks.

chinmayapancholi13 commented 7 years ago

@yakov-suplari Could you please share the Python code that you are using?

yakov-suplari commented 7 years ago

Sure. Please see my code below as well as some bits of info about my Python, numpy, scipy, and gensim.

from gensim.sklearn_integration import SklLdaModel
from gensim.corpora import Dictionary

texts = [
    ['complier', 'system', 'computer'],
    ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
    ['graph', 'flow', 'network', 'graph'],
    ['loading', 'computer', 'system'],
    ['user', 'server', 'system'],
    ['tree', 'hamiltonian'],
    ['graph', 'trees'],
    ['computer', 'kernel', 'malfunction', 'computer'],
    ['server', 'system', 'computer']
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
model = SklLdaModel(num_topics=2, id2word=dictionary, iterations=20, random_state=1)
model.fit(corpus)
print(model.transform(corpus))

Python 3.5.3 |Anaconda 4.4.0 (64-bit)| (default, Mar 6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux

>>> numpy.__version__
'1.12.1'
>>> scipy.__version__
'0.19.1'
>>> import gensim
Using TensorFlow backend.
>>> gensim.__version__
'2.2.0'

chinmayapancholi13 commented 7 years ago

@yakov-suplari My apologies for the delayed response. For the source code you shared above, I am getting the same output every time with Gensim 2.2.0 (on both Python 2 and Python 3), and from my cursory initial investigation I can't find a reason why you are getting different outputs across runs. Has this issue been resolved for you? If not, I can try to look into it further.

yakov-suplari commented 7 years ago

Thanks for looking into this, Chinmaya. The issue has not gone away, but I can live with it. I have not seen such behavior before. It could be a very special set of parameters. Let's not worry about it. Thanks.

Darkdragon84 commented 6 years ago

Hi guys! First of all thanks for the great library! I have also run into the issue that subsequent inferences on the same document are not deterministic and yield slightly different results every time.

As @cscorley pointed out, that's due to the variational Bayesian inference done in LdaModel.inference: the document-topic assignments gamma are randomly initialized from self.random_state, and inference runs until the mean change of the gammas drops below self.gamma_threshold (which can be set when instantiating the LDA model). I am using a twofold fix for now to get reproducible results:

  1. Significantly lower self.gamma_threshold to something like 1e-10 (rather than the default 0.001). Variations between subsequent inferences should then be of that order of magnitude.
  2. Even with a lower threshold, inference sometimes ends up in different local minima, depending on the random initial values of gamma. For that reason I reset the rng in self.random_state to a new one with the same seed every time I call get_document_topics. (I know this is not optimal, but it is good enough for my application for now.)

@cscorley also suggested caching the assignments of already-processed documents in an index. I was actually wondering whether this could be added as a feature to LdaModel? After all, the topic-word probabilities are stored in self.lambda, so why not have a similar field that stores the gammas for all documents (if requested)? This would be extremely handy!

ChatAround-Dev commented 3 months ago

If anyone comes across this thread: I've found the correct solution in a Japanese forum post (translate with the browser if you can't read Japanese): https://qiita.com/suzuki_sh/items/8ad58bd962dfd8879cb3

To get reproducible results, specify the following before each run:

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.utils import get_random_state

# dictionary and corpus are assumed to be built as earlier in this thread
model = LdaModel(corpus, num_topics=2, random_state=0, id2word=dictionary)

# re-seed the model's rng before each inference call
model.random_state = get_random_state(0)
print(model.get_document_topics(dictionary.doc2bow(["computer"])))
model.random_state = get_random_state(0)
print(model.get_document_topics(dictionary.doc2bow(["computer"])))
model.random_state = get_random_state(0)
print(model.get_document_topics(dictionary.doc2bow(["computer"])))