piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

gensim.models.LDAmodel producing NaN & same words in each topic #2115

Closed czhao028 closed 5 years ago

czhao028 commented 6 years ago

Description

Here is a brief introduction on StackOverflow; I thought I'd post this here too because the other StackOverflow question with the exact same issue as mine hasn't gotten even a single response in 2 weeks.

Link: https://stackoverflow.com/questions/51142294/gensim-ldamodel-error-nan-and-all-topics-the-same

Steps/Code/Corpus to Reproduce

#create pandas frame object w/ default rows
def tokenize(pd_object):
    for i, row in pd_object.iterrows():
        id = row["ID"]
        sentences = split_sentences(str(row["Comment"]))
        """ **Time Consuming** """
        tokens =  [[id, sent, gensim.parsing.preprocessing.preprocess_string(sent.lower(), filters=[strip_punctuation,
            strip_multiple_whitespaces, strip_numeric, strip_short, wordnet_stem])] for sent in sentences]
#append tokens to new pandas dataframe object 
def train(pd_object):
    t1 = time.time()
    phrases_and_tokens = tokenize(pd_object)
    bag_of_words = phrases_and_tokens["Tokens"].tolist()
    t2 = time.time()
    print("Time Taken %12f" % (t2-t1))

    bigram = gensim.models.Phrases(bag_of_words, threshold=1)
    bigram_mod = gensim.models.phrases.Phraser(bigram)

    texts = [filter_stop(bigram_mod[t]) for t in bag_of_words]

    id2word = corpora.Dictionary(texts)
    sent_wordfreq = [id2word.doc2bow(sent) for sent in texts]

    lda_model = gensim.models.ldamodel.LdaModel(corpus=sent_wordfreq,
                                                id2word=id2word,
                                                num_topics=5)

    print(lda_model.print_topics())


Expected Results

Something like this:

[(0,
  '0.025*"game" + 0.018*"team" + 0.016*"year" + 0.014*"play" + 0.013*"good" + '
  '0.012*"player" + 0.011*"win" + 0.007*"season" + 0.007*"hockey" + '
  '0.007*"fan"'),
 (1,
  '0.021*"window" + 0.015*"file" + 0.012*"image" + 0.010*"program" + '
  '0.010*"version" + 0.009*"display" + 0.009*"server" + 0.009*"software" + '
  '0.008*"graphic" + 0.008*"application"'),
 (2,
  '0.021*"gun" + 0.019*"state" + 0.016*"law" + 0.010*"people" + 0.008*"case" + '
  '0.008*"crime" + 0.007*"government" + 0.007*"weapon" + 0.007*"police" + '
  '0.006*"firearm"'),
 (3,
  '0.855*"ax" + 0.062*"max" + 0.002*"tm" + 0.002*"qax" + 0.001*"mf" + '
  '0.001*"giz" + 0.001*"_" + 0.001*"ml" + 0.001*"fp" + 0.001*"mr"'),
 (4,
  '0.020*"file" + 0.020*"line" + 0.013*"read" + 0.013*"set" + 0.012*"program" '
  '+ 0.012*"number" + 0.010*"follow" + 0.010*"error" + 0.010*"change" + '
  '0.009*"entry"'),
 (5,
  '0.021*"god" + 0.016*"christian" + 0.008*"religion" + 0.008*"bible" + '
  '0.007*"life" + 0.007*"people" + 0.007*"church" + 0.007*"word" + 0.007*"man" '
  '+ 0.006*"faith"'),
 (..truncated..)]

Actual Results

[(0, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ....
(1, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ...
(2, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ...
(3, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ..)
(4, 'nan*"datalabs" + nan*"india" + nan*"frequently" + nan*"inconsistency" + nan*"standard" + ..)]


Versions

>>> import platform; print(platform.platform())
Darwin-17.6.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.14.5
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.1.0
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.4.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1

I think it probably has to do with a numpy issue but all my attempts to upgrade and reinstall have been fruitless. Another coworker ran this on his computer and it worked just fine. Probably a recent update in numpy has caused this recent issue (there's one other person who posted it on StackOverflow 2 weeks ago) but uninstalling packages has broken so many things that I don't want to take the risk. However, I am trying to learn how to use virtual environments and see if I can test out different versions of numpy with this code. Thank you! Hope to get a response soon.

groceryheist commented 6 years ago

I also encountered this issue. I believe it is a bug introduced in a recent version of Gensim. Downgrading to gensim 3.1.0 solved the problem for me.

RanAR90 commented 6 years ago

Hello all

I have also encountered the same problem; I am using gensim 3.5.0. I trained a couple of other models earlier, and they were all fine. I only got this when training on a corpus of 100K English Wikipedia articles. I got these warnings during training:

RuntimeWarning: divide by zero encountered in log
diff = np.log(self.expElogbeta)
RuntimeWarning: overflow encountered in add
sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)

I was using 30 passes in the training, but when I changed the number of passes to 1, I got normal topics, with only the divide-by-zero warning. Just thought sharing this might help.

Best Regards

menshikh-iv commented 6 years ago

Thanks for the report, @czhao028.

Can you please share a small reproducible example? I can't run your code as posted because it is incomplete and the data is missing.

czhao028 commented 6 years ago

I can't give you a sample of my data because it's confidential data, but phrases_and_tokens["Phrases"] is a pd.Series object containing rows of keywords created by this portion of the code:

[gensim.parsing.preprocessing.preprocess_string(sent.lower(), filters=[strip_punctuation, strip_multiple_whitespaces, strip_numeric, strip_short, wordnet_stem]) for sent in sentences]

After reviewing the tokenize method, I realized it's outdated, so I've included the most recent version below:

[Screenshot, 2018-07-31: updated tokenize method; image not preserved]

Here token_helper is essentially the first line of code I mentioned earlier in this comment.

menshikh-iv commented 6 years ago

@czhao028 that's really sad, because if we can't reproduce an issue, we have no chance of fixing it. Can you try to reproduce this error with a publicly available dataset, please?

czhao028 commented 6 years ago

@RaniemAR you wanna jump in here?

menshikh-iv commented 6 years ago

@czhao028 ping

snollygoster123123 commented 6 years ago

I have the same problem. As soon as I try 70 or more topics, I only get NaN. I tried it on two different computers and with many combinations of gensim and numpy versions. Nothing helped.

menshikh-iv commented 6 years ago

Hi @snollygoster123123, can you give more information (exact code with dataset, Python/OS/gensim versions)? We need to reproduce this issue first.

snollygoster123123 commented 6 years ago

Hi @menshikh-iv, I solved the problem by switching to the single-core LDA model. My dataset consists of 60k documents (each approximately as long as a Wikipedia article). It worked fine with LdaMulticore for 10-60 topics. Anything above that results in only NaNs. The line for the LDA was:

lda = gensim.models.ldamulticore.LdaMulticore(corpus, 
    id2word=dictionary, num_topics=80, chunksize=1800, passes=20, 
    workers=1, eval_every=1, iterations=1000)

I think my post doesn't belong in this issue, because the OP is using the single-core model. If you want to, you can delete my post or move it.

menshikh-iv commented 6 years ago

@snollygoster123123 can you share corpus please?

snollygoster123123 commented 6 years ago

@menshikh-iv The corpus is 170 MB. The only way I had to upload it was uploaded.net: http://uploaded.net/file/3bzy5v6p If you have a better way, please let me know. I now have the same problem with the single-core version too (whenever I use 80 or more topics). If you are able to generate 80 topics, please let me know.

piskvorky commented 6 years ago

@menshikh-iv if downgrading to Gensim 3.1.0 helps like @groceryheist says, it must be an issue with the recent additions. IIRC, there was some PR that reimplemented parts of LDA in C/Cython, right?

Maybe it used the wrong precision (floats, single precision)? If the error is due to such numeric issues, I'm thinking it's possible it only manifests itself on larger datasets, and so our unit tests didn't catch it.
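
As a rough illustration of why precision could matter here, the sketch below compares how negative a log-value can get before its exponential stops being representable in single versus double precision. It uses only the Python standard library rather than NumPy, and the names `tiny32` and `tiny64` are made up for this example; it is not taken from gensim.

```python
import math
import struct
import sys

# Smallest positive values each format can represent (subnormals included).
tiny32 = struct.unpack("<f", b"\x01\x00\x00\x00")[0]   # 2**-149, single precision
tiny64 = sys.float_info.min * sys.float_info.epsilon    # 2**-1074, double precision

# exp(x) can no longer be represented once x drops below roughly these values.
print("float32 underflow threshold: %.1f" % math.log(tiny32))
print("float64 underflow threshold: %.1f" % math.log(tiny64))
```

The single-precision threshold is roughly seven times closer to zero than the double-precision one, so a log-expectation that is harmless in float64 can already turn into an exact zero in float32.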

csmyth76 commented 6 years ago

If you want to reproduce the error: I get it when I run the code here: https://datascienceplus.com/topic-modeling-in-python-with-nltk-and-gensim/

...and remove:
if random.random() > .99:

There's a link on that page to the github that has the code and corpus.

menshikh-iv commented 6 years ago

@piskvorky I think you are right, this is definitely a numeric issue. It may also be related to #1927.

piskvorky commented 6 years ago

@csmyth76 @snollygoster123123 @RaniemAR @czhao028 any appetite for looking into this and fixing the numerical bug?

I don't know when we'll get to this ourselves, so help would be welcome. It looks like a serious issue with a potentially simple fix.

johann-petrak commented 6 years ago

I just got the same problem, out of the blue, running gensim version 3.4.0.

anaconda3/lib/python3.6/site-packages/gensim/models/ldamodel.py:775: RuntimeWarning: divide by zero encountered in log
diff = np.log(self.expElogbeta)

I have run the same task without problems with slightly different versions of the corpus. So it seems there is some very specific situation here which cannot easily be reproduced by a minimal test case. Sadly, I also cannot share the corpus, as it is licensed.

I do not really understand the code enough to be of much help but would a simple guard against trying to get log(0) and setting diff to 0 in that case be a workaround here?

Apparently expElogbeta is set to np.exp(self.state.get_Elogbeta()) which means that the result of self.state.get_Elogbeta() must be -Inf which in turn means that dirichlet_expectation(self.get_lambda()) must be -Inf which means that self.get_lambda() must be zero? Not sure how that could ever happen or if my train of thought is wrong here ...
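
The chain above can be traced numerically. The following is a pure standard-library sketch, not gensim code: `digamma`, `dirichlet_expectation`, `to_float32` and `log_like_numpy` are stand-ins written for this illustration, and the weights in `lam` are invented. Under those assumptions it suggests that lambda does not have to be exactly zero: a small but positive value can already make the exponential underflow to 0 in single precision, after which the log is -inf.

```python
import math
import struct


def digamma(x):
    """Approximate digamma for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def dirichlet_expectation(row):
    """E[log beta] for one Dirichlet-distributed row: psi(v) - psi(sum(row))."""
    total = digamma(sum(row))
    return [digamma(v) - total for v in row]


def to_float32(x):
    """Round a double to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def log_like_numpy(x):
    """math.log(0.0) raises; NumPy warns and returns -inf, which is mimicked here."""
    return math.log(x) if x > 0.0 else -math.inf


lam = [0.01, 500.0, 499.99]               # made-up topic-word weights
elog_beta = dirichlet_expectation(lam)    # first entry is about -107
exp_double = [math.exp(v) for v in elog_beta]
exp_single = [to_float32(v) for v in exp_double]
logs = [log_like_numpy(v) for v in exp_single]

print(exp_double[0], exp_single[0], logs[0])
```

In double precision the first exponential is tiny but positive; after rounding to single precision it is exactly zero, which is the situation that would produce the divide-by-zero warning in the log.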

johann-petrak commented 6 years ago

This closed issue appears to be about the same problem and may contain relevant information: #217

johann-petrak commented 6 years ago

OK, I checked and in my case there are many values in self.state.sstats which are zero. Then self.expElogbeta = np.exp(dirichlet_expectation(self.state.sstats)) and diff = np.log(self.expElogbeta) and then taking the mean of anything that has at least one Inf value in it causes the topic diff to be Inf.

Now, I do not know exactly what the implications should be if some sstats are zero, but I think they should definitely not have an influence on the topic diff like this, but maybe also not on other locations where we get +/- Inf or NaN because of those zeroes? The code appears to alternate between calculating the exp and log (which is -Inf for 0) quite frequently, and the digamma function for the dirichlet expectation (which is Inf for 0). Maybe there is a strategy to correctly handle the calculation of these functions for values which are ultimately coming from those zero counts?
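
To make the effect on the topic diff concrete, here is a small standard-library sketch. The `diff` values are invented, and `mean_abs_finite` is a hypothetical guard of the kind suggested above, not something gensim provides.

```python
import math


def mean_abs(values):
    """Plain mean of absolute values; a single infinity dominates the result."""
    return sum(abs(v) for v in values) / len(values)


def mean_abs_finite(values):
    """Hypothetical guard: average only the finite entries."""
    finite = [abs(v) for v in values if math.isfinite(v)]
    return sum(finite) / len(finite) if finite else 0.0


diff = [0.02, -0.01, -math.inf, 0.03]   # one entry came from log(0)

print(mean_abs(diff))           # infinite
print(mean_abs_finite(diff))    # close to 0.02
print(math.inf - math.inf)      # nan: infinities that meet in a subtraction
```

A guard like this would only keep the reported diff readable. It would not remove the zeros that produce the infinities, so it is unlikely to be a fix for the NaN topics on its own.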

zkwhandan commented 5 years ago

I found a solution to this problem. At line 666 in ldamodel.py, there is a TODO:

# TODO treat zeros explicitly, instead of adding epsilon?
eps = DTYPE_TO_EPS[self.dtype]
phinorm = np.dot(expElogthetad, expElogbetad) + eps

This eps is too small. When I increase it, the NaNs disappear.
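
One plausible way a too-small epsilon could lead to NaN is sketched below in standard-library Python. It assumes the dot product has underflowed to 0, so that phinorm equals eps, and it rounds every step to single precision. The helper names `f32` and `accumulate`, the count of 5, the 1000 occurrences, and the comparison value 1e-35 are all assumptions for illustration; only 1e-25 comes from this workaround.

```python
import math
import struct


def f32(x):
    """Round to single precision; overflow becomes +/-inf as in IEEE 754."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def accumulate(eps, count=5.0, occurrences=1000):
    """Add count / (0 + eps) repeatedly, rounding each step to float32."""
    phinorm = f32(0.0 + eps)    # dot product assumed to have underflowed to 0
    total = 0.0
    for _ in range(occurrences):
        total = f32(total + f32(count / phinorm))
    return total


small = accumulate(1e-35)   # assumed smaller epsilon, for comparison only
large = accumulate(1e-25)   # the float32 value used in this workaround

print(small, large)
print(small * 0.0)          # inf * 0 is nan
```

With the smaller epsilon the running sum exceeds the float32 range and becomes inf, and multiplying that by a zero gives nan. With the larger epsilon the sum stays finite in this toy setup.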

Create a file (lda_model_modify.py):

from gensim.models.ldamodel import *

DTYPE_TO_EPS = {
    np.float16: 1e-5,
    np.float32: 1e-25,      # <<<<=========== THE VALUE I CHANGE ===========
    np.float64: 1e-100,
}

def inference(self, chunk, collect_sstats=False):
    try:
        len(chunk)
    except TypeError:
        # convert iterators/generators to plain list, so we have len() etc.
        chunk = list(chunk)
    if len(chunk) > 1:
        logger.debug("performing inference on a chunk of %i documents", len(chunk))

    # Initialize the variational distribution q(theta|gamma) for the chunk
    gamma = self.random_state.gamma(100., 1. / 100., (len(chunk), self.num_topics)).astype(self.dtype, copy=False)
    Elogtheta = dirichlet_expectation(gamma)
    expElogtheta = np.exp(Elogtheta)

    assert Elogtheta.dtype == self.dtype
    assert expElogtheta.dtype == self.dtype

    if collect_sstats:
        sstats = np.zeros_like(self.expElogbeta, dtype=self.dtype)
    else:
        sstats = None
    converged = 0

    for d, doc in enumerate(chunk):
        if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
            # make sure the term IDs are ints, otherwise np will get upset
            ids = [int(idx) for idx, _ in doc]
        else:
            ids = [idx for idx, _ in doc]
        cts = np.array([cnt for _, cnt in doc], dtype=self.dtype)
        gammad = gamma[d, :]
        Elogthetad = Elogtheta[d, :]
        expElogthetad = expElogtheta[d, :]
        expElogbetad = self.expElogbeta[:, ids]

        # The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_w.
        # phinorm is the normalizer.
        # TODO treat zeros explicitly, instead of adding epsilon?
        eps = DTYPE_TO_EPS[self.dtype]
        phinorm = np.dot(expElogthetad, expElogbetad) + eps

        # Iterate between gamma and phi until convergence
        for _ in xrange(self.iterations):
            lastgamma = gammad
            gammad = self.alpha + expElogthetad * np.dot(cts / phinorm, expElogbetad.T)
            Elogthetad = dirichlet_expectation(gammad)
            expElogthetad = np.exp(Elogthetad)
            phinorm = np.dot(expElogthetad, expElogbetad) + eps
            # If gamma hasn't changed much, we're done.
            meanchange = mean_absolute_difference(gammad, lastgamma)
            if meanchange < self.gamma_threshold:
                converged += 1
                break
        gamma[d, :] = gammad
        assert gammad.dtype == self.dtype
        if collect_sstats:
            # Contribution of document d to the expected sufficient
            # statistics for the M step.
            sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)

    if len(chunk) > 1:
        logger.debug("%i/%i documents converged within %i iterations", converged, len(chunk), self.iterations)

    if collect_sstats:
        # This step finishes computing the sufficient statistics for the
        # M step, so that
        # sstats[k, w] = \sum_d n_{dw} * phi_{dwk}
        # = \sum_d n_{dw} * exp{Elogtheta_{dk} + Elogbeta_{kw}} / phinorm_{dw}.
        sstats *= self.expElogbeta
        assert sstats.dtype == self.dtype

    assert gamma.dtype == self.dtype
    return gamma, sstats

def modify_lda_inference():
    LdaModel.inference = inference

Usage:

from lda_model_modify import modify_lda_inference
modify_lda_inference()
from gensim.models import LdaMulticore
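
The patching pattern used above can be shown with toy classes. In this sketch `Base` and `Multicore` are hypothetical stand-ins for LdaModel and LdaMulticore, and it assumes the multicore class inherits inference() from the base class rather than overriding it.

```python
class Base:
    def inference(self):
        return "original"


class Multicore(Base):
    """Inherits inference() from Base instead of defining its own."""


def patched_inference(self):
    return "patched"


def modify_inference():
    # Rebinding the attribute on the base class changes the method for
    # every subclass that does not override it.
    Base.inference = patched_inference


modify_inference()
print(Multicore().inference())  # patched
```

Because the method is looked up on the class at call time, the patch applies to instances created before or after modify_inference() runs, as long as it runs before inference() is called.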

menshikh-iv commented 5 years ago

Nice catch, thanks @zkwhandan :+1:

Yukisu03 commented 5 years ago

I also hit this problem when running LDA from the Gensim library. Here is the warning:

/anaconda3/lib/python3.6/site-packages/gensim/models/ldamodel.py:678: RuntimeWarning: overflow encountered in exp
expElogthetad = np.exp(Elogthetad)

After going through the answers above, I updated NumPy and Gensim to their latest versions, but the problem persists. My dataset includes about 10,000 tweets. By the way, when I tried with only 5 tweets, the topics were generated without problems.

Hope to get a response soon. Thank you!

notAmine commented 5 years ago

I'm encountering the same issue when using a large number of topics (200+).