piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Error in sklearn_api.ldamodel.LdaTransformer: Coherence scorer 'u_mass' #2084

Closed: brycecf closed this issue 6 years ago

brycecf commented 6 years ago

OS: macOS High Sierra 10.13.4
Python Version: 3.6.5
Gensim Version: 3.4.0

I am using the sklearn_api.ldamodel.LdaTransformer in an sklearn RandomizedSearchCV:

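from gensim.corpora import Dictionary
from gensim.sklearn_api import LdaTransformer
from sklearn.model_selection import RandomizedSearchCV

# docs: a list of tokenized documents (each a list of str tokens)
term_dct = Dictionary(docs)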
bow_corpus = [term_dct.doc2bow(doc) for doc in docs]

param_dict = {
    'num_topics': [4, 6, 8, 20, 25, 40],
    'decay': [0.01, 0.05]
}
lda_model = LdaTransformer(id2word=term_dct,
                           passes=10,
                           iterations=100,
                           alpha='auto',
                           eta='auto',
                           scorer='u_mass')
lda_cv = RandomizedSearchCV(lda_model, param_dict, n_iter=12, n_jobs=5, cv=5, verbose=2)
lda_cv.fit(bow_corpus)

Once the hyperparameter search hits num_topics=20 and decay=0.01, I start getting these warnings:

/usr/local/opt/miniconda3/envs/.../lib/python3.6/site-packages/gensim/models/ldamodel.py:775: RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)

Then, when the search finishes, I get a very long stack trace: a TransportableException caused by a ZeroDivisionError.

ZeroDivisionError: float division by zero
___________________________________________________________________________

During handling of the above exception, another exception occurred:

JoblibZeroDivisionError                   Traceback (most recent call last)
<ipython-input-8-c2e0eda40a9d> in <module>()
----> 1 lda_cv.fit(bow_corpus)

/usr/local/opt/miniconda3/envs/.../lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    637                                   error_score=self.error_score)
    638           for parameters, (train, test) in product(candidate_params,
--> 639                                                    cv.split(X, y, groups)))
    640 
    641         # if one choose to see train score, "out" will contain train score info

/usr/local/opt/miniconda3/envs/.../lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    787                 # consumption.
    788                 self._iterating = False
--> 789             self.retrieve()
    790             # Make sure that we get a last message telling us we are done
    791             elapsed_time = time.time() - self._start_time

/usr/local/opt/miniconda3/envs/.../lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
    738                     exception = exception_type(report)
    739 
--> 740                     raise exception
    741 
    742     def __call__(self, iterable):

JoblibZeroDivisionError: JoblibZeroDivisionError
___________________________________________________________________________

From what I understand of the u_mass formula, and my BoW representations, I am not sure what is going on here...

stacktrace.txt

brycecf commented 6 years ago

Looking at #1445, it looks like the issue is that one of the top n tokens does not appear in the fold that was left out. That leads me to my next question: is there any way to address this and still use k-fold cross-validation?
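
To make the failure mode above concrete, here is a minimal pure-Python sketch of a UMass-style score computed on a held-out fold. It is not gensim's implementation. The helper names (`doc_freq`, `u_mass`), the `eps` smoothing value, and the `skip_missing` flag are all assumptions made for illustration. The score sums log((D(w_m, w_l) + eps) / D(w_l)) over pairs of top words, where D counts documents in the scoring corpus, so a top word that never occurs in the held-out fold puts a zero in the denominator.

```python
import math


def doc_freq(docs, *words):
    """Count the documents (token sets) that contain every word in `words`."""
    return sum(1 for doc in docs if all(w in doc for w in words))


def u_mass(top_words, corpus, eps=1.0, skip_missing=False):
    """UMass-style coherence of one topic, scored against `corpus`.

    top_words:    the topic's top-n tokens, most probable first.
    corpus:       iterable of tokenized documents.
    eps:          smoothing added to the co-document count (assumed value).
    skip_missing: hypothetical guard; drop pairs whose conditioning word
                  has zero document frequency instead of dividing by zero.
    """
    docs = [set(doc) for doc in corpus]
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(docs, top_words[l])
            if d_l == 0 and skip_missing:
                continue  # conditioning word absent from this fold
            d_ml = doc_freq(docs, top_words[m], top_words[l])
            # Raises ZeroDivisionError when d_l == 0 and the guard is off.
            score += math.log((d_ml + eps) / d_l)
    return score


# Toy folds: "lda" is a top word learned on the training fold,
# but it never occurs in the held-out fold.
train_fold = [["topic", "model"], ["topic", "lda"], ["lda", "model"]]
held_out = [["topic", "model"], ["model", "corpus"]]
top_words = ["topic", "lda", "model"]

try:
    u_mass(top_words, held_out)
except ZeroDivisionError as exc:
    print("unguarded score failed:", exc)

print("guarded score:", u_mass(top_words, held_out, skip_missing=True))
```

Skipping pairs keeps the search running, but it changes what the score measures, so scores from folds with many missing words may not be comparable to the others. Whether that trade-off is acceptable, or whether scoring coherence on the training documents is a better fit, depends on the use case.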