lmcinnes / enstop

Ensemble topic modelling with pLSA
BSD 2-Clause "Simplified" License
112 stars 12 forks source link

Call to plsa_refit fails due to missing sample_weight #9

Open ydennisy opened 4 years ago

ydennisy commented 4 years ago

When using model.transform() on new unseen data, the following error occurs:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-30746b9aeea8> in <module>
      1 test_corpus = df1['cleaned_text'].tolist()
      2 test_dtm = vectorizer.transform(test_corpus)
----> 3 test_doc_vecs = model.transform(test_dtm)
      4 labels = np.argmax(test_doc_vecs, axis=1)
      5 

/opt/conda/lib/python3.7/site-packages/enstop/enstop_.py in transform(self, X, y)
    836             n_iter_per_test=5,
    837             tolerance=0.001,
--> 838             random_state=random_state,
    839         )
    840 

TypeError: plsa_refit() missing 1 required positional argument: 'sample_weight'

There seems to be a missing arg here.

Seems a simple fix - I would be happy to make a PR, but I am not sure how to derive the needed arg:

    sample_weight: array of shape (n_docs,)
        Input document weights.

If @lmcinnes you can shed some light here - could be a quick fix!

lmcinnes commented 4 years ago

I thought, in fact, that this is something that had been fixed. The latest version in the repository should, I think, handle this correctly. I just haven't made a release to PyPI recently. Can you verify that the current master works for you on this issue?

ydennisy commented 4 years ago

@lmcinnes thanks for the quick reply! I will give it a go - but I linked to master, which does seem to be missing the args needed.

ydennisy commented 4 years ago

@lmcinnes I can confirm this issue is still occurring even after installing using:

pip install -U git+https://github.com/lmcinnes/enstop.git@067a813b14a95cb14bd87e263cfe0305b49bf03f which is referencing the most recent commit on master.

lmcinnes commented 4 years ago

Hmm, that's less good then :-( There is, at least, theoretically code that guards around this now. I'll have a look soon.

ydennisy commented 4 years ago

@lmcinnes let me know if you can give a little guidance - and I will be happy to try and help out here.

The topics I got on the first run were solid - so would love to continue using enstop!

lmcinnes commented 4 years ago

The short answer is that the sample_weights should effectively simply be all ones unless otherwise specified. In the newer code there is actually a bifurcation and a separate version of the m-step of the em-algorithm that uses sample weights and one that doesn't. Ideally if no sample weights are given we should be taking the latter path and not using sample weights at all.

ydennisy commented 4 years ago

@lmcinnes just to let you know I had a fix running locally which would always set the values of sample_weights to 1s, however there are issues with the kernel being restarted / dying also in this library - for which I am not able to find a cause or the error log.

lmcinnes commented 4 years ago

Hmm, that's more disconcerting. I have been doing a lot of messing around with it a month ago and may have subtly broken something. I'll try to get some time to look at all of this eventually. I certainly appreciate the feedback.

ydennisy commented 4 years ago

@lmcinnes appreciate your replies.

Can you give a rough estimate as to when you think you can take a look?

lmcinnes commented 4 years ago

Right now I have many other pressing things. I am unlikely to get to this until November at the earliest. As a workaround in the meantime you can force the sample weights to be ones. Ideally this shouldn't be a problem at all, so there is something astray somewhere, but setting a set of all one sample weights will brute force a workaround in the meantime while I try to figure out the underlying fix.

ydennisy commented 3 years ago

I would be happy to do this part - the issue with the kernel dying I am not really sure how to fix.

I will play around and report - if I find anything.

ydennisy commented 3 years ago

@lmcinnes Do you think this library is something you plan to maintain?

lmcinnes commented 3 years ago

It is in a bit of limbo right now as I have had to step away from this library to work on other topics.