gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0
627 stars 120 forks source link

Index error when using anchors #34

Open owlas opened 4 years ago

owlas commented 4 years ago

When using the fit method with anchors I get an index error from this line:

https://github.com/gregversteeg/corex_topic/blob/83991482523a8be2b2a8b9c864b273de96c2389a/corextopic/corextopic.py#L185

The error is understandable because if X is a 2d array, then X[:,i] is a 1d slice and thereforeX[:,i].mean(axis=1) is undefined because there is no dimension 1.

I've installed version corextopic==1.0.5 from pypi.

I can reproduce this for any arguments passed to anchors

owlas commented 4 years ago

Is a expected to be a list? I'm wondering if there is something missing in the preprocessing step for turning anchors of words into indices.

gregversteeg commented 4 years ago

Thanks for pointing this out. I think anchors has to be a list of lists: like anchors = [['cat', 'dog'], ['apple']]. Then "a" will be a list, and the slice will always produce an object that has axis=1! However! I note that the examples on the readme do not always match this intuition. I'll ask @ryanjgallagher to check if there is an issue with the readme.

topic_model.fit(X, words=words, anchors=[['dog','cat'], 'apple'], anchor_strength=2) Should be [['dog','cat'], ['apple']]?

topic_model.fit(X, words=words, anchors=['protest', 'protest', 'protest', 'riot', 'riot', 'riot'], anchor_strength=2) Should be [['protest'], ['protest']... ?

topic_model.fit(X, words=words, anchors=[['bernese', 'mountain', 'dog'], ['mountain', 'rocky', 'colorado'], anchor_strength=2) This one is missing a bracket I think.

ryanjgallagher commented 4 years ago

The anchor input for the topic model should be a list. Within that list, the entries can be either strings, ints, or lists, and you should be able to do any combination of those for anchoring. Individual strings or ints (indicating you want to anchor only one word to a topic) are converted to lists with a single entry, and strings are converted to their corresponding column index.

So @gregversteeg isn't quite right, even if you pass just a string or int in the anchors list, it should be preprocessed properly. (He's right that the last example is missing a bracket though, my bad).

@owlas Can you provide a simple reproducible example? I can run the examples in the README without any issues using version 1.0.5, so I'm not sure what error you may be getting.

d-lowl commented 4 years ago

Can report the same for 1.0.6. Although I cannot reproduce the error consistently. For some sets of anchors it works, for some it throws an error (all of those should be valid according to the docs though)

ryanjgallagher commented 4 years ago

@d-lowl Would you be able to provide a minimal reproducible example that does fail?

d-lowl commented 4 years ago

Yeah, I made it work for my case (it was partially a problem in my code), but I still think that there are some edge cases. I will play around with it tomorrow and try to come up with one.

joelplantinga commented 11 months ago

Hi! I am not sure if this repo is still maintained but I ran into the same issue.

I found that single item anchor lists are transferred (back) into single items here: https://github.com/gregversteeg/corex_topic/blob/beea64bc41e62dffc5fb87deb506a3e253be0a6c/corextopic/corextopic.py#L367C27-L367C27

This would contradict @ryanjgallagher comment:

Individual strings or ints (indicating you want to anchor only one word to a topic) are converted to lists with a single entry

Removing the edge case for single item lists solved the problem for me:

if len(new_anchor_list) == 0:
    continue
# if len(new_anchor_list) == 1:
#    processed_anchors.append(new_anchor_list[0])
else:
    processed_anchors.append(new_anchor_list)