vi3k6i5 / GuidedLDA

semi supervised guided topic model with custom guidedLDA
Mozilla Public License 2.0
497 stars 108 forks

getting 'too many indices for array' error when trying to print out topic results #19

Open tgrover2 opened 5 years ago

tgrover2 commented 5 years ago

Hi there,

I'm trying to run this program on my own data. The guided topic model itself fit as expected, but I get an error when I use your code to print out the resulting seeded topics:

n_top_words = 10
topic_word = model.topic_word_
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

The error is raised at the line topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1], and it reads IndexError: too many indices for array.

My vocab object is a Python dictionary, as expected, with the word as the key and the ID as the value, like in your tutorial.

{'level': 23949, 'nationalsozialistische': 27680, 'boyish': 4847, 'uprising': 44406, 'reached': 34053, 'infinitesimal': 20852, 'humiliated': 19720, 'fundraise': 16348, 'reprogram': 35089, 'nwf': 28830, 'impolite': 20381, 'upmu': 44393, 'stomp': 40042, 'reassertion': 34162, 'matthjews': 25541, 'kokesh': 23156, 'seize': 37167, 'proven': 32956, 'rted': 36093, 'streams': 40190, 'jvx': 22572, 'deformation': 10161, 'schoolkids': 36798, 'agonising': 865, 'skellington': 38332, 'xvideos': 46943, 'hills': 19027, 'francoist': 15947, 'hitters': 19140, 'urination': 44472, 'crowdfund': 9114, 'fivethirtyeight': 15321, 'flagbearers': 15362, 'shoah': 37862, 'uncritically': 43738, 'heretics': 18837, 'congressional': 8097, 'slayin': 38487, 'kickerdaily': 22901, 'blogging': 4382, 'riot': 35685, 'consciously': 8154, 'attention': 2656, 'tik': 42227, 'pfft': 31040, 'steppe': 39913, 'eigene': 12762, 'drag': 12040, 'insectivore': 21073, 'premiere': 32308, 'outing': 29750, 'citizenry': 6985, 'repute': 35126, 'savvy': 36620, 'artfag': 2289, 'twinkies': 43330, 'supporting': 40785, 'escaped': 13642, 'shhiiiieeeetttt': 37692, 'yellow': 47058, 'rationality': 33954, 'sighting': 38107, 'negotiation': 27908, 'adults': 612, 'overflowing': 29884 etc, etc...

Any insight into what I might be missing or doing wrong would be greatly appreciated. I am more experienced with R than Python, so I'm not used to all the nuances of Python.

Thanks in advance!

deepakkumar98355 commented 5 years ago

It worked for me when I used vocab = cv.get_feature_names():

model = guidedlda.GuidedLDA(n_topics=10, n_iter=500, random_state=7, refresh=20)
model.fit(X)

topic_word = model.topic_word_
n_top_words = 20
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(cv.get_feature_names())[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
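For anyone wondering why swapping the dictionary for a list helps: np.array() on a dict gives a 0-dimensional object array, and indexing that with the output of np.argsort raises the IndexError from the original post. Here is a rough sketch of the failure and one possible workaround. The toy vocabulary and the topic_word matrix are made up for illustration (they stand in for the real vocab and for model.topic_word_), and the sketch assumes the IDs in the dict run from 0 to n-1 and match the columns of X.

```python
import numpy as np

# Hypothetical toy vocabulary in the same word -> ID shape as the one posted above.
vocab = {"level": 2, "riot": 0, "attention": 1}

# np.array(dict) yields a 0-dimensional object array, so indexing it fails.
try:
    np.array(vocab)[np.array([0, 1, 2])]
except IndexError as err:
    error_message = str(err)

# Order the words by ID so that position i holds the word for column i.
# Assumes the IDs are contiguous and start at 0.
vocab_list = [word for word, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# Made-up stand-in for model.topic_word_: one row per topic, one column per word ID.
topic_word = np.array([[0.1, 0.7, 0.2],
                       [0.6, 0.1, 0.3]])

n_top_words = 2
top_words_per_topic = []
for i, topic_dist in enumerate(topic_word):
    # Sort ascending by weight, then take the last n_top_words in reverse order.
    topic_words = np.array(vocab_list)[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
    top_words_per_topic.append([str(w) for w in topic_words])
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
```

If the vocabulary came from a vectorizer, taking its feature-name list directly (as in the reply above) gives the same ordering without the manual sort.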

arthi-rajendran24 commented 2 years ago

Hi @deepakkumar98355, what does "cv" represent here? I know cv doesn't have a function "get_feature_names".
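The thread never defines cv. A likely reading, and it is only an assumption, is that it is a fitted scikit-learn CountVectorizer used to build X. In scikit-learn, get_feature_names was deprecated in 1.0 and removed in 1.2 in favour of get_feature_names_out, which would explain the method being missing on a recent install. A small sketch of a helper that works with either; feature_names and the two stand-in classes are hypothetical names used only so the example runs without scikit-learn:

```python
def feature_names(vectorizer):
    """Return a fitted vectorizer's vocabulary as a plain list, ordered by column index."""
    # Newer scikit-learn versions expose get_feature_names_out.
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    # Older versions only have get_feature_names.
    return list(vectorizer.get_feature_names())


# Hypothetical stand-ins for a fitted vectorizer, one per API generation.
class _NewStyleVectorizer:
    def get_feature_names_out(self):
        return ["attention", "level", "riot"]


class _OldStyleVectorizer:
    def get_feature_names(self):
        return ["attention", "level", "riot"]


vocab_new = feature_names(_NewStyleVectorizer())
vocab_old = feature_names(_OldStyleVectorizer())
print(vocab_new)
```

With a real vectorizer, vocab = feature_names(cv) could then be passed to np.array(vocab) in the printing loop.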

arthi-rajendran24 commented 2 years ago

Hi @tgrover2, were you able to fix your issue? I am getting the same error. Please share your solution if you have one.