joewandy / hlda

Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model
GNU General Public License v3.0
145 stars · 38 forks

Number of topics within the levels? #5

Closed · reallynotabot closed this issue 6 years ago

reallynotabot commented 6 years ago

How does the hLDA decide the number of topics within the levels? And can we set it?

From the results in the bbc_test notebook after 50 iterations: the node topic=4 at level=1 has 3 subtopics.

topic=0 level=0 (documents=401): peopl, thi, use, technolog, get, 
    topic=1 level=1 (documents=148): user, mobil, network, servic, softwar, 
        topic=2 level=2 (documents=78): email, secur, viru, net, firm, 
        topic=3 level=2 (documents=53): appl, music, patent, mac, law, 
        topic=12 level=2 (documents=10): game, yahoo, learn, sim, educ, 
        topic=27 level=2 (documents=7): ink, print, elect, film, cinema, 
    topic=4 level=1 (documents=40): mobil, phone, top, game, like, 
        topic=5 level=2 (documents=10): radio, podcast, listen, hiphop, world, 
        topic=16 level=2 (documents=16): player, librari, blog, survey, american, 
        topic=23 level=2 (documents=14): game, award, play, titl, mobil, 

However, when I run the same thing, the corresponding level-1 node (topic=3) has 4 subtopics (see below).

The topic numbering is also different - it's topic=3 instead of topic=4 as in your example.

topic 0 (level=0, total_words=46224, documents=401): peopl, use, thi, technolog, one, 
    topic 1 (level=1, total_words=5442, documents=106): game, system, player, play, new, 
        topic 2 (level=2, total_words=6272, documents=62): servic, mobil, music, phone, digit, 
        topic 7 (level=2, total_words=3173, documents=23): game, blog, simonetti, bittorr, file, 
        topic 8 (level=2, total_words=1981, documents=13): gadget, list, mobil, soni, phone, 
        topic 17 (level=2, total_words=797, documents=8): robot, opera, voic, human, asimo, 
    topic 3 (level=1, total_words=14010, documents=222): phone, mobil, music, peopl, servic, 
        topic 4 (level=2, total_words=5900, documents=70): game, consol, soni, sale, releas, 
        topic 5 (level=2, total_words=11670, documents=124): secur, user, search, site, microsoft, 
        topic 18 (level=2, total_words=758, documents=8): softwar, bill, spywar, comput, law, 
        topic 19 (level=2, total_words=2238, documents=20): podcast, domain, appl, dvd, radio,
reallynotabot commented 6 years ago

The results also don't seem to be reproducible. Below is an example with the same parameters, calling the hlda.estimate method twice and getting two different topic models.

Parameters

n_samples = 500       # no of iterations for the sampler
alpha = 10.0          # smoothing over level distributions
gamma = 1.0           # CRP smoothing parameter; the number of imaginary customers at the next, as-yet-unused table
eta = 0.1             # smoothing over topic-word distributions
num_levels = 3        # the number of levels in the tree
display_topics = 500   # the number of iterations between printing a brief summary of the topics so far
n_words = 5           # the number of most probable words to print for each topic after model estimation
with_weights = False  # whether to print the words with the weights

Code:

hlda = HierarchicalLDA(new_corpus, vocab, alpha=alpha, gamma=gamma, eta=eta, num_levels=num_levels)
hlda.estimate(n_samples, display_topics=display_topics, n_words=n_words, with_weights=with_weights)

Results in 1st iteration:

HierarchicalLDA sampling
.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... 500
topic 0 (level=0, total_words=45389, documents=401): use, peopl, thi, one, could, 
    topic 1 (level=1, total_words=12264, documents=200): mobil, digit, music, technolog, phone, 
        topic 2 (level=2, total_words=7285, documents=63): secur, program, softwar, microsoft, email, 
        topic 3 (level=2, total_words=7447, documents=93): game, consol, soni, develop, nintendo, 
        topic 17 (level=2, total_words=2046, documents=16): gadget, mobil, list, robot, soni, 
        topic 28 (level=2, total_words=1400, documents=15): librari, user, project, internet, skype, 
        topic 31 (level=2, total_words=1581, documents=13): blog, appl, blogger, journalist, inform, 
    topic 4 (level=1, total_words=2700, documents=25): titl, like, halo, offer, graphic, 
        topic 13 (level=2, total_words=2757, documents=25): game, play, time, mobil, player, 
    topic 8 (level=1, total_words=2650, documents=60): music, digit, internet, player, appl, 
        topic 9 (level=2, total_words=2841, documents=25): file, network, softwar, legal, fileshar, 
        topic 18 (level=2, total_words=1848, documents=16): site, email, blog, websit, donat, 
        topic 21 (level=2, total_words=1689, documents=19): technolog, comput, robot, human, creativ, 
    topic 14 (level=1, total_words=3061, documents=71): game, servic, music, player, mobil, 
        topic 15 (level=2, total_words=3105, documents=34): phone, mobil, camera, peopl, broadband, 
        topic 16 (level=2, total_words=1989, documents=15): attack, site, data, spam, net, 
        topic 24 (level=2, total_words=2240, documents=14): dvd, highdefinit, game, film, technolog, 
        topic 34 (level=2, total_words=699, documents=8): china, podcast, map, cafe, chines, 
    topic 25 (level=1, total_words=1996, documents=45): mobil, technolog, data, broadband, phone, 
        topic 26 (level=2, total_words=1087, documents=12): ink, laser, light, use, uwb, 
        topic 30 (level=2, total_words=1005, documents=12): music, colour, project, wong, softwar, 
        topic 32 (level=2, total_words=655, documents=8): yahoo, search, googl, standard, carpent, 
        topic 33 (level=2, total_words=661, documents=6): rfid, tag, game, consum, survey, 
        topic 35 (level=2, total_words=873, documents=7): speed, network, second, mbp, hsdpa, 

Results with 2nd iteration:

HierarchicalLDA sampling
.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... 500
topic 0 (level=0, total_words=46321, documents=401): peopl, technolog, thi, game, new, 
    topic 1 (level=1, total_words=851, documents=17): use, creativ, blogger, blog, user, 
        topic 2 (level=2, total_words=958, documents=11): phone, map, mobil, use, data, 
        topic 33 (level=2, total_words=790, documents=6): hiphop, millan, world, rap, music, 
    topic 3 (level=1, total_words=13170, documents=180): user, use, comput, net, system, 
        topic 4 (level=2, total_words=12859, documents=147): mobil, phone, use, music, gadget, 
        topic 9 (level=2, total_words=3601, documents=33): game, use, research, robot, comput, 
    topic 6 (level=1, total_words=1918, documents=47): user, search, web, onlin, googl, 
        topic 8 (level=2, total_words=2339, documents=30): servic, broadband, peopl, mobil, net, 
        topic 16 (level=2, total_words=1633, documents=17): appl, comput, mac, print, ipod, 
    topic 11 (level=1, total_words=3490, documents=61): system, firm, file, softwar, mani, 
        topic 12 (level=2, total_words=3430, documents=39): patent, softwar, network, would, law, 
        topic 29 (level=2, total_words=3903, documents=22): game, play, titl, halo, time, 
    topic 13 (level=1, total_words=1642, documents=45): blog, softwar, patent, compani, releas, 
        topic 14 (level=2, total_words=895, documents=11): award, best, game, gamer, prize, 
        topic 15 (level=2, total_words=1514, documents=18): game, consol, soni, nintendo, sale, 
        topic 20 (level=2, total_words=1006, documents=11): sky, offer, viewer, channel, programm, 
        topic 35 (level=2, total_words=624, documents=5): lift, sport, hunt, shoot, record, 
    topic 27 (level=1, total_words=3702, documents=51): secur, inform, microsoft, user, use, 
        topic 28 (level=2, total_words=3834, documents=44): site, attack, net, email, spam, 
        topic 34 (level=2, total_words=788, documents=7): ink, appl, elect, journalist, polit, 
joewandy commented 6 years ago

The number of topics is sampled from the nested CRP, so it isn't fixed in advance. You can tweak alpha and gamma to influence it. It might also be a good idea to read the paper to understand the model.
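To see how gamma influences the number of children, here's a tiny Chinese Restaurant Process simulation (just an illustration, not this repo's code): each "table" corresponds to a child topic, and a larger gamma makes new tables more likely.

```python
import numpy as np

def crp_tables(n_customers, gamma, rng):
    """Simulate a Chinese Restaurant Process and return the table sizes.

    Customer i sits at an existing table with probability proportional to
    the number of customers already there, or opens a new table with
    probability proportional to gamma.
    """
    tables = []  # number of customers at each table
    for _ in range(n_customers):
        weights = np.array(tables + [gamma], dtype=float)
        choice = rng.choice(len(weights), p=weights / weights.sum())
        if choice == len(tables):
            tables.append(1)      # open a new table (a new child topic)
        else:
            tables[choice] += 1   # join an existing table
    return tables

rng = np.random.RandomState(0)
for gamma in (0.1, 1.0, 10.0):
    n_tables = [len(crp_tables(400, gamma, rng)) for _ in range(20)]
    print("gamma=%-5s mean tables=%.1f" % (gamma, np.mean(n_tables)))
```

With 400 documents per node (roughly the size of the bbc corpus here), the mean number of tables grows with gamma, which is why tuning gamma changes how many subtopics you see at each level.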

Inference is done using Gibbs sampling, so the results can differ depending on the starting values and whether the model has converged. You can set the same random seed to get the same result each time, but in the current code I only report the last posterior sample. For better results, you could try averaging over the posterior samples after a suitable burn-in period.
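A sketch of what that burn-in-then-average scheme might look like. The `draw_assignment_sample` function below is a hypothetical stand-in for one Gibbs sweep of the real sampler, not part of this repo:

```python
import numpy as np

def draw_assignment_sample(rng, n_docs=5, n_topics=3):
    """Hypothetical stand-in for one Gibbs sweep: returns a
    document-by-topic count matrix from the current chain state."""
    return rng.multinomial(20, [1.0 / n_topics] * n_topics, size=n_docs)

rng = np.random.RandomState(0)
burn_in, n_samples = 100, 500

running_sum = None
kept = 0
for it in range(n_samples):
    sample = draw_assignment_sample(rng)
    if it < burn_in:
        continue  # discard early samples, before the chain has mixed
    running_sum = sample if running_sum is None else running_sum + sample
    kept += 1

# posterior-mean estimate of the document-topic proportions
posterior_mean = running_sum / kept
posterior_mean = posterior_mean / posterior_mean.sum(axis=1, keepdims=True)
```

(In practice, averaging topic assignments across samples also needs care because topic labels can switch between samples.)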

reallynotabot commented 6 years ago

Do you mean the NCRPNode random_state code? I set seed=0 by passing it as an argument in the self.root_node call, but it still doesn't reproduce the results.

self.root_node = NCRPNode(self.num_levels, self.vocab, self.random_state)

The HierarchicalLDA random_state seed is already set to 0.
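For reference, here's a sketch of the full seeding I'd expect to be necessary. The repo-specific call is commented out, and whether every draw in the sampler actually goes through one RandomState is an assumption on my part:

```python
import random
import numpy as np

seed = 0
random.seed(seed)     # covers anything using Python's random module
np.random.seed(seed)  # covers anything using numpy's global RNG

# Share a single seeded RandomState so the sampler and every NCRPNode
# draw from the same stream.  (Passing the seed through the constructor
# like this is assumed, not a confirmed signature.)
rng = np.random.RandomState(seed)
# hlda = HierarchicalLDA(new_corpus, vocab, alpha=alpha, gamma=gamma,
#                        eta=eta, num_levels=num_levels, seed=seed)
```

If any code path draws from a different, unseeded generator, the run will still differ between calls even with all of the above in place.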