vi3k6i5 / GuidedLDA

semi supervised guided topic model with custom guidedLDA
Mozilla Public License 2.0

Seed Confidence #34

Open browningbarrett opened 5 years ago

browningbarrett commented 5 years ago

Hello, could you explain a bit more about the way the seed_confidence parameter works?

I've been measuring convergence on a large corpus (public company earnings calls) by ranking likelihood and assigning points to topics where the seeded words are more likely to be in their seeded topic. As I tested different seed_confidence values, I found that lower values returned better convergence scores, which isn't what I expected.

Here's where the seed_confidence parameter is implemented:

    if w in seed_topics and random.random() < seed_confidence:
        z_new = seed_topics[w]
    else:
        z_new = i % n_topics

If I understand this correctly, then a seed_confidence value of 1 should assign seed words to the seeded topic every time, and a value of 0 would make every seed word randomly assigned. So am I getting better convergence with no seeding? Or do I not understand how the seed_confidence parameter works?
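
One way to check this reading is to run the quoted branch in isolation. The sketch below is not GuidedLDA's code: it assumes the branch runs once per token with `i` as the token index, and the helper names, vocabulary size, and seed map are hypothetical. It measures how often a seed word ends up in its seeded topic for a few seed_confidence values.

```python
import random


def initial_assignments(word_ids, seed_topics, seed_confidence, n_topics, rng):
    """Apply the quoted branch once per token (assumption: i is the token index)."""
    z = []
    for i, w in enumerate(word_ids):
        if w in seed_topics and rng.random() < seed_confidence:
            z.append(seed_topics[w])   # seed word sent to its seeded topic
        else:
            z.append(i % n_topics)     # fallback: round-robin over token positions
    return z


def seeded_fraction(word_ids, z, seed_topics):
    """Share of seed-word tokens whose assignment equals their seeded topic."""
    hits = [z_i == seed_topics[w] for w, z_i in zip(word_ids, z) if w in seed_topics]
    return sum(hits) / len(hits)


rng = random.Random(0)
n_topics = 5
seed_topics = {0: 2, 1: 4}                             # hypothetical word id -> topic id
word_ids = [rng.randrange(50) for _ in range(20000)]   # hypothetical token stream

for conf in (0.0, 0.5, 1.0):
    z = initial_assignments(word_ids, seed_topics, conf, n_topics, rng)
    print(conf, round(seeded_fraction(word_ids, z, seed_topics), 3))
```

Under these assumptions a value of 1 puts every seed token in its seeded topic, since `random.random()` is always below 1. At 0 the fallback `i % n_topics` is a deterministic round-robin rather than a random draw, so a seed token still lands in its seeded topic roughly 1/n_topics of the time. This only exercises the quoted branch; it says nothing about how later sampling steps treat the seeds.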