vi3k6i5 / GuidedLDA

semi supervised guided topic model with custom guidedLDA
Mozilla Public License 2.0
497 stars 109 forks

Seed co-presence per document. #9

Open gam-ba opened 6 years ago

gam-ba commented 6 years ago

Hello, Vikash.

To begin with, thanks for this excellent work. GuidedLDA is a really helpful and sharp tool for unsupervised "label propagation".

I'm not sure if it's really an issue, but I was wondering whether there was any way of weighting seed-term co-presence in documents. I'm working on a rather small corpus (~60,000 short comments from a change.org petition) where most of the comments mix at least two of the seeded topics.

However, when fitting the GuidedLDA model, it seems to assign the topic based on the first seed appearing in the document. This is not a problem per se, since we can retrieve the assignment values per topic per comment...

But here's the thing: the algorithm labels the comment with the first seed's topic at a value of 0.9, when I would expect a much weaker assignment due to the co-presence of seed terms.

Is there any way to consider this?
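To make "co-presence" concrete, here is a minimal sketch that lists which seeded topics have at least one seed word in each comment. The seed lists, the example comments, and the helper name `seeded_topics_present` are all hypothetical and not part of GuidedLDA; tokenization is a plain whitespace split.

```python
# Hypothetical seed lists; replace with your own seeded topics.
seed_topic_list = [
    ["tax", "budget", "money"],        # topic 0
    ["school", "teacher", "student"],  # topic 1
    ["hospital", "doctor", "nurse"],   # topic 2
]


def seeded_topics_present(tokens, seed_topic_list):
    """Return the sorted ids of seeded topics with at least one seed word in tokens."""
    token_set = set(tokens)
    return sorted(
        topic_id
        for topic_id, seeds in enumerate(seed_topic_list)
        if token_set.intersection(seeds)
    )


# Hypothetical comments, for illustration only.
comments = [
    "the budget for every school is too low",
    "more money for the hospital",
    "please sign this petition",
]

# One list of topic ids per comment; more than one id means seeds co-occur.
copresence = [
    seeded_topics_present(comment.lower().split(), seed_topic_list)
    for comment in comments
]
```

Comments whose list holds two or more topic ids are the ones where a single dominant assignment would be surprising, so they are a natural subset to inspect in the fitted document-topic weights.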

I'm thinking of something like the doc_topic_prior parameter, which scikit-learn's LDA implementation uses for LDA's alpha parameter.
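For reference, the sketch below shows why alpha matters for short comments. It uses the standard smoothed estimate from collapsed Gibbs LDA, (n_dk + alpha) / (n_d + K * alpha), applied to hypothetical token counts. Whether GuidedLDA exposes alpha in the same way is an assumption that should be checked against its source.

```python
def smoothed_doc_topic(topic_counts, alpha):
    """Smoothed document-topic estimate: (n_dk + alpha) / (n_d + K * alpha)."""
    total = sum(topic_counts) + alpha * len(topic_counts)
    return [(count + alpha) / total for count in topic_counts]


# Hypothetical 10-token comment: 9 tokens assigned to topic 0, 1 to topic 1.
counts = [9, 1, 0]

sharp = smoothed_doc_topic(counts, alpha=0.01)  # topic 0 weight is about 0.90
flat = smoothed_doc_topic(counts, alpha=1.0)    # topic 0 weight is about 0.77
```

A larger alpha pulls short documents toward a more even mixture. It does not change which tokens the sampler assigns to which topic, so it softens the reported weights without addressing how seeds drive the assignments.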

Again, thank you very much!

Guido

vi3k6i5 commented 6 years ago

@gam-ba What is the seed_confidence value you are using at the fit step?

model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)

gam-ba commented 6 years ago

I've been using values ranging from 0.2 down to 0.01. At least for my corpus, the lower the value, the better the result. That's expected, right?
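For anyone following along, the seed_topics argument is, as far as I understand from the project README, a dict mapping word id to topic id. Below is a minimal sketch of building it and of the range of seed_confidence values mentioned above. The vocabulary and seed lists are hypothetical, and the fit call is left as a comment because it needs a fitted document-term matrix X.

```python
# Hypothetical vocabulary and seed lists, for illustration only.
vocab = ("budget", "doctor", "hospital", "money", "school", "teacher")
word2id = {word: idx for idx, word in enumerate(vocab)}

seed_topic_list = [
    ["budget", "money"],               # topic 0
    ["school", "teacher"],             # topic 1
    ["hospital", "doctor", "clinic"],  # topic 2; "clinic" is not in vocab
]

# Map each seed word's id to its topic id, skipping out-of-vocabulary seeds.
seed_topics = {}
for topic_id, seeds in enumerate(seed_topic_list):
    for word in seeds:
        if word in word2id:
            seed_topics[word2id[word]] = topic_id

# Values in the range discussed in this thread, from highest to lowest.
seed_confidences = [0.2, 0.15, 0.1, 0.05, 0.01]

# With a model and document-term matrix X in hand, each value would be tried as:
# for confidence in seed_confidences:
#     model.fit(X, seed_topics=seed_topics, seed_confidence=confidence)
```

Because the mapping is a plain dict keyed by word id, a word listed under two topics keeps only the last topic it was assigned to.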

vi3k6i5 commented 6 years ago

If you are getting better results with lower values of seed_confidence, then you should try fitting without seeding as well.

Try the other fit method and see how that works for you.

model.fit(X)
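One rough way to judge the unseeded run is to check whether the topics it finds already surface the seed words on their own. The sketch below counts how many of each topic's top words are seed words. The helper names and the weights are hypothetical; the matrix stands in for what `model.topic_word_` would hold after fitting.

```python
def top_words(topic_word_row, vocab, n=3):
    """Return the n highest-weighted words for one topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: topic_word_row[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


def seed_overlap(topic_word, vocab, seeds, n=3):
    """For each topic, count how many of its top-n words are seed words."""
    seed_set = set(seeds)
    return [len(seed_set.intersection(top_words(row, vocab, n))) for row in topic_word]


vocab = ("budget", "doctor", "hospital", "money", "school", "teacher")

# Hypothetical topic-word weights standing in for an unseeded fit's output.
topic_word = [
    [0.40, 0.02, 0.03, 0.35, 0.10, 0.10],
    [0.05, 0.05, 0.05, 0.05, 0.40, 0.40],
]

# Overlap of each topic's top-2 words with one hypothetical seed list.
overlap = seed_overlap(topic_word, vocab, seeds=["budget", "money"], n=2)
```

If some unseeded topic already covers a seed list well, seeding may add little for that topic; if no topic does, seeding is more likely to be doing useful work.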

Let me know how that goes, then we can decide how to handle seeding (or if it's even required) :)

PS: Email me if that's ok with you.

nickkimer commented 5 years ago

@gam-ba @vi3k6i5 Did you have any success in finding a solution to this question? I have come across the same issue you described and would appreciate any insight you could provide. Thanks!